I Introduction
Spatiotemporal data is ubiquitous in many domains, including traffic estimation/prediction, climate modeling, neuroscience, and earth sciences [cressie2015statistics, banerjee2014hierarchical, atluri2018spatio]. In general, spatiotemporal data characterizes how a particular variable or a group of variables varies in both space and time. For example, sea surface temperature data can help climate researchers understand the effect of climate change and identify abnormal weather patterns; traffic speed data collected from a network of sensors can reveal the evolution of traffic congestion on a highway network over time. A unique property of spatiotemporal data is that the variables often show strong dependencies within and across both the spatial and the temporal dimensions. This property makes the data instances structurally correlated with each other. How to effectively model these dependencies and relations is a critical challenge in spatiotemporal data analysis.

This paper focuses on the spatiotemporal kriging problem, the goal of which is to perform signal interpolation for unsampled locations given the signals from sampled locations during the same period. The most prevalent method for kriging is Gaussian process (GP) regression [williams2006gaussian], which captures the correlation among all data points using a carefully specified kernel/covariance structure. For spatiotemporal problems, one can design an appropriate kernel structure to capture the correlations both within and across the spatial and temporal dimensions (see, e.g., [luttinen2012efficient]
). However, GP is computationally very expensive: the time complexity of GP models scales cubically with the number of data points, and learning accurate hyperparameters is very challenging for complex kernel structures due to computational issues such as local optima. An alternative to GP for large-scale datasets is low-rank matrix/tensor completion
[bahadori2014fast, yu2016temporal, takeuchi2017autoregressive]. These models essentially impose a low-rank assumption to capture the global consistency in the data, and further include regularization structures to ensure local consistency (e.g., an autoregressive regularizer for temporal smoothness [yu2016temporal] and a Laplacian regularizer for spatial smoothness [bahadori2014fast]). These models provide a powerful solution for missing data imputation when data is missing at random. However, matrix/tensor completion is essentially transductive
[zhang2019inductive]: for a new spatial location, we have to retrain the full model even in response to only minor changes in the spatial and temporal dimensions [wu2020inductive]. In addition, spatiotemporal kriging corresponds to a challenging whole-row/column missing scenario in a spatiotemporal matrix, and thus model accuracy relies heavily on the specification of the local spatial Laplacian regularizer.

Recently, deep learning models have opened new doors to spatiotemporal data analysis. In general, spatiotemporal data generated from a sensor network can be naturally modeled as time-varying signals on a spatial graph structure. Numerous neural network architectures for learning on graphs
[KipfW17, defferrard2016convolutional, xu2018powerful] have been proposed in recent years, and graph neural networks (GNNs) have been widely applied to modeling spatiotemporal data. Although the spatiotemporal kriging problem can be considered a "forecasting" problem along the spatial dimension, most existing GNN-based studies only focus on the multivariate time series forecasting problem on a fixed spatial graph [yu2018spatio, li2018diffusion, wu2019graph, dai2020hybrid], which cannot be generalized to model unseen locations in spatially varying graphs. Two recent studies have developed GNN-based models for spatiotemporal kriging: Kriging Convolutional Networks (KCN) [appleby2020kriging] and Inductive Graph Neural Networks for Kriging (IGNNK) [wu2020inductive]. Both studies suggest that GNNs are promising tools for the real-time spatiotemporal kriging problem; however, two challenging issues remain when applying the models to diverse real-world applications. The first issue is that all GNNs use a predefined rule to transform spatial information into an adjacency matrix (e.g., using a Gaussian kernel as in [wu2020inductive, appleby2020kriging]). We would argue that defining the adjacency matrix in GNNs is as important as defining a spatial kernel in GP, as the predefined rules for constructing the adjacency matrix determine how GNNs transform and aggregate information. Yet, the complex spatial dependencies make it difficult to specify an aggregation function of GNNs that can capture sufficient information about the target datasets [corso2020principal]. To achieve high accuracy, the models require extensive fine-tuning of the hyperparameters that control the predefined rule and the type of aggregation function. Second, both existing GNN-based kriging models do not fully utilize temporal dependencies. For example, KCN [appleby2020kriging] treats observation time as an additional feature for GNNs, and thus neglects the temporal dependencies in spatiotemporal datasets.
IGNNK, on the other hand, considers observations over a particular time period as features [wu2020inductive]; as a result, it cannot handle inputs with different sizes of temporal windows.

To address the aforementioned limitations, we propose a general framework, Spatial Aggregation and Temporal Convolution Networks (SATCN), for spatiotemporal kriging. We utilize temporal convolutional networks (TCN) to model the temporal dependencies and make our framework flexible on both the spatial and temporal dimensions. To address the tuning issue in modeling spatial dependencies, we propose a novel Spatial Aggregation Network (SAN) structure inspired by Principal Neighborhood Aggregation (PNA), a recent aggregation framework proposed by [corso2020principal]. Instead of performing aggregation based on a predefined adjacency matrix, each node in SAN aggregates the information of its nearest neighbors together with the corresponding distance information. In addition, SAN allows for multiple different aggregators in a single layer. To provide SAN with generalization power for kriging tasks, we prevent missing/unobserved nodes from supplying information in aggregation by masking. Finally, we train SATCN with the objective of reconstructing the full signals from all nodes. Our experiments on large-scale spatiotemporal datasets show that SATCN outperforms its deep learning and conventional counterparts, suggesting that the proposed SATCN framework can better characterize spatiotemporal dependencies in diverse types of data. To summarize, the primary contributions of this paper are as follows:

We design a node masking strategy for real-time kriging tasks. This universal masking strategy naturally adapts to all GNNs for spatiotemporal modeling with missing/corrupted data.

We leverage temporal dependencies through temporal convolutional networks (TCN). With an inductive training strategy, our model can cope with data of diverse sizes along the spatial and temporal dimensions.

We propose a spatial aggregation network (SAN), an architecture combining multiple message aggregators with degree scalers, to capture the complex spatial dependencies in large datasets.
The remainder of this paper is organized as follows: Section II gives a brief review of the related work, followed by Section III presenting the methodology. Then, Section IV details the empirical studies involving three different real-world datasets. Lastly, Section V concludes this paper and offers some directions for future work.
II Related Work
II-A Graph Neural Networks
Graph neural networks (GNNs) are designed to aggregate information over graph structures. Based on the information gathering mechanism, GNNs can be categorized into spectral approaches and spatial approaches. The essential operator in spectral GNNs is the graph Laplacian, which defines graph convolutions as linear operators diagonalized in the eigenbasis of the graph Laplacian [mallat1999wavelet]. The generalized spectral GNN was first introduced in [bruna2014spectral]. Then, [defferrard2016convolutional] proposed to use Chebyshev polynomial filters on the eigenvalues to approximate the convolutional filters. Most of the state-of-the-art deep learning models for spatiotemporal data [yu2018spatio, li2018diffusion, wu2019graph] are based on the concept of ChebNet. In [wu2020inductive], the authors trained a spectral GNN for the inductive kriging task, demonstrating the effectiveness of GNNs for modeling spatial dependencies. However, in all spectral-based approaches, the learned networks are dependent on the Laplacian matrix. This brings two drawbacks: 1) they are computationally expensive, as the information aggregation has to be performed over the whole graph; 2) a GNN trained on a specific structure cannot be directly generalized to a graph with a different structure.

In contrast, spatial GNN approaches directly perform information aggregation over spatially close neighbors. In general, the commonalities between representative spatial GNNs can be abstracted as the following message passing mechanism [gilmer2017neural]:
$$h_v^{(l)} = \gamma\Big(h_v^{(l-1)},\ \underset{u \in \mathcal{N}(v)}{\square}\ \phi\big(h_v^{(l-1)}, h_u^{(l-1)}\big)\Big), \qquad (1)$$

where $h_v^{(l)}$ is the feature vector of node $v$ at the $l$-th layer, $\mathcal{N}(v)$ is the set of nodes adjacent to $v$, $\square$ is a permutation-invariant aggregation (e.g., sum or mean), and $\phi$ and $\gamma$ are parameterized functions. Here, $\phi(h_v^{(l-1)}, h_u^{(l-1)})$ is the message that node $u$ passes to node $v$. Each node aggregates the messages from its neighboring nodes to compute its next representation. Spatial approaches have produced state-of-the-art results on several tasks [dwivedi2020benchmarkgnns], and demonstrate the inductive power to generalize the message passing mechanism to unseen nodes or even entirely new (sub)graphs [hamilton2017inductive, velivckovic2018graph]. In [appleby2020kriging], spatial GNNs were applied to the kriging task on a static graph. However, this work did not fully consider the temporal dependencies.

II-B Deep Learning for Missing Data Imputation
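To make the message passing scheme of Eq. (1) concrete, the following NumPy sketch implements one generic layer. The names `phi` and `gamma` mirror the parameterized functions above; the sum aggregation and the toy averaging-style update in the example are illustrative assumptions, not a specific model from the literature.

```python
import numpy as np

def message_passing_layer(H, A, phi, gamma):
    """One generic message-passing step in the spirit of Eq. (1).

    H     : (n, d) node features at layer l-1
    A     : (n, n) adjacency; A[i, j] > 0 iff node j sends messages to node i
    phi   : message function applied to each neighbor's features
    gamma : update function combining a node's features with the
            aggregated (here: summed) neighbor messages
    """
    n, d = H.shape
    H_next = np.empty_like(H)
    for i in range(n):
        neighbors = np.nonzero(A[i])[0]                  # N(i)
        msgs = [phi(H[j]) for j in neighbors]
        agg = np.sum(msgs, axis=0) if msgs else np.zeros(d)
        H_next[i] = gamma(H[i], agg)
    return H_next

# toy example: identity messages, averaging-style update
H = np.array([[1.0], [2.0], [3.0]])
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 1, 0]], dtype=float)
out = message_passing_layer(H, A, phi=lambda h: h,
                            gamma=lambda h, m: 0.5 * (h + m))
```

Real implementations vectorize this loop (e.g., as a sparse matrix product), but the per-node form makes the correspondence to Eq. (1) explicit.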
Kriging applications are closely related to missing data imputation problems. The former is about estimating values for unsampled locations, while the latter is a process of filling in missing values for sampled locations. More recently, there has been a surge of interest in applying deep learning techniques to missing data imputation problems. [smieja2018processing] processes the missing values of neural network inputs by simply replacing a neuron's response in the first hidden layer with its expected value, and also provides a mathematical proof of the effectiveness of this strategy. This finding serves as a theoretical foundation for masking the missing values in the inputs with a certain value. Several studies have developed various models, including recurrent neural networks
[cui2020stacked], generative adversarial networks
[yoon2018gain], and variational autoencoders
[nazabal2020handling], to fill in the "masked" inputs. These approaches ignore the local dependencies (e.g., similarities between adjacent locations). To fill this gap, some approaches have used GNNs to capture the local dependencies [cui2020graph, spinelli2020missing], showing that GNNs can effectively capture local dependencies in the presence of missing values. However, GNN-based missing data imputation methods are limited to randomly missing cases within a set of sampled sensors. They cannot be generalized to kriging tasks, where a sensor set unseen during the training phase is present during the inference phase.

III Methodology
III-A Problem Description
Our work focuses on the same real-time spatiotemporal kriging task as in [wu2020inductive] (see Figure 1). Suppose we have data from $n$ sensors during a historical period of $t$ time points (see the training sensors in Figure 1). Note that we use three terms—sensor, location and node—interchangeably throughout this paper. We denote the available training data by a multivariate time series matrix $X \in \mathbb{R}^{n \times t}$. Our goal is to infer a kriging model $F_{\Theta}$ based on $X$, with $\Theta$ being the set of model parameters, which encode the spatiotemporal dependencies within the training data. Figure 1 also shows a test data example. It should be noted that in our setting the kriging tasks could vary over time (or for each test sample) given sensor availability: some sensors might not be functional, some sensors may retire, and new sensors can also be introduced. Moreover, the number of time points of interest can also vary over time. Taking the test sample in Figure 1 as an example, our spatiotemporal kriging task is to estimate the signals at the unknown locations/sensors marked with black stars (i.e., sensors {10, 11}) based on the observed data (in green) from the other sensors (note that a new sensor 9 emerged during kriging). Missing data may also exist at the sampled locations. Obviously, the learned spatiotemporal dependencies in model $F_{\Theta}$ should be generalizable to unseen spatiotemporal points, which makes our problem more challenging than missing data imputation. Given the variation in data availability, we also prefer a model that is invariant to the size of the input matrix.
III-B Spatial Aggregation and Temporal Convolution Networks (SATCN)
We introduce SATCN—a novel deep learning architecture—as the spatiotemporal kriging model $F_{\Theta}$, where $\Theta$ is the set of parameters. SATCN generates estimates for the unknown locations based on the input $X$ and additional information about the underlying graph. The proposed SATCN consists of two basic building blocks: spatial aggregation networks (SANs) and temporal convolutional networks (TCNs), in alternating order (see Figure 2). If the input has temporal length $t$, the output has temporal length $t - \Delta$, where $\Delta$ is the size reduction caused by the TCN layers (e.g., two TCN layers with kernel length $l$ reduce the temporal length by $\Delta = 2(l-1)$). The spatial aggregation layer is built on a special graph neural network—principal neighborhood aggregation (PNA) [corso2020principal]. Note that the input of SATCN also contains the unknown sensors on which we will perform kriging (see the masked spatial aggregation layer in Figure 2); however, we propose a masking strategy that forbids the unknown locations from sending messages to their neighbors. We next introduce the details of SATCN. We summarize the important notation related to the SATCN definition and the masking strategy in Table I.
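As a small sanity check on the temporal size reduction described above, the helper below (a hypothetical utility written for illustration, not part of SATCN itself) computes the output length of a stack of no-padding temporal convolutions:

```python
def tcn_output_length(t_in, kernel_lengths):
    """Temporal length after a stack of no-padding 1D convolutions.

    Each convolution with kernel length l removes l - 1 time steps,
    and the reductions accumulate across layers.
    """
    delta = sum(l - 1 for l in kernel_lengths)
    if t_in <= delta:
        raise ValueError("input window too short for this TCN stack")
    return t_in - delta

# e.g., two layers with kernel length 3 remove 2 + 2 = 4 steps
assert tcn_output_length(12, [3, 3]) == 8
```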
Symbol  Description
$n_m$  number of simulated unknown locations
$k$  number of nearest neighbors for message passing
$t$  the time length of the kriging target
$\Delta$  the time length reduction caused by temporal convolution
$A^{(0)}_s$  the masked adjacency matrix of the $s$-th sample for the first spatial aggregation layer
$A_s$  the adjacency matrix of the $s$-th sample for the higher spatial aggregation layers
$B$  batch size for sampling and training
$l_j$  the kernel length of the $j$-th temporal convolution layer
III-C Training Sample and Adjacency Matrix Construction
Our first step is to prepare training samples from the historical dataset $X$. The random sampling procedure is given in Algorithm 1. The key idea is to randomly simulate $n_m$ unknown locations among the observed locations. We also show an example of training sample generation in Figure 3. As SATCN uses GNNs to capture the spatial dependencies, we construct a graph over the locations/sensors for training. In such a graph, the unknown locations cannot pass messages to their neighbors. We define the adjacency matrix of this graph according to the following rule:
$$A^{(0)}_{ij} = \begin{cases} d_{ij}/d_{\max}, & \text{if } j \in \mathcal{N}_k(i) \text{ and } j \notin \mathcal{O}_u, \\ 0, & \text{otherwise,} \end{cases} \qquad (2)$$

where $\mathcal{O}_u$ is the set of unknown sensors with $|\mathcal{O}_u| = n_m$, $\mathcal{N}_k(i)$ is the set of $k$ nearest neighbors of the $i$-th sensor within the known set, $d_{ij}$ is the distance between sensors $i$ and $j$, and $d_{\max}$ is the maximum distance between any two sensors in the training data. In some applications, the missing locations evolve with time (e.g., some observations from satellites are obscured by clouds). To deal with those cases, the adjacency matrix should be time-evolving; that is, the locations with missing data are always forbidden from sending messages to their neighbors. We also set the values at the masked locations to zero, ensuring the model has no access to unknown observations; the particular value has no impact on our model, as the masked locations are forbidden from sending messages. Considering that SATCN contains multiple spatial aggregation layers, we expect the unknown sensors to also generate meaningful information after the aggregation in the first layer. Therefore, we define the adjacency matrix of the subsequent SAN layers as:
$$A_{ij} = \begin{cases} d_{ij}/d_{\max}, & \text{if } j \in \mathcal{N}_k(i), \\ 0, & \text{otherwise,} \end{cases} \qquad (3)$$

where $\mathcal{N}_k(i)$ is the set of $k$ nearest neighbors of the $i$-th sensor among all sensors.
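The two adjacency construction rules can be sketched as follows. The normalized-distance edge weights and the tie-breaking among equidistant neighbors are assumptions made for illustration:

```python
import numpy as np

def build_adjacency(dist, k, unknown=frozenset()):
    """k-nearest-neighbor adjacency in the spirit of Eqs. (2) and (3).

    dist    : (n, n) pairwise distance matrix
    unknown : indices of masked sensors that must not send messages;
              pass an empty set to recover the unmasked rule of Eq. (3)
    A[i, j] stores the normalized distance d_ij / d_max when j is one of
    the k nearest allowed senders for i, and 0 otherwise.
    """
    n = dist.shape[0]
    d_max = dist.max()
    A = np.zeros((n, n))
    for i in range(n):
        senders = [j for j in range(n) if j != i and j not in unknown]
        nearest = sorted(senders, key=lambda j: dist[i, j])[:k]
        for j in nearest:
            A[i, j] = dist[i, j] / d_max
    return A

# four sensors on a line at positions 0, 1, 2, 4
pts = np.array([0.0, 1.0, 2.0, 4.0])
dist = np.abs(pts[:, None] - pts[None, :])
A0 = build_adjacency(dist, k=1, unknown={1})  # masked rule: sensor 1 never sends
A1 = build_adjacency(dist, k=1)               # unmasked rule
```

Note that a masked sensor still has a row in `A0` (it receives messages), but its column is never selected, so it can never send.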
To train the model, we use all observed spatiotemporal points in $X$ as labels. We use a mean absolute error (MAE) term to measure the reconstruction loss between the output $\hat{X}$ and all observed true labels. The learning objective of SATCN is

$$\mathcal{L}(\Theta) = \frac{1}{|\Omega|} \sum_{(i,\tau) \in \Omega} \big| \hat{x}_{i,\tau} - x_{i,\tau} \big|, \qquad (4)$$

where $\Omega$ is the set of observed entries in $X$.
III-D Spatial Aggregation Network (SAN)
Based on the constructed adjacency matrices $A^{(0)}$ and $A$, the goal of performing spatial aggregation is to capture the distribution of messages that a sensor receives from its neighbors. Most existing approaches use a single aggregation method, with sum and mean being the most popular choices in spatiotemporal applications (see, e.g., [KipfW17, xu2018powerful]). However, a single aggregation method has two limitations: 1) in many cases, a single aggregation fails to recognize small differences between different messages [corso2020principal, dehmamy2019understanding]; and 2) the true distances to neighboring locations have a large impact on kriging, but existing GNNs do not fully consider the effect of distances. Therefore, for the spatiotemporal kriging task, it is necessary to introduce distance-sensitive aggregators in SATCN. Inspired by the recent work on principal neighborhood aggregation [corso2020principal], we leverage multiple aggregators on nodes and edges (links) to capture the spatial dependencies. Specifically, we introduce the following aggregators to SATCN:

Mean: $\mu_i = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} h_j^{(l)}$,

Weighted mean: $\mu_i^{w} = \sum_{j \in \mathcal{N}(i)} \frac{d_{ij}}{\sum_{j' \in \mathcal{N}(i)} d_{ij'}}\, h_j^{(l)}$,

Softmax: $s_i = \sum_{j \in \mathcal{N}(i)} \frac{\exp(h_j^{(l)})}{\sum_{j' \in \mathcal{N}(i)} \exp(h_{j'}^{(l)})} \odot h_j^{(l)}$,

Softmin: $s_i^{-} = \sum_{j \in \mathcal{N}(i)} \frac{\exp(-h_j^{(l)})}{\sum_{j' \in \mathcal{N}(i)} \exp(-h_{j'}^{(l)})} \odot h_j^{(l)}$,

Standard deviation: $\sigma_i = \sqrt{\frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \big(h_j^{(l)} - \mu_i\big)^2 + \epsilon}$,

Mean distance: $\delta_i = \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} d_{ij}$;

Standard distance deviation: $\sigma_i^{d} = \sqrt{\frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} \big(d_{ij} - \delta_i\big)^2 + \epsilon}$,

where $h_j^{(l)}$ denotes the feature vector of node $j$ in the $l$-th layer, $d_{ij}$ denotes the distance from $j$ to $i$ given by the $l$-th layer adjacency matrix, $j \in \mathcal{N}(i)$ means that there exists a message passing flow from $j$ to $i$, and $\epsilon$ in the standard deviation is a small positive value to ensure $\sigma_i$ is differentiable.
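A minimal NumPy sketch of these aggregators for a single node is given below. The distance-proportional weighting used for the weighted mean is an assumption for illustration, as is the broadcast of the two scalar distance aggregators across feature channels:

```python
import numpy as np

def aggregate(h_neighbors, d_neighbors, eps=1e-6):
    """Compute the seven SATCN-style aggregators for a single node.

    h_neighbors : (k, dim) features of the k neighbors that send messages
    d_neighbors : (k,) distances to those neighbors
    Returns a (7, dim) array; the two distance aggregators are scalars
    broadcast across feature channels.
    """
    h = np.asarray(h_neighbors, dtype=float)
    d = np.asarray(d_neighbors, dtype=float)
    dim = h.shape[1]
    w = d / d.sum()                                   # distance-proportional weights (assumption)
    sm = np.exp(h) / np.exp(h).sum(axis=0)            # per-channel softmax weights
    sn = np.exp(-h) / np.exp(-h).sum(axis=0)          # per-channel softmin weights
    mu = h.mean(axis=0)
    delta = d.mean()
    return np.stack([
        mu,                                           # mean
        (w[:, None] * h).sum(axis=0),                 # weighted mean
        (sm * h).sum(axis=0),                         # softmax (soft proxy for max)
        (sn * h).sum(axis=0),                         # softmin (soft proxy for min)
        np.sqrt(((h - mu) ** 2).mean(axis=0) + eps),  # standard deviation of messages
        np.full(dim, delta),                          # mean distance
        np.full(dim, np.sqrt(((d - delta) ** 2).mean() + eps)),  # std of distances
    ])

out = aggregate([[0.0], [2.0]], [1.0, 3.0])
```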
Mean aggregation treats all neighbors of a node equally without considering the effect of distance. Weighted mean aggregation takes distance weights into consideration. Softmax aggregation and softmin aggregation give indirect measures of the maximum and minimum values of the received messages, which offer more generalization capability for GNNs [velivckovic2019neural]. The aggregations above are essentially the same as in [corso2020principal], capturing the variation of the features that one node receives from its neighbors. In addition, we include mean distance aggregation and standard distance deviation aggregation, which characterize the distribution of spatial distances from a certain node to its neighboring nodes. To make GNNs better capture the effect of spatial distance, we suggest further adding logarithmic (amplification) and inverse logarithmic (attenuation) scalers [xu2018powerful, corso2020principal]:
$$S_{\mathrm{amp}}(i) = \frac{\log(D_i + 1)}{\delta}, \qquad S_{\mathrm{atten}}(i) = \frac{\delta}{\log(D_i + 1)}, \qquad (5)$$

where $D_i$ is the weighted degree of node $i$, and $\delta$ equals the average log-degree $\frac{1}{n} \sum_{i=1}^{n} \log(D_i + 1)$ computed on the adjacency matrix constructed from all training locations using the rule in Eq. (3).
We then combine the aggregators and scalers using the tensor product $\otimes$:

$$\bigoplus = \begin{bmatrix} I \\ S_{\mathrm{amp}} \\ S_{\mathrm{atten}} \end{bmatrix} \otimes \begin{bmatrix} \mu \\ \mu^{w} \\ s \\ s^{-} \\ \sigma \\ \delta \\ \sigma^{d} \end{bmatrix}, \qquad (6)$$

where $I$ is an identity matrix (i.e., no scaling), and the tensor product multiplies every scaler with every aggregator and stacks the results on top of each other. We add weights and an activation function to $\bigoplus$, obtaining SAN:

$$H^{(l)}_{\tau} = \sigma\big( \bigoplus H^{(l-1)}_{\tau} W^{(l)} + b^{(l)} \big), \qquad (7)$$

where $H^{(l)}_{\tau}$ is the $l$-th layer output at the $\tau$-th time point, $W^{(l)} \in \mathbb{R}^{(s \cdot a \cdot d_{l-1}) \times d_l}$ is the $l$-th layer weight matrix, $b^{(l)}$ is the bias, $s$ is the number of scalers, $a$ is the number of aggregators, and $\sigma(\cdot)$ is the activation function. For each time point $\tau$, the same spatial aggregation operation with shared weights $W^{(l)}$ and $b^{(l)}$ is applied in parallel. By this means, SAN generalizes to 3D spatiotemporal variables: for example, an input of size $n \times t \times c$ results in an $n \times t \times d_1$ tensor after the first SAN layer. We illustrate the operation of the masked SAN (the first layer of SATCN) in Figure 4.
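The combination of aggregators and scalers can be illustrated with the following simplified layer. For brevity it uses only the mean aggregator together with the three scalers (identity, log amplification, inverse-log attenuation); the weight values and the normalizer `delta` in the example are toy values, not learned parameters:

```python
import numpy as np

def san_layer(H, A, W, b, delta):
    """One simplified spatial aggregation layer in the spirit of Eqs. (6)-(7).

    Only the mean aggregator is used here; the full model stacks all
    aggregators identically. Each aggregated feature is repeated under
    three scalers, concatenated, and passed through a linear map + ReLU.

    H : (n, d_in) node features       A : (n, n) weighted adjacency
    W : (3 * d_in, d_out) weights     b : (d_out,) bias
    delta : average log-degree of the training graph (global normalizer)
    """
    deg = A.sum(axis=1)                                # weighted in-degree D_i
    logdeg = np.log(deg + 1.0)
    amp = logdeg / delta                               # amplification scaler
    att = np.where(logdeg > 0, delta / np.maximum(logdeg, 1e-12), 0.0)
    row_sum = np.where(deg > 0, deg, 1.0)              # avoid division by zero
    M = (A @ H) / row_sum[:, None]                     # mean aggregation
    Z = np.concatenate([M, amp[:, None] * M, att[:, None] * M], axis=1)
    return np.maximum(Z @ W + b, 0.0)                  # ReLU activation

H = np.array([[3.0], [6.0]])
A = np.array([[0.0, 1.0], [1.0, 0.0]])
out = san_layer(H, A, W=np.full((3, 1), 1.0 / 3.0),
                b=np.zeros(1), delta=np.log(2.0))
```

With `delta = log(2)` both scalers equal 1 on this toy graph, so the layer reduces to averaging the three identical copies of the mean-aggregated features.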
III-E Temporal Convolutional Network (TCN)
To capture temporal dependencies, we take advantage of temporal convolutional networks (TCNs). The advantages of TCNs for the kriging task include: 1) TCNs can deal with input sequences of varying length, allowing us to perform kriging for test samples with different numbers of time points; 2) TCNs are superior to recurrent neural networks (RNNs) in terms of fast training and lightweight architecture [gehring2017convolutional]. A shared gated 1D convolution is applied to each node along the temporal dimension. A TCN with kernel length $l_j$ passes messages from the $l_j - 1$ neighboring time steps to the target time point. Note that we do not use padding in our TCN, and thus a TCN with kernel length $l_j$ shortens the temporal length by $l_j - 1$. The TCN maps an input $\mathcal{H} \in \mathbb{R}^{n \times t \times c}$ to a tensor $\mathcal{H}' \in \mathbb{R}^{n \times (t - l_j + 1) \times c'}$:

$$\mathcal{H}'_{i,:,:} = \Gamma * \mathcal{H}_{i,:,:}, \qquad (8)$$

where $*$ is the shared 1D convolutional operation and $\Gamma$ is the convolution kernel. A benefit brought by the TCN is that our model does not depend on the length of the training/test time period. In our model, the data at each time point is only related to other points within a receptive frame determined by the kernel lengths. The length of the temporal window has no impact on our model, as time points beyond the receptive frame cannot pass information to the target time point.
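A gated no-padding temporal convolution in this spirit can be sketched as below. The tanh-sigmoid (GLU-style) gate is one common choice and an assumption here, as the exact gating mechanism is not spelled out above:

```python
import numpy as np

def tcn(X, theta1, theta2):
    """Gated no-padding 1D temporal convolution shared across nodes.

    X : (n, t, c) input; theta1, theta2 : (l, c, c_out) kernels of length l.
    Without padding the output keeps t - (l - 1) time steps. The
    tanh(.) * sigmoid(.) gate is a GLU-style assumption.
    """
    n, t, c = X.shape
    l, _, c_out = theta1.shape
    t_out = t - (l - 1)
    P = np.zeros((n, t_out, c_out))
    Q = np.zeros((n, t_out, c_out))
    for tau in range(t_out):
        window = X[:, tau:tau + l, :]                    # (n, l, c) receptive frame
        P[:, tau] = np.einsum('nlc,lco->no', window, theta1)
        Q[:, tau] = np.einsum('nlc,lco->no', window, theta2)
    return np.tanh(P) * (1.0 / (1.0 + np.exp(-Q)))       # gated output

X = np.random.default_rng(0).standard_normal((2, 5, 1))
out = tcn(X, np.ones((2, 1, 3)), np.zeros((2, 1, 3)))    # kernel length 2
```

With a length-2 kernel, a 5-step input shrinks to 4 steps, matching the size reduction discussed above.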
IV Experiments
In this section, we conduct experiments on several realworld spatiotemporal datasets to evaluate the performance of SATCN.
IV-A Experiment Setup
IV-A1 Dataset Descriptions.
We evaluate the performance of SATCN on three real-world datasets in diverse settings: (1) METR-LA (https://github.com/liyaguang/DCRNN) is a 5-min traffic speed dataset collected from 207 highway sensors in Los Angeles, from Mar 1, 2012 to Jun 30, 2012. (2) MODIS (https://modis.gsfc.nasa.gov/data/) consists of daily land surface temperatures measured by the Terra platform on the MODIS satellite with 3255 downsampled grids from Jan 1, 2019 to Jan 16, 2021; it was collected automatically with the MODIStsp package in R. This dataset poses a challenging task as it contains 39.6% missing data. (3) USHCN (https://www.ncdc.noaa.gov/ushcn/introduction) consists of monthly precipitation at 1218 stations in the US from 1899 to 2019. With regard to spatial distance, we compute the pairwise haversine distance matrices for MODIS and USHCN; METR-LA uses travel distance on the road network to determine the reachability among sensors. For METR-LA and USHCN, we test 4 cases: (a) 7T8S: the first 70% of time points and approximately 80% of locations/sensors are used for training, and the remaining 20% of sensors during the last 30% of time periods are used for evaluation. (b) 5T5S: the first 50% of time points and approximately 50% of locations/sensors are used for training, and the remaining 50% of sensors during the last 50% of time periods are used for evaluation. (c) 7T8S3M: we additionally add 30% missing data to the observations (80% sampled sensors) under case 7T8S. (d) 5T5S5M: we additionally add 50% missing data to the observations under case 5T5S. For all cases, 10% of the sampled sensors are used as validation data. Since the MODIS dataset itself contains missing data, we only evaluate our model on cases 7T8S and 5T5S for MODIS.
IV-A2 Baseline Models.
We choose both traditional kriging models and state-of-the-art deep learning models as our baselines. The group of traditional models includes: (1) kNN: k-nearest neighbors, which estimates the signal at an unknown sensor by averaging the values of the K nearest sensors in the network. (2) OKriging: ordinary kriging [cressie1988spatial], which corresponds to a Gaussian process regression model with a global mean parameter. We implement OKriging with the autoKrige function in the R package automap. The OKriging method only uses spatial dependencies. We tried to implement a spatiotemporal kriging baseline via the R package gstat; however, learning a proper spatiotemporal variogram is very challenging given the large size of the datasets, so we did not include a spatiotemporal kriging baseline in this work. (3) GLTL: Greedy Low-rank Tensor Learning, a transductive tensor factorization model for spatiotemporal co-kriging [bahadori2014fast]. GLTL can handle the co-kriging problem with multiple variables. We reduce GLTL to a matrix version, as our task only involves one variable for all the datasets. In addition, we compare SATCN with the following state-of-the-art deep learning approaches: (4) IGNNK: Inductive Graph Neural Network for Kriging [wu2020inductive], an inductive model that combines dynamic subgraph sampling techniques and a spectral GNN [li2018diffusion] for the spatiotemporal kriging task. We use a Gaussian kernel to construct the adjacency matrix for the GNN as in [wu2020inductive]:
$$A_{ij} = \exp\Big( -\big( d_{ij} / \sigma \big)^2 \Big), \qquad (9)$$

where $A_{ij}$ stands for the adjacency or closeness between sensors $i$ and $j$, $d_{ij}$ is the distance between $i$ and $j$, and $\sigma$ is a normalization parameter. This adjacency matrix can be considered a squared-exponential (SE) kernel (Gaussian process) with $\sigma$ as the lengthscale parameter. Different from [wu2020inductive], which chooses $\sigma$ empirically, we first build a Gaussian process regression model based on the training data from one time point, estimate the lengthscale hyperparameter, and then define $\sigma$ accordingly. We find that this procedure improves the performance of IGNNK substantially compared with [wu2020inductive]. (5) KCN-Sage: In [appleby2020kriging], several GNN structures are proposed for kriging tasks; we use KCN-Sage, based on GraphSAGE [hamilton2017inductive], as it achieves the best performance in [appleby2020kriging]. The original KCN models cannot be directly applied under the inductive setting. To adapt KCNs to our experiments, we use Eq. (3) to construct the adjacency matrices of KCN-Sage and Algorithm 1 to train the model.
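Eq. (9) is straightforward to implement. The helper below builds the Gaussian-kernel adjacency from a distance matrix, with `sigma` playing the role of the lengthscale; how `sigma` is estimated from a fitted GP is left to the procedure described above:

```python
import numpy as np

def gaussian_adjacency(dist, sigma):
    """Eq. (9): A_ij = exp(-(d_ij / sigma)^2) from a pairwise distance matrix."""
    dist = np.asarray(dist, dtype=float)
    return np.exp(-(dist / sigma) ** 2)

A = gaussian_adjacency([[0.0, 1.0], [1.0, 0.0]], sigma=1.0)
```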
IV-A3 Evaluation Metrics
We measure model performance using the following three metrics:

$$\mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \big( x_i - \hat{x}_i \big)^2 }, \qquad (10)$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \big| x_i - \hat{x}_i \big|, \qquad R^2 = 1 - \frac{\sum_{i} \big( x_i - \hat{x}_i \big)^2}{\sum_{i} \big( x_i - \bar{x} \big)^2}, \qquad (11)$$

where $x_i$ and $\hat{x}_i$ are the true value and the estimation, respectively, and $\bar{x}$ is the mean value of the data.
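Assuming the three metrics are the standard RMSE, MAE, and R² (consistent with the true value, estimation, and data mean referenced above), they can be computed directly in NumPy:

```python
import numpy as np

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2)))

def mae(y, y_hat):
    return float(np.mean(np.abs(np.asarray(y) - np.asarray(y_hat))))

def r2(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)   # uses the mean value of the data
    return float(1.0 - ss_res / ss_tot)
```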
Data  Case  SATCN  KCN-SAGE  IGNNK  GLTL  OKriging  kNN 
METR-LA  7T8S  8.35/5.40  8.52/5.61  9.08/5.82  11.04/7.55  /  11.62/6.33 
5T5S  9.83/6.38  10.29/6.54  11.22/7.12  /  /  12.79/7.04  
7T8S3M  9.07/5.72  9.22/5.99  9.93/6.45  12.48/8.47  /  11.84/6.79  
5T5S5M  10.23/6.65  10.61/6.78  11.59/7.70  /  /  12.88/7.46  
USHCN  7T8S  3.54/2.14  3.56/2.18  3.62/2.22  4.12/2.67  3.73/2.26  3.76/2.27 
5T5S  3.71/2.25  3.78/2.35  3.85/2.37  4.31/2.81  3.86/2.37  3.87/2.41  
7T8S3M  3.67/2.22  4.12/2.58  4.49/2.73  4.14/2.69  3.99/2.47  4.50/2.74  
5T5S5M  4.03/2.45  4.84/3.12  6.02/3.89  4.76/3.12  4.42/2.80  5.52/3.60  
MODIS  7T8S  1.38/0.94  1.42/0.95  /  /  1.62/1.09  1.70/1.14 
5T5S  1.54/1.01  1.58/1.05  /  /  4.89/3.02  4.90/3.18 
IV-B Overall Performance
IV-B1 Performance Comparison.
In Table II, we present the results of SATCN and all baselines on the three datasets. As the spatial relationship in METR-LA is determined by road reachability, we cannot apply OKriging, which requires locations in a geospatial coordinate system, on this dataset. As can be seen, the proposed SATCN consistently outperforms the other baseline models, providing the lowest error for almost all datasets under all cases. SATCN generally outperforms its spectral-based GNN counterpart IGNNK on those datasets. We also observe that SATCN and KCN-Sage take fewer samples to converge than the spectral approach IGNNK, indicating that spatial GNNs are more suitable for inductive tasks than spectral ones.
Another interesting finding from Table II is that SATCN performs well on the MODIS dataset and the 5T5S5M cases, which contain a high ratio of missing data in the observed samples. IGNNK and GLTL sometimes fail to work under those conditions because the data show substantial spatially correlated corruption (i.e., clouds). SATCN is robust to missing data, as it ensures that every location has observable neighbors through the adjacency matrix construction rule given in Eq. (2).
As expected, SATCN is more advantageous compared to other baselines when the missing ratio is higher. SATCN significantly outperforms the other baselines under 5T5S5M on the USHCN dataset. Its advantage may be attributed to the utilization of TCN, since temporal dependencies are more important when fewer nodes' information is observable.
Figure 5 gives a temporal visualization of the kriging results for one sensor from the METR-LA dataset under the 5T5S case. It is clear that the SATCN model produces the estimation closest to the true values. With the learned temporal dependencies, SATCN can better approximate the sudden speed drop of the morning peak during 5:00 AM-10:00 AM. We also visualize the results on the USHCN dataset under the 5T5S5M case in Figure 6; SATCN clearly outperforms the other methods in this sparsely observed case. To qualitatively illustrate the performance of SATCN under missing data, we also visualize the interpolation results for the areas covered by clouds in Figure 7. From the results we can see that SATCN gives more physically consistent predictions for areas covered by clouds. The results of KCN-SAGE and kNN contain more small-scale anomalies because they do not take advantage of temporal and distance information. OKriging generates over-smoothed predictions, which are not consistent with the observations in the known areas.
IV-B2 Parameter Sensitivity Analysis
SATCN has many parameter settings; the key parameters include the number of nearest neighbors $k$ for message passing, the TCN kernel length (temporal receptive field), the choice of aggregators and scalers, the number of spatial aggregation and temporal convolution layers, the number of simulated unknown nodes $n_m$, and the number of hidden neurons.
To evaluate the impact of $k$, we fix the TCN kernel length to 2, the number of hidden neurons to 32, and the number of simulated unknown locations $n_m$ to 10 for the METR-LA 7T8S case, and vary $k$. The results (mean value and standard deviation of the last 8 steps after convergence) are reported in Figure 8(a). The models with three neighbors achieve the lowest MAE. We speculate that traffic data can only propagate in a small spatial range: the correlation between close locations is strong but diminishes quickly as the distance increases. To evaluate the impact of the temporal kernel length, we fix $k$ to 3 and vary the kernel length from 1 to 3. The results are given in Figure 8(b). Compared with $k$, the temporal kernel length affects SATCN only marginally; varying the temporal receptive field has little impact on SATCN. Surprisingly, the model with kernel length 1 also performs well, and it gives a relatively lower deviation. A potential reason is that the strong spatial consistency alone could be sufficient to support the spatiotemporal kriging task; in other words, we can achieve a reasonably good result by performing kriging on each time snapshot separately.
The spatial aggregation network contains numerous aggregators and scalers. To distinguish their contributions, we also study the effects of different aggregator-scaler combinations (see Figure 8(c)). We evaluate several models: ALL denotes the model with all aggregators and scalers in Eq. (6); WSC denotes the model without the logarithmic and inverse logarithmic scalers; WME denotes the model without the mean-related aggregators (mean and weighted mean); WSO denotes the model without the softmax and softmin aggregators; WSTD denotes the model without the standard deviation aggregator; WDIS denotes the model without the distance-related aggregators (mean distance and standard distance deviation). In Figure 8(c), WME gives the worst performance, suggesting that the mean and weighted mean values of the neighbors contain the most important information for kriging. The models WSC, WSTD, WSO and WDIS all perform worse than the model with all aggregators and scalers, suggesting that using multiple aggregators and scalers improves kriging performance. In particular, WSTD gives relatively poor performance compared with the other ablated models except WME, and its error deviation is larger than that of the other models, indicating that the information captured by the standard deviation aggregator makes the kriging model easier to learn. The logarithmic and inverse logarithmic scalers have the smallest impact on model accuracy; however, they carry some global information about the training graph, as they are based on the average weighted degree of the graph constructed from all training locations.
The expressiveness of graph neural networks can be theoretically improved by increasing the depth and width of GNNs [loukas2019graph]. However, several factors impede deeper and wider GNNs from performing better [rong2019dropedge]. To study the effect of SATCN depth, we fix the TCN kernel length to 2, the number of nearest neighbors to 3, and the number of simulated unknown locations to 10, and vary the number of SATCN blocks with the number of hidden neurons fixed (Figure 8(d)). First, the SATCN model with only one layer is very difficult to converge, and it gives higher MAEs than its deeper competitors. In SATCN, the first layer only utilizes the masked adjacency matrix, which may lose some important information; therefore, a model with only one SATCN block is not sufficient for accurate kriging. Second, the model with three SATCN blocks is worse than the one with two blocks. This may be because the target dataset itself is locally correlated: a deeper SATCN enlarges the spatiotemporal receptive frame of every unknown location and thus incorporates unrelated information into the kriging model. To study the width of SATCN, we fix the depth to 2 and vary the number of hidden neurons (Figure 8(d)). We observe a large improvement when the number of hidden neurons reaches 32. However, there is no evidence that the models with 64 and 128 hidden neurons are better than the one with 32 neurons. We only observe that the models with larger widths exhibit lower variance on test MAEs.
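The depth discussion hinges on how stacking blocks enlarges the receptive field. A back-of-the-envelope sketch, assuming each SATCN block performs one round of neighbor aggregation followed by one non-dilated temporal convolution (an assumption on our part, not a statement of the exact architecture):

```python
def spatial_hops(num_blocks):
    # one neighbor-aggregation round per block: L blocks reach L-hop neighbors
    return num_blocks

def temporal_receptive_field(num_blocks, kernel_len):
    # each non-dilated temporal convolution with kernel length k
    # extends the receptive field by (k - 1) time steps
    return 1 + num_blocks * (kernel_len - 1)
```

Under these assumptions, going from two to three blocks extends both the spatial hop count and the temporal window, which is consistent with the observation that a deeper model can pull in unrelated, weakly correlated information.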
A key element of SATCN is the training sample generation procedure in Algorithm 1, which generates random graph structures by random masking. We also study the effect of the number of simulated unknown locations by varying it. The results are given in Figure 8(f). First, the models trained with a moderate number of simulated unknown locations give lower MAEs than the model without random masking. This indicates that the random masking strategy makes the trained graph neural networks more generalizable to a completely new graph. Second, the model with a very large number of simulated unknown locations gives high MAEs, and the standard deviation of the evaluation errors across training episodes is particularly high. This makes sense: masking too many locations makes the generated graph structures too fractured/sparse, making it difficult for the graph neural networks to converge.
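The random-masking idea can be sketched as follows (a minimal illustration of Algorithm 1's strategy; the function name `mask_sample` and the exact return signature are hypothetical):

```python
import numpy as np

def mask_sample(signal, adj, n_mask, rng):
    """Generate one training sample by simulating n_mask unknown locations.

    Masked readings are hidden and their outgoing edges removed, so no
    message passing originates from the simulated 'unobserved' locations.
    """
    n = signal.shape[0]
    masked = rng.choice(n, size=n_mask, replace=False)
    x = signal.copy()
    x[masked] = 0.0                   # hide the simulated unknown readings
    adj_m = adj.copy()
    adj_m[:, masked] = 0.0            # forbid messages from masked locations
    # inputs, kriging targets, masked indices, masked adjacency
    return x, signal[masked], masked, adj_m
```

Each call yields a different masked graph, so over training the network sees many graph structures rather than one fixed sensor layout; with `n_mask` too large, most columns of `adj_m` are zeroed and the remaining graph becomes too sparse to learn from.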
V Conclusion
In this paper, we propose a novel spatiotemporal kriging framework, SATCN, which uses spatial graph neural networks to capture spatial dependencies and temporal convolutional networks to capture temporal dependencies. Specifically, SATCN features a masking strategy that forbids message passing from unobserved locations, and multiple aggregators that allow the model to better characterize spatial dependencies. We evaluate SATCN on diverse types of real-world datasets, ranging from traffic speed to global temperature. Our results show that SATCN offers superior performance compared with baseline models in most cases. SATCN is also robust against missing data: it still works well on datasets with a missing data ratio of up to nearly 50%. SATCN is flexible in dealing with problems of diverse sizes in terms of the number of nodes and the length of the time window. This flexibility allows us to model time-varying systems, such as moving sensors or crowdsourcing systems. The masked spatial aggregation network proposed in this paper can also be viewed as a graph neural network for general missing data cases, and this framework can be further integrated into time series forecasting frameworks under missing data.