List of Abbreviations
AdaDT  Adaptive Boosted Decision Tree 
CNN  Convolutional Neural Networks 
ETR  Extra Trees Regressor 
GBRT  Gradient Boosted Decision Tree 
GCN  Graph Convolutional Networks 
GCNN  Gated Convolutional Neural Networks 
GLU  Gated Linear Unit 
GRU  Gated Recurrent Units 
IoT  Internet of Things 
LSTM  Long Short-Term Memory 
MAE  Mean Absolute Error 
MAPE  Mean Absolute Percentage Error 
RBF  Radial Basis Functions 
RF  Random Forest 
RMSE  Root Mean Squared Error 
RNN  Recurrent Neural Networks 
STGCN  Spatiotemporal Graph Convolutional Network 
TGCN  Temporal Graph Convolution Networks 
Xgboost  eXtreme Gradient Boosting 
1 Introduction
In the era of information explosion, with advancements in the Internet of Things (IoT) ren2017serving , large amounts of heterogeneous data are collected from multiple points that belong to different monitoring sources and are scattered across different locations meng2020survey . However, the complexity of heterogeneous data causes difficulties in information exchange and decision making. To leverage these heterogeneous data and transform them into reliable and accurate information, data fusion is commonly employed.
Data fusion fuses raw data collected from multiple monitoring points to obtain improved information hall1997introduction ; Ngiam2011689 . Typically, the fused information provides a more accurate description than independent observations and increases the robustness and confidence of the data application; the fused data are thus considered to have more complete value than the individual raw observations barcelo2019self ; Gite201685 . Furthermore, data fusion is capable of extending both the temporal and spatial coverage information and further decreasing the ambiguities and uncertainties of multiple monitoring station measurements Guo2019215 . Data fusion is widely applied in the fields of air traffic control, robotics, manufacturing, medical diagnostics, and environmental monitoring.
A typical application scenario involving heterogeneous data sources is environmental monitoring, where a common application for the collected heterogeneous data is air quality prediction. Air quality prediction informs government measures, including traffic restrictions, plant closures, and restrictions on outdoor activities. Typically, data from multiple distinct monitoring stations are required to develop a prediction model Wong2004404 . Air quality depends on external factors, one of the most important being meteorological conditions, which vary over time and space. This implies that a large amount of heterogeneous data related to meteorological conditions from meteorological monitoring stations is crucial for air quality prediction. In recent years, state-of-the-art machine learning algorithms have gradually been introduced to develop air quality prediction models from given datasets. The commonly utilized datasets consist primarily of heterogeneous data collected from the air quality monitoring stations and meteorological monitoring stations located in a study area He201811 ; Enebish2020 ; Elangasinghe2014106 ; Maharani2019 . Although this kind of predictive model performs well, it suffers from several limitations.
First, a typical practice for these predictive models is to directly take meteorological station data (e.g., temperature, humidity, wind speed, and wind direction) from the vicinity of the air quality monitoring stations as input. However, the data collected from meteorological stations have diverse spatial and temporal distributions that result from the influence of individual external factors and from complicated interactions among combinations of them. These common models ignore, to varying degrees, the effects of the complex multivariate relationships among the various monitoring targets. These relationships are reflected in spatial and temporal correlations.
Second, these predictive models primarily focus on local information and commonly adopt a limited number of air quality monitoring stations for prediction. In reality, it is difficult to present these local predictions as an indication of future trends for the entire study area.
Third, previous studies primarily treated their study areas as grid-based regions to acquire geospatial information Le202055 ; Zhang20194341 . Although the required computation is relatively fast when partitioning a research area into grids, it is challenging to extract irregular spatial correlations from these regular rectangular areas. Convolutional neural networks (CNNs) are effective at extracting local patterns from spatially correlated data; however, they are generally applicable to standard grid-based data and are not suitable for non-grid data.
To address the above problems, in this paper, considering the spatial correlations among monitoring points (e.g., air quality monitoring stations), we propose a deep learning method for fusing heterogeneous data collected from multiple monitoring points using graph convolutional networks (GCNs) to predict future trends of some observations. The essential idea behind the proposed method is first to fuse the heterogeneous data collected based on the locations of the monitoring points and then to perform prediction based on global information (i.e., fused information from all monitoring points within a study area) rather than local information (i.e., information from a single monitoring point within a study area).
The contributions of this paper can be summarized as follows.
(1) We consider the spatial distribution of monitoring points and fuse the heterogeneous data collected from these monitoring points, which contain spatial information, using an RBF-based method.
(2) We connect all monitoring points in the entire study area to construct a fully connected graph, thus replacing the traditional single monitoring point (or few monitoring points) with graph-structured data.
(3) We employ a deep learning model called the STGCN, which combines gated convolutional neural networks (GCNNs) and GCN layers in a particular structure, to predict the future trends of some observations [30]. The GCN layers capture spatial correlations in the constructed graph-structured data. The GCNN layers capture temporal correlations in the constructed sequence data.
(4) We evaluate the proposed method by applying it to air quality prediction based on a real-world dataset collected from multiple monitoring points.
The rest of this paper is organized as follows. Section 2 describes the proposed method in detail. Section 3 applies the method to a real case and analyzes the results. Section 4 discusses the advantages and shortcomings of the proposed method and potential future work. Section 5 concludes the paper.
2 Methods
2.1 Overview
In this paper, we propose a deep learning method for fusing heterogeneous data collected from multiple monitoring points using GCNs to predict the future trends of several observations. First, we obtain and clean the raw heterogeneous data from multiple monitoring points scattered over the study area. Second, we fuse the cleaned data using the proposed RBF-based method by considering the spatial correlations. Third, we construct the spatially and temporally correlated data based on the above fused data as the input for further prediction. Fourth, based on the constructed spatiotemporal data, we build a novel deep learning model called the spatiotemporal graph convolutional network (STGCN) for prediction. The predictive model combines GCN layers, which capture the spatial correlations in the constructed data, and GCNN layers, which capture the temporal correlations in the constructed data.
2.2 Step 1: Data Collection and Cleaning
With the advancement of IoT technology, the number of monitoring sources is increasing rapidly. As a result, the number of monitoring targets is rapidly increasing, and large amounts of heterogeneous data are generated within a short period of time. When utilizing heterogeneous data collected from multiple points, the data should first be cleaned to extract valuable information from these raw data. Specifically, the first step is checking the quality of the data to identify incorrect, inconsistent, missing, or redundant information. The second step is to clean (i.e., denoise) the data, including removing incorrect values and dealing with missing values.
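As an illustration, the cleaning step can be sketched as follows. This is a minimal example using pandas; the column name, the negative-value validity rule, and the three-hour interpolation limit are illustrative assumptions, not requirements of the method:

```python
import numpy as np
import pandas as pd

def clean_station_data(df: pd.DataFrame, limit_hours: int = 3) -> pd.DataFrame:
    """Clean hourly records from one monitoring station.

    - drops duplicate timestamps,
    - marks physically impossible values (here: negative PM2.5) as missing,
    - linearly interpolates gaps of up to `limit_hours` consecutive hours.
    """
    df = df[~df.index.duplicated(keep="first")].sort_index()
    # Example validity rule: pollutant concentrations cannot be negative.
    if "pm25" in df.columns:
        df.loc[df["pm25"] < 0, "pm25"] = np.nan
    # Fill short gaps only; longer gaps stay missing for later handling.
    return df.interpolate(method="linear", limit=limit_hours)

# Usage: hourly index with a two-hour gap and one invalid reading.
idx = pd.date_range("2017-01-01", periods=6, freq="h")
raw = pd.DataFrame({"pm25": [10.0, np.nan, np.nan, 16.0, -1.0, 20.0]}, index=idx)
cleaned = clean_station_data(raw)
```

Longer gaps than the chosen limit are deliberately left missing, since interpolating over long outages would fabricate trends.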
2.3 Step 2: Data Fusion
We assume that there are a total of N monitoring points scattered across a study area, belonging to several types of monitoring sources. For a given type of monitoring source s, the monitoring targets in a study area vary with the type of monitoring source. At each time point, for a given monitoring point that belongs to monitoring source s, an observation is recorded for each monitoring target c. In the entire study area, all monitoring sources monitor a total of C targets.
Typically, a pair of monitoring points separated by a short distance exhibits a stronger correlation than a pair separated by a long distance tobler1970computer . Based on this notion, we propose a method to fuse the heterogeneous data collected from these monitoring points. The proposed method considers the spatial location of each monitoring point in a study area. We incorporate the spatial correlations between independent observations into a distance-based fusion method, thus fusing the data from one set of monitoring points with the data from another set of monitoring points, the two sets belonging to different monitoring sources, according to their spatial information.
In a given study area, we assume that a set of monitoring points {p_1, …, p_n} belongs to one type of monitoring source, A, and that another set of monitoring points {q_1, …, q_m} belongs to another type of monitoring source, B. Based on the distances between the monitoring points, we fuse the observations of the points q into the points p, and conversely fuse the observations of the points p into the points q. Therefore, for the heterogeneous data collected from scattered monitoring points, a function f can be employed to perform the fusion; when evaluated at a source point p_i, f reproduces that point's observation, i.e., f(p_i) = y_i.
A linear combination of RBFs can construct such a function over these monitoring points (see Eq. (1)). As distance-based functions, RBFs are particularly well suited to weighting and fusing large amounts of data according to the spatial correlations between multiple scattered monitoring points Boyd20101435 . In this paper, we employ an RBF-based method to perform fusion, adopting the Gaussian basis function, i.e., the Gaussian RBF (see Eq. (2)).
f(p) = Σ_{i=1}^{n} w_i φ(‖p − p_i‖)   (1)

φ(r) = exp(−r² / (2σ²))   (2)
where φ is the Gaussian RBF and ‖·‖ represents the Euclidean distance; for two monitoring points p_i and p_j, the distance is r_{ij} = ‖p_i − p_j‖. The weights w_i can be estimated by solving a linear system of equations.
First, we calculate the Euclidean distances between the above n monitoring points and obtain a distance matrix of size n × n. Then, these distances are input into the Gaussian RBF to obtain a coefficient matrix Φ, whose size is also n × n (see Eq. (3)). Next, the observations from the n monitoring points are defined as a vector y = (y_1, …, y_n)^T. Thus, a linear system of equations Φw = y is obtained, yielding the aforementioned weights w. Finally, evaluating the resulting function at the other monitoring points yields their observations, i.e., fused data, for monitoring target c of monitoring source s (see Eq. (4)).

Φ = [φ(‖p_i − p_j‖)], i, j = 1, …, n   (3)

f(q) = Σ_{i=1}^{n} w_i φ(‖q − p_i‖)   (4)
After performing the RBF-based fusion process, a fusion matrix (FM) is obtained for a single time step (see Eq. (5)). Each row of the matrix represents a monitoring point, and each column represents a monitoring target. With N monitoring points and C monitoring targets in a study area, the size of the fusion matrix is N × C for a single time step; for T time steps, its size is T × N × C. The fusion matrix can be used to construct spatiotemporally correlated data.
FM = [f_{i,c}] ∈ R^{N×C}   (5)
The detailed process of RBFbased fusion is illustrated in Figure 2 and Figure 3. We assume that there are 13 monitoring points distributed in the given study area. These monitoring points belong to three different monitoring sources; different sources are represented by different colors: green for monitoring source 1, blue for monitoring source 2, and yellow for monitoring source 3. The number of monitoring targets depends on the type of monitoring source. Here, monitoring source 1 has two monitoring targets, monitoring source 2 has three monitoring targets, and monitoring source 3 has two monitoring targets, i.e., there are seven monitoring targets total in the study area.
We take a monitoring point p belonging to monitoring source 1 as an example to introduce how the RBF-based fusion is performed. As illustrated in Figure 2, since p belongs to monitoring source 1, the purpose of the fusion operation is to fuse the observations of both monitoring source 2 (containing three monitoring targets) and monitoring source 3 (containing two monitoring targets) into the data of p.
First, we calculate the Euclidean distances between the four monitoring points that belong to monitoring source 2, thus obtaining a distance matrix and a coefficient matrix Φ, both of size 4 × 4. The coefficient matrix is:
(6) 
Second, y = (y_1, y_2, y_3, y_4)^T are the observations from one monitoring target of monitoring source 2. We can solve for the weights w from the linear system Φw = y:
(7) 
Third, the distances from these four monitoring source 2 points to monitoring point p are calculated (see Figure 3).
Finally, by inputting these distances into the Gaussian RBF and evaluating Eq. (4), the fused result is obtained, in which monitoring point p fuses the data from this monitoring target of monitoring source 2. As illustrated in Figure 3, by repeating this process, a fusion matrix of size 13 × 7 is obtained at a single time step, where 13 represents the 13 monitoring points and 7 represents the 7 monitoring targets. After adding the temporal information, the assembled fusion matrix can be used to construct spatiotemporally correlated data for further model predictions.
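The RBF-based fusion steps above can be sketched in a few lines of NumPy. This is a minimal sketch for a single monitoring target at a single time step; the station coordinates, observations, and the width parameter sigma are hypothetical, and a full pipeline would loop over sources, targets, and time steps:

```python
import numpy as np

def gaussian_rbf(r, sigma=1.0):
    # Gaussian radial basis function, phi(r) = exp(-r^2 / (2 sigma^2)).
    return np.exp(-r**2 / (2 * sigma**2))

def rbf_fuse(source_xy, source_obs, target_xy, sigma=1.0):
    """Fuse one monitoring target from source stations onto target stations.

    source_xy : (n, 2) coordinates of the source stations
    source_obs: (n,)   observations at those stations
    target_xy : (m, 2) coordinates where fused values are needed
    """
    # Coefficient matrix Phi (n x n) from pairwise source distances.
    d_src = np.linalg.norm(source_xy[:, None, :] - source_xy[None, :, :], axis=-1)
    phi = gaussian_rbf(d_src, sigma)
    # Solve Phi w = y for the weights, then evaluate the interpolant
    # at the target locations.
    w = np.linalg.solve(phi, source_obs)
    d_tgt = np.linalg.norm(target_xy[:, None, :] - source_xy[None, :, :], axis=-1)
    return gaussian_rbf(d_tgt, sigma) @ w

# Four hypothetical source stations and one target location.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
obs = np.array([10.0, 12.0, 11.0, 13.0])
fused = rbf_fuse(src, obs, np.array([[0.0, 0.0]]))
```

Because the weights solve Φw = y exactly, evaluating the interpolant at a source station reproduces that station's own observation, matching the condition f(p_i) = y_i stated above.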
2.4 Step 3: Data Construction
In this section, we introduce the process of data construction for spatially and temporally correlated data.
2.4.1 Construction of Spatially Correlated Data
We adopt a weighted fully connected graph G = (V, E, W) to represent the non-Euclidean spatial correlations among multiple monitoring points, where V is the set of N nodes corresponding to the data collected from the monitoring points in a study area. Each node is connected to every other node, implying the existence of correlations between all monitoring points in the entire study area (each pair of nodes has an edge e_{ij}). The number of edges is N(N − 1)/2. A weighted adjacency matrix W is used to represent the similarities between nodes (see Eq. (8)); its elements are defined as in Eq. (9). In the matrix, each element w_{ij} is the weight of edge e_{ij}, representing the spatial correlation between two nodes (i.e., monitoring points i and j). The weight is calculated using a Gaussian similarity function based on the distance between the two nodes. A greater weight implies a stronger correlation between the two nodes.
W = [w_{ij}], i, j = 1, …, N   (8)

w_{ij} = exp(−d_{ij}² / (2σ²))   (9)
where d_{ij} denotes the distance between monitoring points i and j, and σ is the standard deviation of the distances, which controls the width of the neighborhoods VonLuxburg2007395 .

2.4.2 Construction of Temporally Correlated Data
To represent the temporal correlations among the heterogeneous data collected from multiple monitoring points, we construct a fused observation vector v_t of the N monitoring points at time step t, where each element records the historical observations (i.e., fused data from the monitoring targets) of a monitoring point. Specifically, a frame v_t denotes the current status of the fusion matrix at time step t. The corresponding information is stored in the fully connected graph G. A typical time series prediction problem requires the use of data from the previous M time steps to predict the data for the next H time steps. Based on the vector v_t, the prediction problem can be defined as in Eq. (10).
v̂_{t+1}, …, v̂_{t+H} = argmax P(v_{t+1}, …, v_{t+H} | v_{t−M+1}, …, v_t)   (10)
There are N monitoring points in the study area that monitor a total of C targets. Then, [v_{t−M+1}, …, v_t] is the fusion matrix FM for the previous M time steps, and [v̂_{t+1}, …, v̂_{t+H}] contains the prediction results of the C monitoring targets for the next H time steps.
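The two construction steps of this section can be sketched together. This is a minimal sketch with NumPy; the five station coordinates and the random fused tensor are placeholders, and setting sigma to the standard deviation of the pairwise distances follows the description above:

```python
import numpy as np

def gaussian_adjacency(coords, sigma=None):
    """Weighted fully connected adjacency, w_ij = exp(-d_ij^2 / (2 sigma^2))."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    if sigma is None:
        # Standard deviation of pairwise distances controls neighborhood width.
        sigma = d[np.triu_indices_from(d, k=1)].std()
    w = np.exp(-d**2 / (2 * sigma**2))
    np.fill_diagonal(w, 0.0)  # no self-loop weights
    return w

def make_windows(fm, m, h):
    """Slice a fused tensor fm of shape (T, N, C) into (X, Y) pairs:
    X holds the previous m steps, Y the next h steps."""
    xs, ys = [], []
    for t in range(m, fm.shape[0] - h + 1):
        xs.append(fm[t - m:t])
        ys.append(fm[t:t + h])
    return np.stack(xs), np.stack(ys)

coords = np.random.rand(5, 2)    # 5 hypothetical stations
fm = np.random.rand(20, 5, 4)    # 20 steps, 5 stations, 4 targets
W = gaussian_adjacency(coords)
X, Y = make_windows(fm, m=12, h=3)
```

Each training sample thus pairs an M-step history window with the following H-step horizon, while W encodes the static spatial similarity between stations.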
2.5 Step 4: Data Modeling
In this section, we introduce details of a novel deep learning model called the spatiotemporal graph convolutional network (STGCN), which adopts the constructed spatiotemporal data as input. First, we introduce the GCN that is used to capture spatially correlated representations from the constructed graphstructured data. Second, we introduce the GCNN that can be used to capture temporally correlated representations from the constructed sequence data. Finally, we introduce the STGCN, which fuses the spatial and temporal representations to predict the future trends of some observations.
2.5.1 Use of the Graph Convolutional Network (GCN)
In this section, we introduce how to use GCN to capture spatially correlated representations from the constructed graphstructured data.
Recently, GCNs have attracted increasing attention; they generalize conventional convolution for data with nonEuclidean structures Hammond2011129 and are thus perfectly suitable for extracting spatial correlations from graphstructured data. Structurally, through action on the nodes of the fully connected graph and their neighbors, the GCN is able to capture spatial correlations between the nodes and their surrounding nodes and encode the attributes of the nodes (i.e., different information stored at each node).
There are two common types of methods for performing graph convolution, i.e., spatial methods and spectral methods; in this paper, we focus on the latter. Specifically, graph convolution operations learn the spatially correlated representations of graph-structured data in the spectral domain using graph Fourier transforms. According to the theory of spectral graph convolution, the graph convolution operator can be defined as Θ *_G x = Θ(L)x, which denotes the multiplication of a graph signal x with a kernel Θ. A Fourier basis U acts on the nodes of the fully connected graph and their first-order neighbors to capture the spatial correlations among these nodes. Consequently, the convolutional process of a GCN can be described as follows: a graph signal x is filtered by a kernel Θ via multiplication between the kernel and the graph Fourier transform U^T x (see Eqs. (11) and (12)) Shuman201383 . The structure of the GCN is illustrated in Figure 6.

Θ *_G x = Θ(L)x = Θ(UΛU^T)x = UΘ(Λ)U^T x   (11)

L = I_n − D^{−1/2} W D^{−1/2} = UΛU^T   (12)
where L denotes the normalized Laplacian matrix, U is the matrix composed of the eigenvectors of L, I_n denotes an identity matrix, D denotes the diagonal degree matrix with D_{ii} = Σ_j W_{ij}, and Λ denotes the diagonal matrix of eigenvalues of L.

To model higher-order neighborhood interactions in the graph, multiple graph convolutional layers can be stacked, and thus a deep architecture can be constructed to capture deep spatial correlations Duvenaud20152224 . The multilayer model requires scaling and normalization, and the graph convolution is then further expressed as in Eq. (13).
Θ *_G x = θ (D̃^{−1/2} W̃ D̃^{−1/2}) x   (13)

where θ is a single shared parameter of the kernel Θ; W and D are renormalized to W̃ = W + I_n and D̃, respectively, where W̃ is a matrix with a self-connection structure, I_n is the identity matrix, and D̃ denotes the diagonal degree matrix with D̃_{ii} = Σ_j W̃_{ij}.
The above graph convolution operator is mainly applicable to single-channel graph signals x ∈ R^n. For a graph signal X with C_i channels (the channels here refer to the monitoring targets), the operator can be extended to multidimensional tensors and is defined as in Eq. (14).

y_j = Σ_{i=1}^{C_i} Θ_{i,j}(L) x_i, 1 ≤ j ≤ C_o   (14)

where L denotes the normalized Laplacian matrix, C_i and C_o are the input and output channel dimensions, respectively, and Θ ∈ R^{K×C_i×C_o}, where K is the kernel size of the graph convolution.
2.5.2 Use of the Gated Convolutional Neural Networks (GCNN)
In this section, we introduce how to use GCNNs to capture temporally correlated representations from the constructed sequence data.
Compared to a traditional CNN, a GCNN adds a special gating mechanism that allows it to capture temporal correlations from time series data. Compared to the RNN, a traditional time series model based on a complex gating mechanism, the GCNN has a simpler structure, which enables it to respond faster to dynamic changes and consequently to train faster, with each step independent of previous steps 21 . Furthermore, the GCNN captures the spacing between cells (i.e., time steps) through its filters, allowing it to further capture the relationships between the different observations in the time series data.
The structure of the GCNN is illustrated in Figure 6. The GCNN employs a causal convolution as a temporal convolution layer, followed by a gated linear unit (GLU) as a nonlinear activation function. These temporal convolution layers are stacked using residual connections. The GLU determines which information is passed to the next level. By handling sequence data with a nonrecursive approach, the temporal convolution layer is easily computed in parallel and ameliorates the gradient explosion problem that exists in traditional RNN-based learning methods. As a 1D convolution layer (1D-Conv), the temporal convolution layer is convoluted along the temporal dimension with a filter of width K_t to explore K_t neighbors of the input elements.

For the constructed sequence data, the input of the temporal convolution layer at each node is considered as a sequence of length M. Each node has C_i channels, so the input is Y ∈ R^{M×C_i}. The convolution thus involves a linear transformation that takes C_i channels, K_t consecutive data points at a time, and maps them into 2C_o output channels. As a result, the length M − K_t + 1 of the output sequence is less than the length of the input. Given the filter (i.e., the convolution kernel) Γ ∈ R^{K_t×C_i×2C_o}, the temporal convolution can be expressed as Eq. (15).

Γ *_T Y = P ⊙ σ(Q) ∈ R^{(M−K_t+1)×C_o}   (15)
where P and Q are the two halves of the gated convolution output, and the kernels and biases that produce them are model parameters. ⊙ denotes the element-wise (Hadamard) product. σ(Q) denotes the sigmoid gate; it determines which information of the input is relevant to the structure and dynamic evolution of the time series data. The same convolution kernel Γ is employed for all nodes in the graph G, acting on a fusion matrix of M time steps.
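The gated temporal convolution of Eq. (15) can be sketched for a single node as follows. This is a minimal NumPy sketch (no biases or residual connection, random kernel values) meant only to show the windowing and the GLU split into content P and gate Q:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_temporal_conv(y, gamma):
    """Gated causal 1-D convolution with a GLU.

    y     : (m, c_in)            input sequence for one node
    gamma : (k_t, c_in, 2*c_out) kernel; output channels split into
                                 halves P (content) and Q (gate)
    """
    k_t, c_in, two_c_out = gamma.shape
    c_out = two_c_out // 2
    m = y.shape[0]
    out = np.zeros((m - k_t + 1, two_c_out))
    for t in range(m - k_t + 1):
        # One window of k_t consecutive steps, linearly mapped to 2*c_out channels.
        out[t] = np.tensordot(y[t:t + k_t], gamma, axes=([0, 1], [0, 1]))
    p, q = out[:, :c_out], out[:, c_out:]
    return p * sigmoid(q)  # GLU: P ⊙ sigma(Q)

y = np.random.rand(12, 4)         # m = 12 steps, c_in = 4 channels
gamma = np.random.rand(3, 4, 16)  # k_t = 3, c_out = 8
h = gated_temporal_conv(y, gamma)
```

Note how the output is shortened to M − K_t + 1 steps, and how the sigmoid gate modulates, per element, how much of the content half is passed on.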
2.5.3 Use of the Spatiotemporal Graph Convolutional Network (STGCN)
In this section, we introduce how to use the STGCN to predict the future trends of some observations based on the spatial and temporal representations described above. The STGCN is a deep learning model that includes multiple spatiotemporal convolutional blocks (STConv blocks). The STGCN is designed to handle graphstructured time series data for fusing spatial and temporal representations and performing prediction 22 .
Figure 7 illustrates the specific structure of the STConv block, which contains two temporal gated convolution layers with a spatial graph convolution layer placed between them as a connection. This design decreases the number of channels in the middle layer, thus decreasing the number of parameters involved in the computation and speeding up the training process. Taking sequence data as input, v^l denotes the input of the l-th STConv block and v^{l+1} the output, which can be obtained by Eq. (16).
v^{l+1} = Γ_1^l *_T ReLU(Θ^l *_G (Γ_0^l *_T v^l))   (16)

where Γ_0^l and Γ_1^l denote the upper and lower temporal kernels of the l-th STConv block, respectively, Θ^l denotes the kernel of the graph convolution, and ReLU(·) is the rectified linear unit function.
A typical STGCN stacks two STConv blocks and ends with an extra temporal convolution layer and a fully connected layer. Here, the temporal convolution layer maps the output of the last STConv block to the fully connected layer, which is regarded as the output layer. The prediction results for all monitoring points in the entire study area are returned from this fully connected layer. Moreover, the STGCN measures the performance of the model by its L2 loss, and the predictive loss function is defined as follows.
L(v̂; W_θ) = Σ_t ‖v̂(v_{t−M+1}, …, v_t, W_θ) − v_{t+1}‖²   (17)

where v̂ is the prediction of the model, v_{t+1} denotes the ground truth, v_{t−M+1}, …, v_t are the previous M time steps, and W_θ denotes the trainable parameters of the model.
The detailed modeling process done by employing the STGCN is illustrated in Figure 8. In this paper, the channels of the three layers in the STConv block are set as 32, 8, and 32.
First, the input data are fed into the GCNN layer, and the 1D-Conv handles the temporal information of the monitoring points. For each monitoring point with input length M, its input channels are simultaneously convoluted along the time dimension; consequently, the output length for each point is M − K_t + 1. The convolution kernel in the GCNN maps each input to two channel halves, which the GLU then uses for activation: after consuming these two halves of length M − K_t + 1, the GLU generates a single output of length M − K_t + 1. Once all the data have been processed, each node (i.e., monitoring point) in the graph-structured data has 32 channels.
Next, the tensor resulting from the temporal convolution is fed into the GCN layer. At each time step, the input to the GCN layer consists of all monitoring points with 32 channels each. All monitoring points are connected via the fully connected graph. For a single monitoring point, the remaining monitoring points connected to it are selected as a subset. The graph convolution operator acts on this subset of monitoring points by reweighting the relevant data; the selected points and the weighting calculation are determined by the graph-structured data. The input channel count for each time step is 32, and the number of output channels of the GCN layer is 8. The convolution operations within STConv block 2 follow the same process. The spatiotemporal correlations in the constructed data are captured through this continued convolution.
Finally, an output layer (i.e., a fully connected layer) performs the ultimate prediction and outputs a tensor containing the predicted values of all monitoring points at a single time step.
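The shape walk-through above can be condensed into a sketch of one STConv block, Eq. (16). This is a minimal NumPy stand-in, not a trainable implementation: random kernels, an identity matrix as a placeholder for the normalized adjacency, and the 32-8-32 channel sizes quoted above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_glu(x, gamma):
    """x: (m, n, c_in); gamma: (k_t, c_in, 2*c_out) -> (m-k_t+1, n, c_out)."""
    k_t, _, two = gamma.shape
    c_out = two // 2
    m, n, _ = x.shape
    out = np.zeros((m - k_t + 1, n, two))
    for t in range(m - k_t + 1):
        out[t] = np.tensordot(x[t:t + k_t], gamma, axes=([0, 2], [0, 1]))
    return out[..., :c_out] * sigmoid(out[..., c_out:])

def graph_conv_relu(x, a_hat, theta):
    """x: (m, n, c_in); a_hat: (n, n); theta: (c_in, c_out), with ReLU."""
    return np.maximum(np.einsum("ij,tjc,cd->tid", a_hat, x, theta), 0.0)

def st_conv_block(x, a_hat, g0, theta, g1):
    """Eq. (16): Gamma1 *_T ReLU(Theta *_G (Gamma0 *_T v))."""
    return temporal_glu(graph_conv_relu(temporal_glu(x, g0), a_hat, theta), g1)

n, m, k_t = 53, 12, 3
x = np.random.rand(m, n, 1)          # one input channel per node
a_hat = np.eye(n)                    # placeholder normalized adjacency
g0 = np.random.rand(k_t, 1, 2 * 32)  # channels 1 -> 32
theta = np.random.rand(32, 8)        # channels 32 -> 8
g1 = np.random.rand(k_t, 8, 2 * 32)  # channels 8 -> 32
out = st_conv_block(x, a_hat, g0, theta, g1)
```

Each temporal layer shortens the sequence by K_t − 1 steps, so the block maps 12 input steps to 12 − 2(K_t − 1) = 8 output steps while the node count stays fixed.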
3 Results: A Real Case
In this section, we apply the proposed method to predict air quality based on a realworld case involving heterogeneous data collected from multiple monitoring points. Moreover, we evaluate and analyze the results.
3.1 Data Fusion
In this section, we introduce heterogeneous data fusion based on a real dataset. First, we describe the real dataset in detail and perform RBFbased fusion. Second, we analyze the consistency of the data fusion output using three diverse indicators. Finally, we evaluate the effectiveness of the data fusion process using five machine learning algorithms.
3.1.1 Data Description
We collected a real dataset related to air quality in Beijing for the period of January 1, 2017, to February 1, 2018, from the Harvard Dataverse 23 . The dataset consists of two types of monitoring sources: 35 meteorological monitoring stations that collect data related to meteorological conditions (e.g., temperature, humidity, wind speed, etc.) and 17 air quality monitoring stations that collect data related to air quality (e.g., concentrations of various pollutants). These monitoring stations are scattered in different regions throughout the city. Each station records a variety of data every hour. Therefore, the data collected from the multiple monitoring stations in this real case are typical heterogeneous data.
We utilized three observations (i.e., temperature, humidity, and wind speed) from the meteorological monitoring stations and one observation (i.e., the concentration of PM2.5) from the air quality monitoring stations. The concentration of PM2.5 was adopted as the predicted value. Figure 13 illustrates the layout of all 53 monitoring stations. The frequency of the measurements was once per hour, and the dataset was aggregated into hourly intervals. Each observation for each station contains 8784 records.
We employed the proposed RBF-based data fusion method at each time step t. After cleaning the data, we utilized linear interpolation to impute missing values spanning up to three consecutive hours. Subsequently, we fused the data from the 35 meteorological monitoring stations and 17 air quality monitoring stations, with a total of four monitoring targets, into a fusion matrix of size 53 × 4 per time step, where 53 denotes the total number of monitoring stations and 4 denotes the total number of monitoring targets.

3.1.2 Consistency Analysis of Data Fusion
We analyzed the consistency of data fusion using three indicators. First, we compared the variances of the raw and fused data. Second, we compared the distributions of the kernel functions for the raw and fused data. Third, we compared the raw and fused data in terms of their time series distributions for different monitoring targets.
First, we compared the variances of the raw and fused data, as illustrated in Figure 9. The variance of the fused data achieved satisfactory consistency with the variance of the raw data. This result implies that the proposed RBFbased fusion method enables heterogeneous data to achieve a data distribution similar to that of the raw data. Furthermore, for one observation (e.g., the concentration of PM2.5), we compared the temporal distribution of the variance of the fused data with that of the raw data. The results indicate that the trend exhibited satisfactory consistency.
Second, we compared the kernel density estimation of the raw and fused data, as illustrated in Figure
10. A kernel density estimation is an indicator for comparing the distributions of two batches of data by estimating their probability densities. The density trajectories of the raw and fused data appear to be almost superimposed, further demonstrating that the distribution of the fused data is remarkably similar to that of the raw data.
Third, we compared the raw and fused data in terms of their time series distributions for different monitoring targets. The time series distributions of the raw and fused data are illustrated in Figure 11. Obviously, the raw and fused data for the four monitoring targets maintain temporal consistency.
3.1.3 Effectiveness Evaluation of Data Fusion
To further evaluate the effectiveness of the proposed RBF-based data fusion method, we randomly selected an air quality monitoring station and compared its predictions based on raw data and fused data by executing five ensemble machine learning algorithms: the extra trees regressor (ETR) 24 , adaptive boosted decision tree (AdaDT) 25 , random forest (RF) 26 , gradient boosted decision tree (GBRT) 27 , and eXtreme gradient boosting (Xgboost) 28 . It should be noted that we aimed only at a rough comparison between the raw and fused data; thus, the number of estimators for these ensemble machine learning algorithms was set to 200, with default values used for all remaining parameters. Here, the raw-data predictions were obtained directly using data from nearby meteorological monitoring stations as features. The data were divided into training and test sets at a ratio of 8:2.
We adopted four common metrics to evaluate these five models’ performances: (1) mean absolute error (MAE), (2) mean absolute percentage error (MAPE), (3) root mean square error (RMSE), and (4) R-squared (R²); see Eqs. (18), (19), (20), and (21). For the MAE, MAPE, and RMSE, a smaller value indicates better performance of the prediction model. For R², which ranges from 0 to 1, a larger value indicates better performance. All metric results of the above five ensemble machine learning algorithms are presented in Table 1 and Figure 12.
MAE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i|   (18)

MAPE = (1/n) Σ_{i=1}^{n} |(ŷ_i − y_i) / y_i|   (19)

RMSE = √((1/n) Σ_{i=1}^{n} (ŷ_i − y_i)²)   (20)

R² = 1 − Σ_{i=1}^{n} (ŷ_i − y_i)² / Σ_{i=1}^{n} (y_i − ȳ)²   (21)

where ŷ_i and y_i denote the predicted results and the ground truth, respectively, ȳ is the mean of the ground truth values, and n denotes the total number of predicted values.
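The four metrics of Eqs. (18)-(21) can be computed directly; the example inputs below are made-up numbers for illustration, and the MAPE assumes no zero ground-truth values:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """MAE, MAPE, RMSE, and R^2 for a vector of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true))  # undefined if any y_true is zero
    rmse = np.sqrt(np.mean(err**2))
    r2 = 1.0 - np.sum(err**2) / np.sum((y_true - y_true.mean())**2)
    return mae, mape, rmse, r2

y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
mae, mape, rmse, r2 = regression_metrics(y_true, y_pred)
```

Note that RMSE is never smaller than MAE for the same errors, which is a quick sanity check when reading metric tables.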
Algorithm  Data        MAE    RMSE     MAPE    R²
ETR        Raw data    26.33  1201.10  0.0037  113%
ETR        Fused data  23.74  1151.63  0.0947  101%
AdaDT      Raw data    29.25  1626.05  0.0034  107%
AdaDT      Fused data  28.53  1128.20  0.1131  90%
RF         Raw data    24.37  1388.08  0.1513  116%
RF         Fused data  23.78  1122.56  0.2175  104%
GBRT       Raw data    23.87  1259.66  0.0448  121%
GBRT       Fused data  23.55  1124.43  0.1161  105%
Xgboost    Raw data    20.15  1005.76  0.1657  73%
Xgboost    Fused data  19.77  917.21   0.2790  72%
As presented in Table 1 and Figure 12, for all five machine learning algorithms, the fused data yield smaller MAE, RMSE, and MAPE values and larger $R^2$ values than the raw data. These results indicate that the prediction models based on fused data perform considerably better than those based on raw data, thereby demonstrating the effectiveness of the data derived from the proposed RBF-based data fusion method for developing predictive models.
3.2 Spatiotemporal Prediction Using Graph Neural Networks
3.2.1 Construction of Spatially and Temporally Correlated Data
Data construction consists of (1) the construction of spatially correlated data and (2) the construction of temporally correlated data. For the spatially correlated data, to construct a weighted fully connected graph, we connected the 53 monitoring stations to each other to assemble a weighted matrix of size 53 × 53 that represents the spatial relationships among the monitoring stations with a graphical structure. The values in the matrix denote the similarity between the monitoring stations in the study area. The visualization of this matrix is illustrated in Figure 13, where a darker color indicates a greater correlation. For the temporally correlated data, we defined the data interval as one hour. Consequently, the size of the constructed sequence data is 8784 × 53 × 4, where 8784 is the total length of the time series with hourly intervals, 53 is the total number of monitoring stations scattered across the entire study area, and 4 is the sum of the numbers of monitoring targets for all monitoring sources. Each value of the constructed sequence data represents the change in each data point over time. Furthermore, we exploited the min-max normalization method to scale the values into the range [0, 1]. Finally, 60% of the data were utilized for training, 20% for validation, and the remaining 20% for testing.
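The data construction steps above can be sketched as follows. This is a simplified illustration under our own assumptions: the station coordinates are random stand-ins, the Gaussian bandwidth `sigma` is hypothetical, and the paper's exact similarity function and any thresholding may differ.

```python
import numpy as np

def weighted_adjacency(coords, sigma=1.0):
    """Fully connected weighted matrix W (N x N) from station coordinates,
    using a Gaussian similarity w_ij = exp(-d_ij^2 / (2 sigma^2))."""
    d2 = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def min_max(x):
    """Scale values into the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

# 53 stations, 8784 hourly steps, 4 monitored targets (shapes from the paper).
rng = np.random.default_rng(0)
coords = rng.uniform(size=(53, 2))               # hypothetical locations
W = weighted_adjacency(coords, sigma=0.5)        # (53, 53) weighted matrix
seq = min_max(rng.normal(size=(8784, 53, 4)))    # sequence data in [0, 1]

# 60/20/20 chronological split for training, validation, and testing.
n = seq.shape[0]
train = seq[: int(0.6 * n)]
val   = seq[int(0.6 * n): int(0.8 * n)]
test  = seq[int(0.8 * n):]
```

The resulting `W` plays the role of the weighted adjacency matrix, and `seq` the role of the sequence of fused observation vectors.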
3.2.2 Experimental Settings
The experimental settings included (1) the implementation and testing environment, (2) the determination of model parameters, and (3) the baseline methods adopted for performance evaluation purposes.
We implemented the proposed deep learning model based on the PyTorch framework [29]. The training process was performed on a workstation equipped with a Quadro P6000 GPU with 24 GB of GPU memory, an Intel Xeon Gold 5118 CPU, and 128 GB of RAM.
The detailed hyperparameter settings were as follows. The first step was to set the model parameters: both the size of the graph convolution kernel in the GCN and the size of the temporal convolution kernel in the GCNN were set to 3, and we added a dropout layer with a rate of 0.3. The next step was to set the training parameters: we set the learning rate to 0.001, the batch size to 32, and the number of training epochs to 350, and we employed the Adam optimizer to minimize the loss function during training [30]. We utilized 12 hours as the historical time window for predicting the PM2.5 concentration at the next step (i.e., 3 hours ahead). Furthermore, we adopted two common metrics, the MAE and RMSE, to evaluate the performance of the model.
To comparatively evaluate the performance of the proposed method, we employed three baseline deep learning methods: (1) LSTM, a variant of the recurrent neural network (RNN) able to analyze time-series data [31]; (2) GRU, another RNN variant for analyzing time-series data [32]; and (3) TGCN, a model that fuses the spatial and temporal correlations of a single observation to perform prediction [33]. Furthermore, we compared the performance of the STGCN model based on different numbers of monitored targets, expressed as STGCN-K1, STGCN-K2, STGCN-K3, and STGCN-K4. K1 refers to utilizing only the predictive target (i.e., the PM2.5 concentration) as input; K2 refers to utilizing the predictive target and one additional sampling observation as inputs; K3 refers to utilizing the predictive target and two additional sampling observations as inputs; and K4 refers to utilizing all observations in the dataset as inputs.
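The 12-hour-history, 3-hour-ahead setup described above amounts to a sliding-window slicing of the sequence data. A minimal sketch follows; `make_windows` and its defaults are our own names, and the paper's exact windowing may differ.

```python
import numpy as np

def make_windows(seq, history=12, horizon=3, target_idx=0):
    """Slice (T, N, F) sequence data into samples: each input is a
    `history`-step window over all F observations, and each label is the
    PM2.5 channel (target_idx) `horizon` steps after the window ends."""
    X, y = [], []
    for t in range(len(seq) - history - horizon + 1):
        X.append(seq[t : t + history])                        # (history, N, F)
        y.append(seq[t + history + horizon - 1, :, target_idx])  # (N,)
    return np.stack(X), np.stack(y)

# Toy sequence: 20 hourly steps, 53 stations, 4 observations.
seq = np.arange(20 * 53 * 4, dtype=float).reshape(20, 53, 4)
X, y = make_windows(seq)   # X: (6, 12, 53, 4), y: (6, 53)
```

Each (input, label) pair is then fed to the STGCN (or a baseline) in batches of 32.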
3.2.3 Predicted Results
We compared the performances of the aforementioned models on the constructed data; for details on the data construction process, see Subsection 2.4. The results are presented in Table 2 and Figure 14.
Model  MAE  RMSE
LSTM  0.0158  0.0300
GRU  0.0150  0.0200
TGCN  0.0134  0.0241
STGCN-K1  0.0129  0.0189
STGCN-K2  0.0127  0.0194
STGCN-K3  0.0129  0.0195
STGCN-K4  0.0120  0.0184
As illustrated in Table 2 and Figure 14, the STGCN model achieves the best performance among all compared models. Of these models, LSTM and GRU perform predictions utilizing the time series of a single observation (i.e., the PM2.5 concentration) at a single monitoring station as input. Both TGCN and STGCN employ graph convolution to predict time series on the constructed graph-structured data, i.e., they take spatiotemporal correlations into account; their MAE values are accordingly smaller than those of the traditional LSTM and GRU. The STGCN is capable of performing predictions utilizing multiple observations (i.e., external factors that influence air quality) as inputs, while the TGCN performs predictions based on the spatiotemporal correlation of a single observation (i.e., air quality itself); the STGCN thus obtained the best performance. Furthermore, among the four STGCN models, the best performance was obtained by STGCN-K4 because its input contains all the external information in the fusion matrix, i.e., humidity, temperature, and wind speed; the more information that is fused, the better the model performs. However, STGCN-K3 did not perform as well as STGCN-K2, indicating that the prediction results are presumably related to the particular combinations of the different observations.
To further investigate the effect of different numbers of observations on the model prediction results, we plotted the RMSE and MAE curves of the four STGCN models. As illustrated in Figure 15, the performances of these STGCN models are similar.
(a) RMSE versus the training epochs. (b) MAE versus the training epochs.
We further investigated the influences that different combinations of observations have on the prediction results. For the four monitoring targets of the utilized dataset, PM denotes the concentration of PM2.5, HU denotes the humidity, TE denotes the temperature, and WS denotes the wind speed. Different combinations of observations were fed into the STGCN model, and the prediction results are listed in Table 3.
Model  MAE  RMSE
STGCN-PM-HU  0.0127  0.0194
STGCN-PM-TE  0.0123  0.0189
STGCN-PM-WS  0.0124  0.0191
STGCN-PM-HU-TE  0.0129  0.0195
STGCN-PM-HU-WS  0.0126  0.0190
STGCN-PM-TE-WS  0.0131  0.0193
STGCN-ALL  0.0120  0.0184
Figure 16 demonstrates the effects of the different combinations of observations, including PM-HU (PM2.5 and humidity), PM-TE (PM2.5 and temperature), PM-WS (PM2.5 and wind speed), PM-HU-TE (PM2.5, humidity, and temperature), PM-HU-WS (PM2.5, humidity, and wind speed), PM-TE-WS (PM2.5, temperature, and wind speed), and ALL (all observations, i.e., K4). The results indicate that the predictions using all the fused information achieve the best performance.
4 Discussion
In this paper, by considering the spatial correlations among monitoring points in specific application scenarios, we propose an RBF-based method for fusing heterogeneous data depending on the distances between monitoring points and present a novel deep learning method that uses a GCN to perform prediction on the processed data. In this specific air quality prediction scenario, monitoring points that belong to different monitoring sources and are scattered across different locations collect various observations over time. The results demonstrate the consistency and effectiveness of the fused data obtained by the proposed RBF-based method and further demonstrate the effectiveness of the employed deep learning method.
In this section, we discuss the advantages and shortcomings of the proposed method. Moreover, we discuss some potential future work that may be conducted to address these shortcomings.
4.1 Advantages of the Proposed Method
The proposed deep learning method achieves satisfactory prediction performance based on processed data (i.e., the data are first fused and then constructed; for details, see Section 2). The advantage and essential idea of the proposed method are the consideration of the spatial correlations in heterogeneous data collected from multiple monitoring points for specific application scenarios.
First, we consider the spatial distribution of the monitoring points scattered in a study area and fuse the heterogeneous data collected from these points with spatial information using the RBF-based data fusion method. This RBF-based fusion method has the following advantages. (1) Each monitoring point is able to obtain data depending on the distances between itself and its neighbors as well as the corresponding information from those neighbors. This implies that the RBF-based method allows for exploring multiple correlations among heterogeneous data according to their locations and expressing the complex interactions among external factors as spatial correlations. These spatial correlations are then incorporated into the distance-based fusion method, thus achieving highly accurate fusion. (2) The RBF-based method yields more interpretable results than other methods. (3) As a function of distance, the RBF is suitable for weighting and fusing large amounts of data from multiple points scattered across a study area, implying that the proposed model is highly applicable to heterogeneous data with explicit location information.
Second, we propose replacing the traditional single monitoring point or several monitoring points with graph-structured data. This approach has the following advantages. (1) Local information is replaced by global information. (2) When the fused information is embedded within the fully connected graph constructed from all the monitoring points, a prediction regarding the future trends for the entire study area can be achieved. (3) The necessity of selecting appropriate monitoring points is eliminated, thus increasing the efficiency of the method. (4) The graph-structured data avoid the situation in which a limited number of monitoring points are adopted and information from the other monitoring points is ignored.
Finally, we capture the spatial correlations in the constructed data by employing a state-of-the-art deep learning model, the GCN. The GCN extracts the information stored at each node (i.e., monitoring point) through convolutional operations on the graph, and it is thus particularly applicable to non-grid data. Furthermore, we employ a deep learning model named the STGCN that combines a GCNN (a deep learning model that captures temporal correlations from the constructed sequence data) and the GCN (a deep learning model that captures spatial correlations from the constructed graph-structured data) and thus achieves satisfactory prediction performance.
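As an illustration of the graph-convolution operation the GCN builds on, a minimal NumPy sketch of the standard propagation rule is given below. This is a simplification under our own assumptions; the paper's STGCN additionally uses graph kernels of size 3 and gated temporal convolutions.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer on node features H with weights W:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W), the standard GCN rule."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

A = np.array([[0.0, 1.0], [1.0, 0.0]])  # toy 2-node graph
H = np.ones((2, 3))                      # node features (hypothetical)
W = np.eye(3)                            # layer weights (hypothetical)
H1 = gcn_layer(A, H, W)                  # propagated features, shape (2, 3)
```

Each node's new features are a normalized mixture of its own features and its neighbors', which is how the GCN aggregates information across monitoring points.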
4.2 Shortcomings of the Proposed Method
There are three limitations of the proposed method. First, the method is designed specifically for fusing data from monitoring points with explicit location information; it is therefore not applicable to datasets without location information. Second, we only adopt a fully connected graph to construct the graph-structured data. In the above case, the real dataset consists of a limited number of nodes (53), so the interactions between all pairs of nodes can be considered. Once a large number of nodes are distributed across a study area, however, the construction of a fully connected graph increases the computational cost of the method. Finally, regarding the deep learning predictions, the utilized dataset contains only four observations. The current results only indicate that the obtained predictions are best when all fused information is employed; however, it is difficult to interpret the effects of the different combinations of only a few observations on the model's predictions.
4.3 Outlook and Future Work
In the future, (1) we plan to evaluate the applicability of the proposed deep learning approach in other similar scenarios. Similar scenarios refer to the continuous collection of heterogeneous data, including the relevant location information, from multiple monitoring stations scattered across a study area over time. "Applicability" refers to achieving satisfactory performance for spatiotemporal prediction utilizing fused data. (2) We also plan to employ the k-nearest neighbors (kNN) search algorithm to locally construct the graph-structured data, because in some cases there may be a large number of monitoring points, and using kNN to construct the graph-structured data may significantly reduce the computational cost of the model. (3) We plan to employ other graph convolutional neural networks (e.g., GAT [34], GraphSAGE [35], Diffpool [36]) to capture the spatial correlations among the fused data.
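The kNN-based graph construction mentioned in point (2) can be sketched as follows; `knn_adjacency`, `k`, and `sigma` are hypothetical choices, shown only to illustrate how the fully connected graph could be sparsified.

```python
import numpy as np

def knn_adjacency(coords, k=3, sigma=1.0):
    """Sparse alternative to the fully connected graph: connect each node
    only to its k nearest neighbors, with Gaussian edge weights."""
    d2 = np.sum((coords[:, None, :] - coords[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self-distances
    W = np.zeros_like(d2)
    for i in range(len(coords)):
        nn = np.argsort(d2[i])[:k]               # indices of k nearest nodes
        W[i, nn] = np.exp(-d2[i, nn] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                    # symmetrize the graph

coords = np.random.default_rng(0).uniform(size=(10, 2))  # toy station locations
W = knn_adjacency(coords, k=3)                           # sparse (10, 10) matrix
```

With N nodes, this keeps O(kN) edges instead of the O(N²) edges of a fully connected graph, which is the computational saving point (2) refers to.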
5 Conclusion
In this paper, we propose a deep learning method for fusing heterogeneous data collected from multiple monitoring points and using a GCN to predict the future trends of some observations. The proposed method is applied in a real air quality prediction scenario. The essential idea behind the proposed method is to (1) fuse the heterogeneous data based on the locations of the monitoring points, considering their spatial correlations, and (2) perform prediction based on global information (i.e., fused information from all monitoring points in the entire study area) rather than local information (i.e., information from a single monitoring point or several monitoring points in a study area). In the proposed method, (1) a fusion matrix is assembled using RBF-based fusion; (2) a weighted adjacency matrix is constructed using Gaussian similarity functions; (3) a sequence of fused observation vectors is constructed based on the time series information of the above fusion matrix; and (4) the STGCN, fed the above constructed data (i.e., the weighted adjacency matrix and the sequence of fused observation vectors), is employed to predict the future trends of several observations. The results obtained on the real air quality dataset demonstrate that (1) the fused data derived from the RBF-based fusion method achieve satisfactory consistency; (2) prediction models based on fused data perform better than those based on raw data; and (3) the STGCN model achieves the best performance among all compared models. The proposed method is applicable to similar scenarios in which continuous heterogeneous data are collected from multiple monitoring points scattered across a study area. Future work will focus on applying the proposed method in scenarios with large numbers of monitoring points.
Acknowledgments
This research was jointly supported by the National Natural Science Foundation of China (Grant No. 11602235) and the Fundamental Research Funds for the Central Universities (Grant No. 2652018091). The authors would like to thank the editor and the reviewers for their helpful comments and suggestions.
References
 (1) J. Ren, H. Guo, C. Xu, Y. Zhang, Serving at the edge: A scalable iot architecture based on transparent computing, IEEE Network 31 (5) (2017) 96–105. doi:10.1109/MNET.2017.1700030.
 (2) T. Meng, X. Jing, Z. Yan, W. Pedrycz, A survey on machine learning for data fusion, Information Fusion 57 (2020) 115–129. doi:10.1016/j.inffus.2019.12.001.
 (3) D. L. Hall, J. Llinas, An introduction to multisensor data fusion, Proceedings of the IEEE 85 (1) (1997) 6–23. doi:10.1109/5.554205.
 (4) J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A. Ng, Multimodal deep learning, 2011, pp. 689–696.
 (5) J. M. Barcelo-Ordinas, M. Doudou, J. Garcia-Vidal, N. Badache, Self-calibration methods for uncontrolled environments in sensor networks: A reference survey, Ad Hoc Networks 88 (2019) 142–159. doi:10.1016/j.adhoc.2019.01.008.
 (6) S. Gite, H. Agrawal, On context awareness for multisensor data fusion in iot, Advances in Intelligent Systems and Computing 381 (2016) 85–93. doi:10.1007/978813222526310.
 (7) K. Guo, T. Xu, X. Kui, R. Zhang, T. Chi, ifusion: Towards efficient intelligence fusion for deep learning from realtime and heterogeneous data, Information Fusion 51 (2019) 215–223. doi:10.1016/j.inffus.2019.02.008.
 (8) D. Wong, L. Yuan, S. Perlin, Comparison of spatial interpolation methods for the estimation of air quality data, Journal of Exposure Analysis and Environmental Epidemiology 14 (5) (2004) 404–415. doi:10.1038/sj.jea.7500338.
 (9) H.D. He, M. Li, W.L. Wang, Z.Y. Wang, Y. Xue, Prediction of pm2.5 concentration based on the similarity in air quality monitoring network, Building and Environment 137 (2018) 11–17. doi:10.1016/j.buildenv.2018.03.058.
 (10) T. Enebish, K. Chau, B. Jadamba, M. Franklin, Predicting ambient pm2.5 concentrations in ulaanbaatar, mongolia with machine learning approaches, Journal of Exposure Science and Environmental Epidemiology. doi:10.1038/s4137002002578.
 (11) M. Elangasinghe, N. Singhal, K. Dirks, J. Salmond, S. Samarasinghe, Complex time series analysis of pm10 and pm2.5 for a coastal site using artificial neural network modelling and k-means clustering, Atmospheric Environment 94 (2014) 106–116. doi:10.1016/j.atmosenv.2014.04.051.
 (12) D. Maharani, H. Murfi, Deep neural network for structured data: a case study of mortality rate prediction caused by air quality, Vol. 1192, 2019. doi:10.1088/17426596/1192/1/012010.
 (13) V.D. Le, T.C. Bui, S.K. Cha, Spatiotemporal deep learning model for citywide air pollution interpolation and prediction, 2020, pp. 55–62. doi:10.1109/BigComp48618.2020.0099.
 (14) Y. Zhang, Q. Lv, D. Gao, S. Shen, R. Dick, M. Hannigan, Q. Liu, Multigroup encoderdecoder networks to fuse heterogeneous data for nextday air quality prediction, Vol. 2019August, 2019, pp. 4341–4347. doi:10.24963/ijcai.2019/603.
 (15) W. R. Tobler, A computer movie simulating urban growth in the detroit region, Economic geography 46 (sup1) (1970) 234–240. doi:10.2307/143141.
 (16) J. Boyd, Error saturation in gaussian radial basis functions on a finite interval, Journal of Computational and Applied Mathematics 234 (5) (2010) 1435–1441. doi:10.1016/j.cam.2010.02.019.
 (17) U. Von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (4) (2007) 395–416. doi:10.1007/s112220079033z.
 (18) D. Hammond, P. Vandergheynst, R. Gribonval, Wavelets on graphs via spectral graph theory, Applied and Computational Harmonic Analysis 30 (2) (2011) 129–150. doi:10.1016/j.acha.2010.04.005.
 (19) D. Shuman, S. Narang, P. Frossard, A. Ortega, P. Vandergheynst, The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Processing Magazine 30 (3) (2013) 83–98. doi:10.1109/MSP.2012.2235192.
 (20) D. Duvenaud, D. Maclaurin, J. Aguilera-Iparraguirre, R. Gómez-Bombarelli, T. Hirzel, A. Aspuru-Guzik, R. Adams, Convolutional networks on graphs for learning molecular fingerprints, Vol. 2015-January, 2015, pp. 2224–2232.
 (21) Y. Dauphin, A. Fan, M. Auli, D. Grangier, Language modeling with gated convolutional networks, Vol. 2, 2017, pp. 1551–1559.
 (22) B. Yu, H. Yin, Z. Zhu, Spatiotemporal graph convolutional networks: A deep learning framework for traffic forecasting, Vol. 2018July, 2018, pp. 3634–3640. doi:10.24963/ijcai.2018/505.
 (23) H. Wang, Air pollution and meteorological data in Beijing 20162017 (2019). doi:10.7910/DVN/RGWV8X.
 (24) P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine Learning 63 (1) (2006) 3–42. doi:10.1007/s1099400662261.
 (25) H. Drucker, Improving regressors using boosting techniques, Proceedings of the 14th International Conference on Machine Learning 97 (1997) 107–115.
 (26) L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32. doi:10.1023/A:1010933404324.
 (27) J. Friedman, Greedy function approximation: A gradient boosting machine, Annals of Statistics 29 (5) (2001) 1189–1232. doi:10.1214/aos/1013203451.
 (28) T. Chen, C. Guestrin, Xgboost: A scalable tree boosting system, 2016, pp. 785–794. doi:10.1145/2939672.2939785.
 (29) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in pytorch.
 (30) D. Kingma, J. Ba, Adam: A method for stochastic optimization, 2015.
 (31) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780. doi:10.1162/neco.1997.9.8.1735.
 (32) J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555.
 (33) L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, H. Li, Tgcn: A temporal graph convolutional network for traffic prediction, IEEE Transactions on Intelligent Transportation Systems 21 (9) (2020) 3848–3858. doi:10.1109/TITS.2019.2935152.
 (34) P. Veličković, A. Casanova, P. Liò, G. Cucurull, A. Romero, Y. Bengio, Graph attention networks, 2018.
 (35) W. Hamilton, R. Ying, J. Leskovec, Inductive representation learning on large graphs, Vol. 2017December, 2017, pp. 1025–1035.
 (36) R. Ying, C. Morris, W. Hamilton, J. You, X. Ren, J. Leskovec, Hierarchical graph representation learning with differentiable pooling, Vol. 2018December, 2018, pp. 4800–4810. doi:10.5555/3327345.3327389.