1. Introduction
Despite its high practicality, traffic forecasting is a complex task because the speed at each node is strongly influenced by its own historical signals as well as by the conditions of neighboring nodes. To handle spatial-temporal datasets, recent studies (LiYS018; YuYZ18; WuPLJZ19; ZhengFW020; ParkLBTJKKC20; itr2.12044) introduce deep-learning-based models for traffic forecasting and incorporate the graph structure into the training process. While these studies have made great progress in the traffic forecasting task, little attention has been given to analyzing the errors of traffic forecasting models.
In this work, we analyze the errors of traffic forecasting models and observe that recent models still produce relatively large errors in certain patterns despite their high average performance. While these failures are usually considered unpredictable, we found that we can estimate how models will fail using current errors. In the real world, it is reasonable to presume that correlations exist among successive errors (i.e., failures). However, previous errors have been ignored in existing deep-learning-based traffic forecasting methods. Based on our findings, we highlight the necessity of utilizing previous errors in traffic forecasting. To explicitly handle the correlation of errors, we utilize the historical errors of predictions, i.e., residuals, to make the next prediction. This can correct the failures of traffic forecasting models and improve their performance in unexpected situations. Consequently, the critical mistakes that matter most in a real-time setting can be minimized.
In addition to the previous predictions within a single sensor node, the previous predictions of the neighboring nodes are also highly correlated with the current prediction of each sensor node. To consider both spatial and temporal residual correlations, we propose a simple residual estimation module called ResCAL that estimates the expected residuals in the current prediction, i.e., how forecasting models will fail. Our method adopts spatial-temporal layers composed of a gated temporal convolutional network (Gated TCN) and a graph convolutional layer (GCN), as proposed in (WuPLJZ19). Furthermore, we analyze the patterns of failures since high errors occur in certain patterns. To this end, we introduce vector quantization in ResCAL and provide justification for the calibrations. Vector quantization also allows our method to handle the noise and outliers that appear in residuals.
Fig. 1 depicts how ResCAL corrects the failures in the real-time setting. Several attempts have been made to consider the residuals in graph classification (JiaB20; HuangHSLB21); however, to the best of our knowledge, this study is the first attempt to capture the temporal correlation of residuals in traffic forecasting. In our experiments, we confirm that the residuals of each node are highly correlated with its previous residuals as well as with those of neighboring nodes. We first introduce a simple synthetic dataset and validate the correctness of our ResCAL both qualitatively and quantitatively. Subsequently, we conduct extensive experiments on the two most representative traffic datasets: METR-LA and PEMS-BAY. On both datasets, we calibrate the predictions from various traffic forecasting models such as STGCN (YuYZ18), DCRNN (LiYS018), Graph WaveNet (WuPLJZ19), and STAWnet (itr2.12044). Specifically, we focus on whether our ResCAL accurately calibrates the predictions in the time ranges where the existing models struggle. Despite its simplicity, our ResCAL consistently improves performance in event situations. Along with correcting failures, we validate the effectiveness of the vector quantization approach with a qualitative analysis. In this analysis, we verify that our ResCAL can provide meaningful justification for the calibration by examining the residual patterns in the unobserved data.
Our contributions can be summarized as follows:

We highlight the importance of utilizing the historical predictions to explicitly handle the correlation among errors that occur in a real-time setting.

We propose a novel method called ResCAL as a widely applicable add-on module, which estimates the future residuals and corrects the failures of models.

Extensive experiments demonstrate that our ResCAL consistently improves the baselines in event situations while significantly reducing the correlation of residuals.
2. Related Works
2.1. Traffic Forecasting
Traffic forecasting is a challenging task due to the complicated spatial and temporal dependencies among the sensor nodes. To capture the dynamics of traffic conditions, data-driven approaches based on deep learning have received considerable attention. In spatial and temporal modeling, approaches based on graph convolutional networks (GCNs) (KipfW17; ZhangSXMKY18; AtwoodT16) are promising in capturing spatial dependencies among roads. Several studies (ZhangZQLY16; ChengZZX18; WuT16) have proposed applying recurrent neural networks (RNNs) and convolutional neural networks (CNNs) to capture the temporal dependencies along the sequence in traffic forecasting.
Recent studies (LiYS018; YuYZ18; WuPLJZ19; ZhengFW020; ParkLBTJKKC20; itr2.12044) have shown that graph modeling is a key factor in achieving state-of-the-art performance in traffic forecasting. In particular, DCRNN (LiYS018) demonstrates impressive performance against statistical approaches by incorporating diffusion graph convolutional neural networks (AtwoodT16) into RNNs. STGCN (YuYZ18) utilizes only convolution-based approaches for modeling spatial and temporal dependencies in the road network. Graph WaveNet (WuPLJZ19) introduces a self-adaptive adjacency matrix to overcome the limitation of applying fixed adjacency information and uses dilated convolutions (OordDZSVGKSK16) to efficiently handle long sequences. As another approach, STAWnet (itr2.12044) applies a self-attention mechanism (VaswaniSPUJGKP17) to capture spatial dependencies between roads and uses self-learned node embeddings to eliminate the need for prior knowledge of the graph structure. While these studies have shown remarkable progress in solving the traffic forecasting problem, methods that consider the errors of the forecasting models remain underexplored. Within this line of traffic forecasting research, our work can improve the performance of the existing forecasting models by calibrating their predictions in real time.
2.2. Residual Correction
In statistical approaches, a residual is defined as the difference between the ground truth value and the predicted one. Statistical time-series models such as the autoregressive model and the moving-average model represent future values as a linear combination of observed values and residuals at previous time steps. The autoregressive integrated moving average (ARIMA) model (box1976time) considers residuals with autoregressive terms. However, these approaches struggle to express nonlinearity because the residual term is expressed as a finite linear combination of a white-noise sequence.
Fitting residuals in the regression problem is an effective technique in machine learning. For example, the gradient boosting algorithm (friedman2001greedy) and its variants, such as XGBoost (chen2015xgboost) and LightGBM (KeMFWCMYL17), recursively capture the residuals to improve the performance. Moreover, graph neural network architectures that model residual correlation have also been proposed. (JiaB20) suggested modeling the joint distribution of the residuals to obtain information from both the input features and the output correlation in the graph structure. (HuangHSLB21) proposes a procedure called “Correct and Smooth”, which models error correlation to correct the base prediction from simple architectures such as a multilayer perceptron (MLP). In contrast to these works, our ResCAL considers both spatial and temporal residuals to enhance the performance of traffic forecasting models.
2.3. Discrete Representations
Utilizing a discrete representation can provide interpretability and reduce the noise in the given data (ChenCDHSSA16; OordVK17; JangGP17). One of the discretization approaches is vector quantization (gray1984vector), which was suggested as a system for mapping a signal into a digital sequence. VQ-VAE (OordVK17) utilizes this quantization mechanism to model the categorical distribution, and the latent variable is represented as a combination of the embedding vectors. The vector quantization layer formulates a latent space, called a codebook, and clusters vectors according to a given distance metric (e.g., L2 distance). To learn a discrete representation with the backpropagation algorithm, (JangGP17) suggests Gumbel-Softmax, which can approximate categorical samples with a differentiable sampling mechanism. In this work, we utilize a discrete representation to enhance the interpretability of the calibration module and to reduce the noise introduced in traffic forecasting.
3. Residual Diagnostics in Traffic Forecasting
High Errors in Traffic Forecasting. In the real world, a few critical errors have a huge influence on traffic conditions. Therefore, predicting such errors is essential for the traffic forecasting task. On a widely used traffic dataset, METR-LA (LiYS018), we observed that top errors account for about of the total absolute errors, e.g., for STGCN and for DCRNN. In traffic dynamics, these abnormal cases are mainly caused by sparse events. In this work, we specifically focus on time steps with top errors and denote them as event situations since a large magnitude of error indicates the failure of the prediction.
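Residual Autocorrelation in Traffic Forecasting. Autocorrelation is the correlation between a time series and a delayed copy of itself. Given an input sequence $\{x_t\}_{t=1}^{T}$, the autocorrelation function (ACF) measures the degree of the linear relationship between $x_t$ and $x_{t-k}$, where $k$ is a time lag:
$$\rho_k = \frac{\sum_{t=k+1}^{T} (x_t - \bar{x})(x_{t-k} - \bar{x})}{\sum_{t=1}^{T} (x_t - \bar{x})^2} \tag{1}$$
where $\bar{x}$ denotes the mean of the sequence.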
High autocorrelation indicates a high potential for performance improvement; any forecasting model whose residuals are correlated or have a nonzero mean can be improved by estimating the future residuals (hyndman2018forecasting). The residuals of traffic forecasting models are commonly autocorrelated; therefore, we explicitly capture this relationship to further enhance the model performance.
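As a minimal sketch of this diagnostic, the snippet below computes the standard sample ACF of a residual sequence in plain Python. The function name `acf` and the toy sequences are illustrative assumptions, not part of the original experiments.

```python
def acf(x, max_lag):
    """Sample autocorrelation of the sequence x for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    # Lag-0 sum of squared deviations, used as the normaliser for every lag.
    denom = sum((v - mean) ** 2 for v in x)
    out = []
    for k in range(max_lag + 1):
        num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
        out.append(num / denom)
    return out


# Hypothetical residuals: a slowly drifting sequence is positively correlated
# at lag 1, whereas an alternating one is negatively correlated.
drifting = acf([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=1)
alternating = acf([1.0, -1.0, 1.0, -1.0], max_lag=1)
```

Values far from zero at small lags suggest that the residuals still carry predictable structure.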
Fig. 2 (b) shows the ACF of residuals for the traffic forecasting models. Before calibration, the residuals have a high autocorrelation, meaning that predictable information remains in the residuals. Fig. 2 (c) represents the Pearson correlation between the current residual of each sensor node and the previous residuals of neighboring sensor nodes on METR-LA. The bright points in the heatmap of the existing forecasting models show that the residuals in the neighboring sensor nodes are highly correlated. By calibrating the predictions, the correlation among the residuals can be significantly reduced, as shown in the ACF plot and the heatmap.
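The spatial diagnostic above can be sketched in the same way: the Pearson correlation between one node's residual at time t and a neighbor's residual at time t - lag. The helper names and the toy residuals are hypothetical and serve only to illustrate the computation.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def lagged_residual_corr(res_i, res_j, lag=1):
    """Correlation between node i's residual at t and node j's at t - lag."""
    return pearson(res_i[lag:], res_j[:len(res_j) - lag])


# Toy case: node i repeats node j's residual one step later.
res_j = [0.0, 1.0, 0.0, 2.0, 1.0, 3.0]
res_i = [9.0] + res_j[:-1]
corr = lagged_residual_corr(res_i, res_j, lag=1)
```

Evaluating this quantity for every pair of neighboring sensors yields a heatmap of the kind described for Fig. 2 (c).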
4. Residual Correction in Traffic Forecasting
In this section, we introduce a simple model-agnostic framework to boost the performance of traffic forecasting models in the real-time setting. To this end, we first formally describe the problem setting considered in traffic forecasting. Next, the model architecture and the residual prediction mechanism of our ResCAL are introduced.
4.1. Real-Time Traffic Forecasting
The aim of traffic forecasting is to predict the future traffic speed observed at correlated sensors on the road network. Let be a graph representing the spatial relationship between sensors where is a set of nodes () and is a set of edges. Following the convention of the traffic prediction problem, is denoted as the graph signal obtained at time ; and represent the speed and the timestamp features, respectively. The goal of traffic forecasting is to learn a mapping function from past graph signals and a graph to future traffic speeds:
(2) 
where and are the input and output sequence lengths, respectively. In particular, we consider a real-time traffic forecasting problem in which the current prediction of the model can be corrected using continuous historical predictions. Let be the ground truth speed so that represents the ground truth speed of step ahead prediction at time , and is its estimated value, i.e., . Then, the residual of traffic prediction is defined as . Note that the ground truth of can be observed when time is greater than or equal to . is denoted as the newly observed residuals at time . Here, our main problem is learning a mapping function which predicts the residual at time , given past graph signals, observed residuals, and a graph :
(3) 
where is the time steps of the observed residuals. Given the estimated residuals , accurate predictions can be made by taking as a final output. Fig. 3 illustrates our problem setting.
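To make the setting concrete, the following sketch works through a single sensor. It assumes the sign convention from Section 2.2 (residual = ground truth minus prediction), so that the calibrated output is the base prediction plus the estimated residual. The data layout (`preds[s][h - 1]` as the h-step-ahead prediction issued at time s) and the function names are illustrative assumptions.

```python
def observed_residuals(preds, truth, t):
    """Residuals (truth - prediction) whose target time s + h is at most t.

    preds[s][h - 1] is the h-step-ahead prediction issued at time s, and
    truth[u] is the speed observed at time u.
    """
    out = {}
    for s, horizon_preds in preds.items():
        for h, p in enumerate(horizon_preds, start=1):
            if s + h <= t:  # the ground truth for this target is already known
                out[(s, h)] = truth[s + h] - p
    return out


def calibrate(prediction, estimated_residual):
    """Final output: base prediction plus estimated residual, per horizon."""
    return [p + r for p, r in zip(prediction, estimated_residual)]


truth = [60.0, 58.0, 55.0, 50.0, 52.0]
preds = {0: [59.0, 57.0, 54.0], 1: [56.0, 52.0, 50.0]}
res = observed_residuals(preds, truth, t=3)
final = calibrate([50.0, 52.0], [-2.0, 1.5])
```

In this toy example, the 3-step-ahead prediction issued at time 1 targets time 4, so its residual is not yet available at time 3.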
Our proposed problem setting in real-time traffic forecasting has several advantages: (i) The correlation between successive residuals can be explicitly modeled. One could instead enlarge the input window of the time series model to handle this correlation. However, this can be tricky depending on the model design, and retraining is time-consuming due to the high complexity of the model. (ii) As most of the complex modeling is done by , we can learn with a relatively lightweight model to estimate the residuals. This allows hyperparameter tuning with a small budget and increases the reusability of the model in real-world situations. (iii) Since information about is not needed to predict the residuals, the performance of the base forecasting model can be improved in a model-agnostic way.
4.2. ResCAL
To explicitly model the residuals, we designed a lightweight version of the traffic forecasting model called ResCAL, consisting of spatial-temporal layers with the self-adjacency matrix, as proposed by (WuPLJZ19). The overall architecture of our ResCAL is depicted in Fig. 4. The encoder consists of spatial-temporal layers, each composed of a gated temporal convolutional layer (Gated TCN) and a graph convolutional layer (GCN). Following the conventional setting of the base traffic forecasting models, we let . The encoder produces a latent by taking both a graph signal and a residual :
(4) 
where Concat is a concatenation operation along the second axis and is the number of hidden dimensions.
Here, we want to make a more accurate prediction by capturing useful information from and simultaneously provide a reason for the judgment. However, it is difficult to directly analyze since we do not put any strict restrictions on generating . Moreover, the noise introduced by the baseline model or data further hinders the analysis of . Previous works (OordVK17; JangGP17) introduce discrete latent vectors into an autoencoder to increase the interpretability and stabilize training on noisy data. However, we observe that using only these discrete forms sometimes reduces the overall performance of the model. Instead, we use a hybrid approach that combines discrete and continuous representations. We take the sum of the outputs of two different branches from : the regression branch and the quantization branch. The regression branch provides a continuous representation for the accurate estimation of the residual, while the quantization branch provides interpretable and denoised information using a discrete representation. The Straight-Through (ST) Gumbel estimator (JangGP17) is used to provide the differentiable discrete variable given an input :
(5)
where are i.i.d. samples drawn from the standard Gumbel distribution. This variable is smoothly approximated in the backward pass. Formally, with the learnable embedding vector , the quantized vector , the estimated future residual , and the output of ResCAL is calculated as:
(6)  
where is the number of categories, is the number of categorical variables, is the dimension of the embedding matrix, and for ; , , and are the output layer, regression layer, and quantization layer, respectively. Each layer consists of a combination of a pointwise convolution and a ReLU activation.
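The following sketch illustrates the forward pass of the ST Gumbel estimator for a single categorical variable, followed by the hybrid sum of the regression and quantization branches. It is a simplified illustration under stated assumptions rather than the ResCAL implementation: the temperature `tau`, the toy codebook, and all function names are hypothetical, and the gradient path (which would use the relaxed sample `soft`) is not shown.

```python
import math
import random


def st_gumbel_onehot(logits, rng, tau=1.0):
    """Forward pass of the Straight-Through Gumbel-Softmax estimator."""
    # Gumbel(0, 1) noise: g = -log(-log(u)), with u clipped away from 0.
    noise = [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in logits]
    perturbed = [(l + g) / tau for l, g in zip(logits, noise)]
    top = max(perturbed)
    exps = [math.exp(p - top) for p in perturbed]  # numerically stable softmax
    total = sum(exps)
    soft = [e / total for e in exps]  # relaxed sample (used for gradients)
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]  # discrete sample
    return hard, soft


def hybrid_latent(regression_out, onehot, codebook):
    """Sum of the continuous branch and the selected codebook embedding."""
    k = onehot.index(1.0)
    return [r + e for r, e in zip(regression_out, codebook[k])]


rng = random.Random(0)
hard, soft = st_gumbel_onehot([0.0, 50.0, 0.0], rng)
codebook = [[1.0, 1.0], [2.0, 3.0], [0.0, 0.0]]  # toy embeddings
latent = hybrid_latent([0.5, -0.5], hard, codebook)
```

The index of the selected codebook entry is the discrete code that can later be inspected to group similar residual patterns.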
By predicting the residuals, we can easily calibrate the models and further improve the existing traffic forecasting models. In real-time prediction, the previous residuals produced by the prediction model as well as the current traffic data are used as the input, and the future residuals are predicted. The clustering results can also be used to interpret the behavior of the model. In Section 6, we will show that the baselines fail similarly for similar events. Using the predicted residuals, we can now calibrate the output of the prediction model. By calibrating the future prediction with the current errors, we can consider the temporal dependency between residuals in traffic forecasting. Our experiments show that such improvements cannot be achieved solely by increasing the capacity of the base forecasting model with additional parameters or by utilizing longer input sequences without considering the temporal dependencies of the residuals as our ResCAL does.
5. Experimental Settings
Table 1. Description of the datasets.
Data  # Nodes  # Edges  # Time steps
Synthetic  -  -  10000
METR-LA  207  1515  34272
PEMS-BAY  325  2369  52116
5.1. Traffic Dataset
We examined our ResCAL on two representative traffic datasets, PEMS-BAY and METR-LA, introduced by (LiYS018). PEMS-BAY, collected by the California Transportation Agencies (CalTrans) Performance Measurement System (PeMS), contains data from 325 sensors in the Bay Area. In our experiments, 6 months of data collected from Jan 1st, 2017 to May 31st, 2017 was selected. METR-LA contains the traffic speed data recorded by 207 sensors on the highways of Los Angeles County (JagadishGLPPRS14). The sensor locations of METR-LA and PEMS-BAY are shown in Fig. 5. Both datasets were preprocessed following (LiYS018). For METR-LA, 4 months of data collected from Mar 1st, 2012 to Jun 30th, 2012 was selected for our experiments. For both datasets, the traffic speed is aggregated into 5-minute windows, and Z-score normalization is applied to the input. As proposed in (ShumanNFOV13), the adjacency matrix of both datasets was constructed using road distances with a thresholded Gaussian kernel. A detailed description of each dataset is provided in Table 1.
5.2. Baseline Methods
To broadly validate the correctness and effectiveness of our ResCAL, we examined several baselines commonly used for traffic forecasting. We additionally provide the reported results for basic models such as ARIMA and FC-LSTM to help compare the performance of various models. In this work, we choose the mean squared error (MSE) for training the base models. For all baselines, the PyTorch implementation listed with each baseline was utilized, and the default training strategies were followed.

ARIMA. The autoregressive integrated moving average model with a Kalman filter, which is the most representative regression model for time series data.

FC-LSTM (SutskeverVL14). Basic deep-learning-based regression model for time series data, composed of long short-term memory (LSTM) (HochreiterS97) and fully-connected layers.

DCRNN (LiYS018). Diffusion convolutional recurrent neural network consisting of graph convolutional networks and recurrent neural networks. Implementation: github.com/chnsh/DCRNN_PyTorch

STGCN (YuYZ18). Spatial-temporal graph convolutional network composed of a graph convolutional layer and 1D convolutional layers. Implementation: github.com/FelixOpolka/STGCNPyTorch

Graph WaveNet (WuPLJZ19). Traffic forecasting model which combines dilated 1D convolutional layers and graph convolutional networks. Implementation: github.com/nnzhan/GraphWaveNet

STAWnet (itr2.12044). Spatial-temporal attention network with temporal convolution and a spatial attention mechanism to capture dynamic spatial dependencies. Implementation: github.com/CYBruce/STAWnet
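Table 2. Results of Seq2seq with and without ResCAL calibration on the synthetic dataset.
Time steps  Models  MAE  RMSE  MAPE
1 step  Seq2seq  0.029  0.047  7.37%
1 step  Seq2seq + Calibration  0.015  0.022  2.95%
6 step  Seq2seq  0.061  0.153  20.34%
6 step  Seq2seq + Calibration  0.025  0.093  10.29%
12 step  Seq2seq  0.123  0.294  51.70%
12 step  Seq2seq + Calibration  0.067  0.205  27.56%
24 step  Seq2seq  0.174  0.367  80.64%
24 step  Seq2seq + Calibration  0.122  0.291  53.47%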
Table 3. Calibration results in event situations on METR-LA and PEMS-BAY.
Data  Models  15min MAE  15min RMSE  15min MAPE  30min MAE  30min RMSE  30min MAPE  60min MAE  60min RMSE  60min MAPE  Average MAE  Average RMSE  Average MAPE
METR-LA  DCRNN (LiYS018)  14.53  16.31  48.63%  16.98  18.93  60.79%  20.03  22.00  75.27%  17.18  19.08  61.56%
METR-LA  DCRNN + Calibration  13.46  15.44  44.22%  16.26  18.38  57.37%  19.45  21.54  73.13%  16.39  18.46  58.24%
METR-LA  STGCN (YuYZ18)  15.29  16.93  53.19%  17.94  19.78  65.91%  21.17  23.00  82.76%  18.13  19.90  67.29%
METR-LA  STGCN + Calibration  12.77  15.07  43.55%  15.79  18.18  57.50%  19.10  21.44  73.93%  15.88  18.23  58.33%
METR-LA  Graph WaveNet (WuPLJZ19)  13.39  15.06  41.49%  16.15  17.99  54.92%  19.08  20.93  69.24%  16.21  17.99  55.22%
METR-LA  Graph WaveNet + Calibration  13.28  14.99  41.45%  16.03  17.93  54.43%  18.94  20.85  68.34%  16.08  17.93  54.74%
METR-LA  STAWnet (itr2.12044)  13.56  15.30  43.75%  16.15  18.10  56.00%  19.00  21.00  68.44%  16.24  18.13  56.06%
METR-LA  STAWnet + Calibration  13.17  14.98  42.24%  15.87  17.89  54.88%  18.80  20.84  68.21%  15.95  17.90  55.11%
PEMS-BAY  DCRNN (LiYS018)  4.41  6.07  10.51%  5.95  8.45  15.36%  7.50  10.53  20.39%  5.95  8.35  15.42%
PEMS-BAY  DCRNN + Calibration  4.25  5.93  10.07%  5.65  8.15  14.37%  6.94  10.00  18.60%  5.61  8.03  14.34%
PEMS-BAY  STGCN (YuYZ18)  6.42  7.95  15.88%  7.50  9.44  19.39%  8.74  11.03  23.50%  7.56  9.47  19.59%
PEMS-BAY  STGCN + Calibration  3.88  5.91  9.55%  5.41  8.00  14.36%  6.84  9.78  18.94%  5.38  7.90  14.29%
PEMS-BAY  Graph WaveNet (WuPLJZ19)  4.36  5.89  10.45%  5.72  7.97  14.46%  6.89  9.50  18.07%  5.66  7.79  14.33%
PEMS-BAY  Graph WaveNet + Calibration  4.28  5.87  10.07%  5.63  7.93  14.21%  6.81  9.46  18.02%  5.57  7.75  14.10%
PEMS-BAY  STAWnet (itr2.12044)  4.38  5.96  10.34%  5.71  7.91  14.51%  6.75  9.29  18.07%  5.61  7.72  14.31%
PEMS-BAY  STAWnet + Calibration  4.27  5.87  10.07%  5.62  7.88  14.23%  6.70  9.27  17.77%  5.53  7.67  14.02%

5.3. Training Settings
For both METR-LA and PEMS-BAY, we used the same training strategy for simplicity. spatial-temporal layers were used for the encoder, and , , and were set for the quantization branch in the decoder. Since the baselines predict the next steps in units of 5 minutes and each step has errors of 12 horizons, the input residuals and predictions have a size of and , respectively. Our ResCAL is trained with the mean absolute error (MAE) and a batch size of 256. An Adam optimizer with a learning rate of 0.001, and is also used. Each dataset is split into a training set, validation set, and test set with a ratio of 7:1:2, and the model with the best validation score is selected in all experimental evaluations.
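For reference, the MAE used as the training loss and the RMSE and MAPE reported in the result tables can be computed as in the sketch below. Skipping zero-valued ground truth in MAPE is an assumption made here to avoid division by zero (zeros commonly mark missing readings in traffic benchmarks); the source does not state how such entries are handled.

```python
import math


def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def mape(y, yhat):
    """Mean absolute percentage error (in %), ignoring zero ground truth."""
    terms = [abs(a - b) / abs(a) for a, b in zip(y, yhat) if a != 0]
    return 100.0 * sum(terms) / len(terms)


# Toy speeds; the third ground-truth value is zero and is skipped by mape.
y_true = [50.0, 40.0, 0.0, 20.0]
y_pred = [45.0, 44.0, 3.0, 20.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```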
6. Experimental Results
6.1. Synthetic Dataset
To validate the correctness of our ResCAL, we first construct a simple synthetic dataset in which similar events occur at random time steps. This reflects the nature of traffic data, in which congestion propagates similarly in cases of accidents.
Concretely, the synthetic dataset with the lengths of time steps contains a periodic sine wave with a randomly generated zero signal, as depicted in Fig. 6 (a). The period of the sine wave is set to 50 steps and each period with zero values is randomly substituted with a probability of to reflect the traffic dynamics. Similar to the traffic datasets, the synthetic dataset is divided into three parts: the training set, validation set, and test set with a ratio of 7:1:2. With our synthetic dataset, we examine the correctness of the following two assumptions essential for residual correction: (i) a deep-learning-based prediction model likely generates similar errors in similar types of events, and (ii) when the residuals are correlated, it is possible to improve the performance of the base prediction model by estimating the residuals that will occur, i.e., how the model fails.
For the base prediction model, we build a simple sequence-to-sequence model (Seq2seq). Seq2seq is designed to take an input sequence of length 24 and generate predictions for the next 24 steps. The encoder and decoder of Seq2seq are built with GRU units with a hidden feature size of 128 and a single recurrent layer. The output of the decoder is passed to a multilayer perceptron (MLP) consisting of ReLU activations and fully connected layers of size 128-16-1. The model is trained using an Adam optimizer (KingmaB14) with a learning rate of 0.001, , , and a batch size of 100. The model is trained for 50 epochs and Z-score normalization is applied to preprocess the input.
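The synthetic signal described in this subsection can be generated along the lines of the sketch below. The substitution probability is not recoverable from the text, so `p_zero=0.1` is a placeholder assumption, as are the function name and the seed. The length of 10000 follows Table 1.

```python
import math
import random


def make_synthetic(length=10000, period=50, p_zero=0.1, seed=0):
    """Sine wave in which whole periods are randomly replaced by zeros.

    p_zero is a placeholder: the probability used in the paper is not
    given in the surviving text.
    """
    rng = random.Random(seed)
    signal = []
    for start in range(0, length, period):
        zero_out = rng.random() < p_zero  # one decision per period
        for t in range(start, min(start + period, length)):
            value = 0.0 if zero_out else math.sin(2 * math.pi * t / period)
            signal.append(value)
    return signal


series = make_synthetic()
n_train = int(0.7 * len(series))  # 7:1:2 split, kept in temporal order here
n_val = int(0.1 * len(series))
train = series[:n_train]
val = series[n_train:n_train + n_val]
test = series[n_train + n_val:]
```

Splitting in temporal order is also an assumption of this sketch; the source only specifies the 7:1:2 ratio.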
Fig. 6 (a) shows the predictions of the Seq2seq model, the estimated residuals from our ResCAL, and the calibrated prediction, respectively. As expected under the first assumption, the prediction model always made similar errors for similar types of events. This implies that the residuals do not occur randomly and are therefore predictable. To assess the second assumption, we trained our ResCAL to predict the residuals occurring in Seq2seq. For the synthetic dataset without the graph structure, the spatial-temporal layers in the encoder were replaced with 1D convolutional layers. We set , , and for the quantization branch in the decoder. For training, an Adam optimizer with a learning rate of 0.001 and a batch size of 128 was used. Since the length of Seq2seq is steps, our ResCAL gets residuals with the original time series data for the input and outputs the size of the residuals. Table 2 shows the results of Seq2seq with and without our ResCAL. Our ResCAL greatly enhances the performance of Seq2seq at every step of the predictions. This indicates that our ResCAL accurately predicts the residuals that will occur. Notably, our ResCAL reduces the MAE of Seq2seq by 0.014, 0.036, 0.056, and 0.052 for 1 step, 6 steps, 12 steps, and 24 steps, respectively. Through this simple simulation, we validate that our assumptions and the proposed method are reasonable for time series data. The calibrated results in Fig. 6 (a) demonstrate that repeated errors can be estimated by our ResCAL. Moreover, the experiments on the synthetic data show that ResCAL allows the model to rapidly adapt to unexpected changes.
6.2. Traffic Dataset
In this section, we demonstrate that our ResCAL can correct the failures of the existing traffic forecasting models on the METR-LA and PEMS-BAY datasets. Our ResCAL is trained with the settings described in Section 5.
Residual Correction in Event Situations. To reflect the nature of the real-world setting, we examine our ResCAL in regions where critical errors occur. Table 3 shows the calibration results on time steps where the absolute error of the base forecasting model falls within the top . We observe large gaps in performance before and after calibration. In the event situations on METR-LA, our ResCAL improves the average performance of DCRNN by 0.79, 0.62, and 3.32% for MAE, RMSE, and MAPE, respectively. Even for STAWnet, the most advanced traffic forecasting model, our ResCAL shows an average improvement of 0.29, 0.23, and 0.95% for MAE, RMSE, and MAPE on METR-LA, respectively. Consequently, our ResCAL can calibrate the prediction models more effectively in the cases of critical errors, which highlights the practicality of our ResCAL in real-time traffic forecasting.
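The event-situation evaluation can be sketched as follows: select the time steps where the base model's absolute error is largest, then compare the MAE before and after calibration on those steps only. The fraction `top_frac` is left as a parameter because the threshold is not recoverable from the text; the function name and the toy values are illustrative assumptions.

```python
def event_mae(truth, base_pred, calibrated_pred, top_frac):
    """MAE of base and calibrated predictions on the top-error time steps."""
    errors = [abs(y - p) for y, p in zip(truth, base_pred)]
    n_event = max(1, int(len(errors) * top_frac))
    # Indices of the time steps with the largest base-model errors.
    ranked = sorted(range(len(errors)), key=errors.__getitem__, reverse=True)
    event_idx = ranked[:n_event]
    base = sum(errors[i] for i in event_idx) / n_event
    calib = sum(abs(truth[i] - calibrated_pred[i]) for i in event_idx) / n_event
    return base, calib


# Toy example with a sudden speed drop at index 3.
truth = [60.0, 60.0, 60.0, 20.0, 60.0]
base_pred = [59.0, 61.0, 60.0, 50.0, 58.0]
calibrated = [59.0, 61.0, 60.0, 30.0, 59.0]
base_mae, calib_mae = event_mae(truth, base_pred, calibrated, top_frac=0.4)
```

Note that the event set is defined by the errors of the base model, so both numbers are computed on the same time steps.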
As shown in Fig. 2, our ResCAL can accurately estimate the residuals and drastically reduce the autocorrelation of residuals both temporally and spatially. Fig. 6 (b) shows the qualitative results of prediction models with and without our proposed ResCAL on METR-LA. When the speed drops rapidly (i.e., an anomalous event occurs), STGCN and DCRNN cannot adapt to the changes; thus, they show poor prediction performance in the changed regions (orange lines). For the same case, our ResCAL successfully captures what those models tend to overestimate and corrects them accurately (green lines). Consequently, as a model-agnostic add-on module, our proposed ResCAL allows the prediction model to adapt to the fluctuation of data and consistently improves prediction performance regardless of the prediction model.
Table 4. Calibration results in overall situations on METR-LA and PEMS-BAY.
Data  Models  15min MAE  15min RMSE  15min MAPE  30min MAE  30min RMSE  30min MAPE  60min MAE  60min RMSE  60min MAPE
METR-LA  ARIMA  3.99  8.21  9.60%  5.15  10.45  12.70%  6.90  13.23  17.40%
METR-LA  FC-LSTM  3.44  6.30  9.60%  3.77  7.23  10.90%  4.37  8.69  13.20%
METR-LA  DCRNN (LiYS018)  3.17  5.53  8.28%  3.53  6.33  9.63%  4.02  7.32  11.40%
METR-LA  DCRNN + Calibration  3.05  5.32  7.81%  3.44  6.20  9.25%  3.98  7.21  11.23%
METR-LA  STGCN (YuYZ18)  3.33  5.77  8.92%  3.74  6.65  10.39%  4.32  7.72  12.52%
METR-LA  STGCN + Calibration  2.98  5.28  7.73%  3.47  6.26  9.46%  4.10  7.36  11.62%
METR-LA  GWNet (WuPLJZ19)  2.88  5.10  7.30%  3.33  6.03  8.93%  3.91  7.01  10.84%
METR-LA  GWNet + Calibration  2.88  5.09  7.30%  3.33  6.02  8.85%  3.87  6.99  10.66%
METR-LA  STAWnet (itr2.12044)  2.87  5.15  7.44%  3.28  6.03  8.94%  3.78  6.97  10.55%
METR-LA  STAWnet + Calibration  2.85  5.08  7.30%  3.27  5.98  8.81%  3.77  6.94  10.56%
PEMS-BAY  ARIMA  1.62  3.30  3.50%  2.33  4.76  5.40%  3.38  6.50  8.30%
PEMS-BAY  FC-LSTM  2.05  4.19  4.80%  2.20  4.55  5.20%  2.37  4.96  5.70%
PEMS-BAY  DCRNN (LiYS018)  1.40  2.80  2.95%  1.79  3.87  4.04%  2.21  4.81  5.22%
PEMS-BAY  DCRNN + Calibration  1.37  2.75  2.87%  1.76  3.77  3.91%  2.16  4.64  5.00%
PEMS-BAY  STGCN (YuYZ18)  2.15  3.75  4.57%  2.42  4.40  5.35%  2.76  5.12  6.31%
PEMS-BAY  STGCN + Calibration  1.46  2.87  3.07%  1.85  3.79  4.16%  2.27  4.61  5.29%
PEMS-BAY  GWNet (WuPLJZ19)  1.37  2.72  2.90%  1.72  3.65  3.82%  2.03  4.35  4.67%
PEMS-BAY  GWNet + Calibration  1.36  2.72  2.84%  1.71  3.64  3.79%  2.03  4.33  4.68%
PEMS-BAY  STAWnet (itr2.12044)  1.37  2.75  2.88%  1.72  3.63  3.83%  2.01  4.25  4.68%
PEMS-BAY  STAWnet + Calibration  1.36  2.72  2.86%  1.71  3.62  3.81%  2.01  4.25  4.64%

Residual Correction in Overall Situations. While our residual correction clearly shows its effectiveness in situations where residuals are correlated, we also examine its performance when the existing models already predict speeds correctly. Here, we examine our method in overall situations for both METR-LA and PEMS-BAY to demonstrate its consistency. Table 4 shows that our ResCAL consistently improves the baselines on METR-LA and PEMS-BAY. This indicates that deep-learning-based models generate correlated residuals even with their large number of parameters, and that performance improvement in a high-error region also leads to improvement in overall situations. For STGCN, our ResCAL achieves average improvements of 0.28 in MAE, 0.41 in RMSE, and 1.01% in MAPE on METR-LA. Even with the most recent model, STAWnet, our ResCAL shows improvements for MAE, RMSE, and MAPE on both METR-LA and PEMS-BAY.
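The average improvements quoted for STGCN on METR-LA follow from the Table 4 entries by averaging the per-horizon differences over the 15, 30, and 60 minute horizons. The short check below reproduces that arithmetic; the dictionary layout is only for illustration.

```python
# STGCN on METR-LA (Table 4), ordered as 15min, 30min, 60min.
stgcn = {
    "MAE": [3.33, 3.74, 4.32],
    "RMSE": [5.77, 6.65, 7.72],
    "MAPE": [8.92, 10.39, 12.52],
}
stgcn_calibrated = {
    "MAE": [2.98, 3.47, 4.10],
    "RMSE": [5.28, 6.26, 7.36],
    "MAPE": [7.73, 9.46, 11.62],
}

# Mean reduction per metric across the three horizons.
avg_gain = {
    metric: sum(b - c for b, c in zip(stgcn[metric], stgcn_calibrated[metric])) / 3
    for metric in stgcn
}
rounded = {metric: round(gain, 2) for metric, gain in avg_gain.items()}
```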
Running Time Analysis. The average computation time required for calibrating the outputs of the base models (e.g., Graph WaveNet) was measured in real-time inference. The test was run on METR-LA, where 12 sequences of Graph WaveNet outputs were calibrated. A PC with an Intel Xeon Silver 4210R 2.40 GHz CPU and a Titan RTX GPU was used in our analysis. On average, the calibration was performed within 9.05 ms. That is, ResCAL is fast enough for practical usage. Note that the inference time of our ResCAL does not depend on the base model.
6.3. Pattern Analysis and Interpretability
Here, we show the qualitative results for the residuals assigned to the same categorical variables to examine the role of the quantization branch of ResCAL. Concretely, we extract the quantized vectors, whose dimension equals the number of categorical variables, from the Gumbel-Softmax operation and assign two residuals to the same pattern if both vectors have identical values for all 32 categorical variables. Fig. 7 shows the prediction results of STGCN on METR-LA for two different input sequences, one from the training dataset and one from the test dataset. The quantized vectors for each pair in Fig. 7 (a), (b), and (c) belong to the same pattern. A notable implication is that the residuals assigned to the same category show a similar pattern in their sequences. The results for the training dataset (top) and test dataset (bottom) share a similar ground truth with a similar calibration strategy. That is, our ResCAL can provide interpretable evidence for calibration. This is crucial for real-time traffic forecasting since the calibration of residuals is mostly needed in the case of unintended failures, and case studies are essential for maintenance of the system.
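The grouping rule described above, where two residual sequences share a pattern only if their code vectors match in every categorical variable, can be sketched as a dictionary keyed by the code tuple. The toy vectors use 3 categorical variables instead of 32 for brevity, and the function name is hypothetical.

```python
from collections import defaultdict


def group_by_pattern(codes):
    """Map each distinct code vector to the indices of the samples sharing it."""
    groups = defaultdict(list)
    for idx, code in enumerate(codes):
        groups[tuple(code)].append(idx)  # identical in every variable
    return dict(groups)


# Each inner list holds the selected category index per categorical variable.
codes = [[1, 0, 3], [2, 2, 2], [1, 0, 3]]
patterns = group_by_pattern(codes)
```

Samples from the training and test sets that fall into the same group can then be plotted side by side for case studies.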
| Models | 15 min MAE | 15 min RMSE | 15 min MAPE | 30 min MAE | 30 min RMSE | 30 min MAPE | 60 min MAE | 60 min RMSE | 60 min MAPE |
|---|---|---|---|---|---|---|---|---|---|
| STGCN (YuYZ18) | 3.33 | 5.77 | 8.92% | 3.74 | 6.65 | 10.39% | 4.32 | 7.72 | 12.52% |
| STGCN | 3.43 | 5.87 | 9.09% | 3.85 | 6.85 | 10.84% | 4.40 | 7.95 | 12.94% |
| STGCN | 3.46 | 5.84 | 9.00% | 3.90 | 6.75 | 10.56% | 4.46 | 7.87 | 12.77% |
| STGCN + Ours | 2.98 | 5.28 | 7.73% | 3.47 | 6.26 | 9.46% | 4.10 | 7.36 | 11.62% |
| Graph WaveNet (WuPLJZ19) | 2.88 | 5.10 | 7.30% | 3.33 | 6.03 | 8.93% | 3.91 | 7.01 | 10.84% |
| Graph WaveNet | 2.90 | 5.13 | 7.23% | 3.37 | 6.11 | 8.72% | 4.01 | 7.22 | 10.57% |
| Graph WaveNet | 3.31 | 5.55 | 8.43% | 3.62 | 6.29 | 9.66% | 4.13 | 7.17 | 11.26% |
| Graph WaveNet + Ours | 2.88 | 5.09 | 7.30% | 3.33 | 6.02 | 8.85% | 3.87 | 6.99 | 10.66% |

6.4. Tuning Analysis
In the tuning analysis, we show that the performance gain of ResCAL comes mainly from its use of residuals rather than from external factors (e.g., additional parameters). As an add-on module, our ResCAL introduces additional parameters compared to using a baseline alone. In addition, ResCAL uses the residuals from previous predictions and thus observes a wider range of input sequences. To check this, we conducted an experiment testing whether the performance gains can be obtained by simply increasing the model complexity or the input length without using ResCAL. We introduced two variations of the baselines: one with more parameters (LH) and one with longer input sequences (LI). Concretely, for the LH models we widen the channels of the convolutional layers to add as many parameters as our ResCAL introduces, and for the LI models we provide longer input sequences. Table 5 shows the results of the two variations and of the original baselines with our ResCAL on METR-LA. Interestingly, even with more parameters and a wider range of usable data, the baselines cannot reproduce the results of our ResCAL. For Graph WaveNet-LH, the additional parameters appear to improve MAPE; however, MAE and RMSE degrade significantly, even compared to the original Graph WaveNet. In contrast, the model with our ResCAL shows a stable performance improvement in all cases.
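To illustrate how an LH-style variant could be sized, the sketch below widens a stack of 1D convolutions until its parameter count reaches a target budget. Everything here is an assumption for illustration: the layer layout, default sizes, and the names `conv_stack_params` and `width_for_budget` are hypothetical and do not describe the actual baselines or the parameter count of ResCAL.

```python
def conv_stack_params(width, in_channels=2, num_layers=4, kernel_size=2):
    """Parameter count of a plain stack of 1D convolutions with bias.

    Each layer has kernel_size * c_in * c_out weights plus c_out biases.
    """
    total = 0
    c_in = in_channels
    for _ in range(num_layers):
        total += kernel_size * c_in * width + width
        c_in = width
    return total

def width_for_budget(target_params, start_width=1, **layout):
    """Smallest channel width whose parameter count reaches target_params."""
    width = start_width
    while conv_stack_params(width, **layout) < target_params:
        width += 1
    return width

# Hypothetical baseline with 32 channels, and a hypothetical add-on budget.
base = conv_stack_params(32)
extra = 2000  # assumed number of parameters added by an add-on module
print(base, width_for_budget(base + extra, start_width=32))
```

Because the count grows roughly quadratically with the channel width, a modest widening is usually enough to match the parameter budget of a small add-on module.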
6.5. Sanity Check on Clustering
To validate the reliability of our quantization branch, we performed a sanity check by examining the patterns with the fewest occurrences. Fig. 8 highlights the time steps assigned to the least frequent patterns. We observe that abnormal residuals mainly occur in the highlighted regions. This demonstrates that our quantization branch faithfully captures minor patterns without being biased toward dominant events.
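Selecting the time steps that belong to the least frequent patterns can be sketched as follows. This is a minimal illustration assuming each time step already carries a hashable pattern identifier; `rare_pattern_steps` and the parameter `k` are hypothetical names of our own.

```python
from collections import Counter

def rare_pattern_steps(pattern_ids, k=1):
    """Return indices of time steps assigned to the k least frequent patterns."""
    counts = Counter(pattern_ids)
    # Stable sort: ties keep the order in which patterns first appeared.
    rare = {p for p, _ in sorted(counts.items(), key=lambda kv: kv[1])[:k]}
    return [t for t, p in enumerate(pattern_ids) if p in rare]

# Toy sequence of pattern identifiers over seven time steps.
ids = ["a", "a", "b", "a", "c", "c", "a"]
print(rare_pattern_steps(ids, k=1))
print(rare_pattern_steps(ids, k=2))
```

The returned indices are the time steps one would highlight on the residual plot to check whether rare patterns coincide with abnormal residuals.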
7. Conclusion
In this work, we introduce a real-time setting in traffic forecasting and show that the residuals produced by deep-learning-based models are highly correlated both spatially and temporally. To fully account for the autocorrelation of the residuals, we present a model-agnostic add-on module named ResCAL, which calibrates the predictions by estimating the residuals. Without having to modify the original architectures, which may entail high computational cost, our ResCAL module effectively captures the spatial-temporal dependencies of the residuals. On METR-LA and PEMS-BAY, our ResCAL consistently improves the performance of existing forecasting models in event situations. Furthermore, we demonstrate the high practicality of our ResCAL in real-time traffic forecasting by providing interpretability for calibration and effectively calibrating predictions with significant errors. While we focus on traffic datasets, our proposed approach can be readily adapted to various tasks with autocorrelated residuals. We hope that our findings will shed light on future research on other forecasting problems as well as traffic forecasting.
Acknowledgements.
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2019000075, Artificial Intelligence Graduate School Program (KAIST)), and the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. NRF2022R1A2B5B02001913). This work was also partially supported by NAVER Corp.