1 Introduction
With steadily increasing urbanization in major cities in the United States, city planning and state-level department of transportation organizations are now focusing heavily on traffic management systems. The complexity of traffic management has, in the past, been too high a bar for most control-oriented technology solutions. Signal control, variable messaging signs, and traffic metering onto highways have been managed at a localized level in the past and have sufficed. With increased densities of vehicles on the road network, however, these solutions are falling short, and traffic is having a detrimental impact on area economics, productivity, emissions, and quality of life. As an example, the costs associated with congestion in California have been estimated to be on the order of $29B, with approximately $1.7B attributed to each of the Los Angeles (LA) and San Francisco (SFO) regions alone [trip2018].
New data collection opportunities and advanced computing capabilities are beginning to emerge in transportation systems. Most of these are being deployed in a "corridor environment," where congested sections of major highways are instrumented to collect a variety of traffic measurements. These measurements are fed back to traffic management centers (TMCs) and decision systems that, with the approval of humans in the loop, provide timely alerts to authorities and control traffic using cost measures that provide the best outcomes for the corridor, such as congestion metrics, fuel efficiency, and emissions. Since the 1980s, TMCs have spread widely across the United States, with over 280 TMCs operating in the nation today. The decision systems in use by TMCs are a key component of an efficiently operating transportation system and the subject of significant research.
In parallel with the emerging data collection and computing capabilities afforded to TMCs, techniques to better forecast traffic conditions, a foundational component of a TMC's decision system, are also seeing significant improvements to the state of the art. Statistical techniques such as autoregressive models [williams2003modeling] and Kalman filtering [kumar2017traffic] are commonly used for time series forecasting. However, these models work well only for stationary time series data [al2011selecting]. Machine learning methods such as artificial neural networks [chan2012neural, karlaftis2011statistical, castro2009online, ahn2016highway] have been shown to outperform classical methods on highly nonlinear and nonstationary traffic data. Recently, deep learning methods such as deep belief networks [huang2014deep], stacked autoencoders [lv2015traffic], and recurrent neural network variants [ma2015long, fu2016using] have emerged as promising approaches because of their ability to capture long-term temporal dependencies. However, they do not model the spatial dependencies of the highway network. Convolutional neural networks with recurrent units have been investigated to model the spatiotemporal dynamics of traffic, where the spatiotemporal traffic data are converted to a sequence of images and sequence-to-sequence learning is performed on the images [ma2017learning]. However, these methods pose significant limitations because they violate the fundamental non-Euclidean property of the network data. While nearby pixels in an image are correlated, nearby locations on a highway can be quite different because they can be on opposite sides of the highway (for example, one direction going into the city and one coming out of it). To that end, the Diffusion Convolutional Recurrent Neural Network (DCRNN) [li2017diffusion] has been proposed to overcome the challenges of spatiotemporal modeling of highway traffic networks. It is a state-of-the-art graph-based neural network that captures spatial correlation by a diffusion process on a graph and temporal dependencies using a sequence-to-sequence recurrent neural network. These data-driven forecasting methods can help better utilize transportation resources, develop policies for better traffic control, optimize routes, and reduce incidents and accidents on the road.
The high performance of deep learning models, including DCRNN, can be attributed to the availability of significant amounts of historical data for training. While these deep learning models can be transformational for the performance of a transportation decision system, many U.S. states do not have the data collection infrastructure that can provide such historical data for training.
Moreover, the dynamic nature of transportation systems dictates that even for systems with highly instrumented corridors, other areas of emerging congestion may not have sufficient instrumentation or historical data. Relatedly, while several new data collection infrastructures have been deployed in various states, the time required for data collection can hinder model development and deployment. In the absence of an installed data collection infrastructure, or in areas where sensors are not as geospatially dense, probe data collected from GPS and cellphones can be used as a proxy to measure key traffic metrics such as speed. In these cases, model training can become difficult because historical data either may be unavailable or cannot be accumulated because of privacy concerns. These challenges, combined with the improved traffic forecasting capabilities afforded by the DCRNN approach, suggest that identifying methods to deploy models trained in areas rich in historical data to areas with a paucity of data is especially promising. Additionally, if successful, these methods can allow states, cities, and municipalities to more quickly develop improved traffic forecasting capabilities with a significantly smaller infrastructure investment. Expanding the benefits beyond TMCs, many other intelligent transportation applications, ranging from dynamic routing for freight traffic to congestion pricing based on forecast traffic conditions, could also benefit from localized models trained on data sets from more data-rich locations.
Transfer learning is a promising approach to address the data paucity problem. In this approach, a model trained for one task is reused and/or adapted for a related task. While transfer learning is widely used for image classification and, in the text domain, for sentiment analysis and document classification [zhuang2019comprehensive, tan2018survey], it has received less attention in the traffic forecasting domain. Using transfer learning methods for graph-based highway traffic forecasting approaches such as DCRNN is not a trivial task. The reason is that graphs have complex neighborhood correlations, as opposed to images, which have relatively simple local correlations because they are sampled from the same Euclidean grid domain [zhou2018graph]. Moreover, DCRNN cannot perform transfer learning because it learns the location-specific spatiotemporal patterns in the data and requires the same highway network graph for both training and inference.
To address these issues, we have developed TL-DCRNN, a DCRNN with transfer learning capability. Given a large highway network with historical data, TL-DCRNN partitions the network into a number of regions. At each epoch, each region-specific graph and its corresponding time series data are used to train a single encoder-decoder model using mini-batch stochastic gradient descent. Consequently, the location-specific traffic patterns are marginalized, and the model tries to learn the traffic dynamics as a function of graph connectivity and temporal patterns alone. We conduct extensive experiments on the real-world traffic dataset of the entire California road network from the Performance Measurement System (PeMS) administered by the California Department of Transportation (Caltrans). PeMS is one of the first systems in the United States to instrument and collect data on highways at scale. The system has been used for operational analysis, planning, and research studies for almost 20 years. Our contributions are as follows:

We develop a new graph-partitioning-based transfer learning approach for the diffusion convolutional recurrent neural network that learns from data-rich regions of the highway network and makes short-term forecasts for unseen regions of the network.

By marginalizing location-specific information with transfer learning, we show that it is feasible to model the traffic dynamics as a function of temporal patterns and network connectivity.

We demonstrate that our proposed transfer learning method can learn from different regions of the California highway network and can forecast traffic on unseen regions. We show the efficacy of the method by learning from the San Francisco (SFO) region data and forecasting for the Los Angeles (LA) region and vice versa.
2 Problem setup
The short-term highway traffic forecasting problem can be defined on a weighted directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W)$, where $\mathcal{V}$ is a set of $N$ nodes that represent highway sensor locations, $\mathcal{E}$ is the set of directed edges connecting these nodes, and $W \in \mathbb{R}^{N \times N}$ is the weighted adjacency matrix that represents the connectivity between the nodes in terms of highway network distance. The traffic state at time step $t$ is represented as a graph signal $X^{(t)} \in \mathbb{R}^{N \times P}$ on the graph $\mathcal{G}$, where $P$ is the number of traffic metrics of interest (e.g., traffic flow, traffic speed, and density, which change over time). Given $T'$ historical observations of the traffic state up to and including the current traffic state on the graph $\mathcal{G}$, $[X^{(t-T'+1)}, \dots, X^{(t)}]$, the goal is to develop a model that can forecast the traffic state of the next $T$ time steps on all nodes of the graph, $[X^{(t+1)}, \dots, X^{(t+T)}]$.
Let $\mathcal{G}_u = (\mathcal{V}_u, \mathcal{E}_u, W_u)$ be the graph with $N_u$ nodes that represents the highway network for which we do not have historical time series data. Given observations of the current traffic state on the graph $\mathcal{G}_u$, $[X_u^{(t-T'+1)}, \dots, X_u^{(t)}]$, the goal is to develop a model that can forecast the traffic state of the next $T$ time steps on all nodes of the graph $\mathcal{G}_u$, $[X_u^{(t+1)}, \dots, X_u^{(t+T)}]$.
In the context of Pan and Yang's transfer learning classification [pan2009survey], our problem setup corresponds to the transductive transfer learning setting, where the source and target tasks are the same (short-term traffic forecasting on graphs), while the source and target domains (seen and unseen regions of the highway network) are different but related.
3 Diffusion Convolution Recurrent Neural Network (DCRNN)
DCRNN is a state-of-the-art method for short-term traffic forecasting [li2017diffusion]. It is an encoder-decoder neural network architecture that performs sequence-to-sequence learning to carry out multistep traffic state forecasting. A simple and powerful variant of recurrent neural networks, called gated recurrent units (GRUs) [cho2014learning], is used to design the encoder-decoder architecture. The matrix multiplications in the GRU are replaced with a diffusion convolution operation to form the DCRNN cell. In an $L$-layer DCRNN architecture, each layer consists of a number of DCRNN cells. The DCRNN cell is defined by the following set of equations:

$$r^{(t)} = \sigma\left(\Theta_r \star_\mathcal{G} [X^{(t)}, H^{(t-1)}] + b_r\right),$$
$$u^{(t)} = \sigma\left(\Theta_u \star_\mathcal{G} [X^{(t)}, H^{(t-1)}] + b_u\right),$$
$$C^{(t)} = \tanh\left(\Theta_C \star_\mathcal{G} [X^{(t)}, (r^{(t)} \odot H^{(t-1)})] + b_C\right),$$
$$H^{(t)} = u^{(t)} \odot H^{(t-1)} + (1 - u^{(t)}) \odot C^{(t)},$$

where $X^{(t)}$ and $H^{(t)}$ denote the input and final state at time $t$, respectively; $r^{(t)}$, $u^{(t)}$, and $C^{(t)}$ are the reset gate, update gate, and cell state at time $t$, respectively; $\star_\mathcal{G}$ denotes the diffusion convolution; and $\Theta_r$, $\Theta_u$, and $\Theta_C$ are parameters for the corresponding filters. The diffusion convolution operation over the input graph signal $X \in \mathbb{R}^{N \times P}$ and a convolution filter $f_\theta$, which learns the representations for graph-structured data during training, is defined as

$$X_{:,p} \star_\mathcal{G} f_\theta = \sum_{k=0}^{K-1} \left( \theta_{k,1} \left(D_O^{-1} W\right)^k + \theta_{k,2} \left(D_I^{-1} W^\top\right)^k \right) X_{:,p} \quad \text{for } p \in \{1, \dots, P\}, \qquad (1)$$

where $K$ is the maximum number of diffusion steps; $D_O^{-1}W$ and $D_I^{-1}W^\top$ are the transition matrices of the diffusion process and the reverse one, respectively; $D_O$ and $D_I$ are the out-degree and in-degree diagonal matrices, respectively; and $\theta_{k,1}$ and $\theta_{k,2}$ are the learnable filters for the bidirectional diffusion process. The in-degree and out-degree diagonal matrices provide the capability to capture the effect of upstream as well as downstream traffic. The driving distances between sensor locations are used to build the adjacency matrix $W$; a Gaussian kernel [shuman2012emerging] and a threshold parameter are used to sparsify $W$.
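To make the operation concrete, here is a minimal NumPy sketch of the bidirectional diffusion convolution in Equation 1. This is illustrative only: the function name and the scalar per-step filters are our simplifying assumptions, whereas the actual DCRNN implementation learns matrix-valued filters per input/output channel.

```python
import numpy as np

def diffusion_convolution(X, W, theta_fwd, theta_bwd):
    """Bidirectional diffusion convolution of a graph signal.

    X         : (N x P) graph signal, one row per sensor
    W         : (N x N) weighted adjacency matrix
    theta_fwd : (K,) filter weights for the forward walk (D_O^-1 W)^k
    theta_bwd : (K,) filter weights for the reverse walk (D_I^-1 W^T)^k
    """
    K = len(theta_fwd)
    d_out = W.sum(axis=1)                            # out-degrees
    d_in = W.sum(axis=0)                             # in-degrees
    P_fwd = W / np.where(d_out == 0, 1, d_out)[:, None]
    P_bwd = W.T / np.where(d_in == 0, 1, d_in)[:, None]
    out = np.zeros_like(X, dtype=float)
    S_fwd, S_bwd = X.astype(float), X.astype(float)  # k = 0: identity terms
    for k in range(K):
        out += theta_fwd[k] * S_fwd + theta_bwd[k] * S_bwd
        S_fwd = P_fwd @ S_fwd                        # advance to (D_O^-1 W)^(k+1) X
        S_bwd = P_bwd @ S_bwd                        # advance to (D_I^-1 W^T)^(k+1) X
    return out
```

With `theta_fwd = [1, 0]` and `theta_bwd = [0, 0]`, the operation reduces to the identity, which serves as a quick sanity check.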
During the training of DCRNN, a mini-batch of time series sequences, each of length $T'$, from the historical time series data is given as input to the encoder. The decoder receives a fixed-length hidden representation of the data from the encoder and forecasts the next $T$ time steps for each sequence in the mini-batch. The layers of DCRNN are trained by using backpropagation through time. DCRNN learns the weight matrices in Equation 1 by minimizing the mean absolute error (MAE) as the loss function.
4 Transfer learning DCRNN
Similar to the convolution operation on images, the diffusion convolution on a graph is designed to capture patterns that are local to a given node [atwood2016diffusion]. This operation learns the latent representation of a diffusion process that starts from a given node and moves to the neighboring connected nodes, and it is particularly suitable for capturing the local diffusion behavior of traffic dynamics. From Equation 1, we see that DCRNN becomes location-specific because of the presence of the weighted adjacency matrix $W$ in the diffusion step. As a result, the convolution filter depends on the given $W$, which is kept constant throughout the training process. Our hypothesis is that if we change the graph and the corresponding time series data during the training process, then we can make the diffusion convolution filters generic as opposed to location-specific. Consequently, the resulting model is more generalizable and can be used to forecast traffic on unseen graphs.
A high-level overview of the proposed TL-DCRNN is shown in Figure 1. Given the source graph $\mathcal{G}$ with the historical data $X$, TL-DCRNN first partitions it into $k$ subgraphs with equal numbers of nodes using a graph partitioning method that takes only the weighted adjacency matrix of $\mathcal{G}$. Let $\{\mathcal{G}_1, \dots, \mathcal{G}_k\}$ and $\{X_1, \dots, X_k\}$ be the set of subgraphs and their corresponding time series data. The mini-batch stochastic gradient update of TL-DCRNN is the same as that of DCRNN: for a given subgraph $\mathcal{G}_i$, a batch of input and output time series sequences from $X_i$ is used to compute the errors and update the weights of the encoder-decoder architecture by backpropagation through time. The mini-batch is constructed to preserve time ordering, a common approach in sequence-to-sequence time series modeling. A subgraph epoch for a given $\mathcal{G}_i$ uses all the data in $X_i$ as a series of mini-batches to update the parameters of the encoder-decoder architecture. An epoch runs a subgraph epoch for each $\mathcal{G}_i$.
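The per-epoch schedule above (an epoch runs a subgraph epoch for every subgraph, and each subgraph epoch streams its time-ordered mini-batches through the single shared model) can be sketched as follows. This is a structural sketch only: `update_model` is a hypothetical placeholder for the DCRNN encoder-decoder gradient step, and the subgraph data are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, T_hist = 4, 8, 100
subgraphs = [rng.random((n, n)) for _ in range(k)]    # adjacency matrix per subgraph
series = [rng.random((T_hist, n)) for _ in range(k)]  # speed time series per subgraph

def minibatches(X, seq_len=12, batch_size=16):
    """Yield time-ordered mini-batches of (input, target) sequence pairs."""
    batch = []
    for s in range(len(X) - 2 * seq_len):             # preserve time ordering
        batch.append((X[s:s + seq_len], X[s + seq_len:s + 2 * seq_len]))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def update_model(weights, adj, batch):
    """Placeholder for one stochastic gradient step of the shared
    encoder-decoder model on this subgraph's batch."""
    return weights

weights = {}
for epoch in range(2):
    for adj, X in zip(subgraphs, series):             # one subgraph epoch per subgraph
        for batch in minibatches(X):                  # all data of this subgraph
            weights = update_model(weights, adj, batch)
```

Because every subgraph epoch updates the same weights, no single graph's adjacency matrix is baked into the model.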
For inference, TL-DCRNN partitions the target graph $\mathcal{G}_u$ into subgraphs of the same size as the training subgraphs using the graph partitioning method. Given the current state of the traffic as a sequence of $T'$ time steps on a subgraph, the trained TL-DCRNN model forecasts the traffic for the next $T$ time steps.
From the learning perspective, the key difference between DCRNN and TL-DCRNN is that the former learns the location-specific spatiotemporal patterns on a static graph, whereas the latter marginalizes the location-specific information and learns the spatiotemporal patterns across multiple subgraphs. Consequently, the model weights are trained in such a way that they can provide generalization to similar but unseen graphs, a capability that can be used for forecasting on an unseen highway network graph.
Note on graph partitioning:
Several graph partitioning methods exist that can be leveraged for partitioning the graphs in the training and inference phases. Typically, these methods cannot partition the graph into equal-sized subgraphs, but they can give partitions of similar sizes. In these cases, we can find the largest subgraph size and perform zero padding for all other subgraphs. The unseen graph $\mathcal{G}_u$ does not need to be as large as $\mathcal{G}$. For a given $\mathcal{G}_u$, when the size of the leftover partition is relatively close to the training subgraph size, one can apply zero padding to the subgraph whose size is smaller. When the leftover partition is much smaller, the method of subgraphs with overlapping nodes [mallick2019graphpartitioningbased] can be adopted, where a subgraph includes nodes from neighboring, geographically close subgraphs.
Assumptions and implications:
We assume that the graph $\mathcal{G}$ represents a large highway network such that it is amenable to graph partitioning and, in particular, that the partitioned subgraphs expose a range of traffic dynamics. Otherwise, the transfer learning ability of the method will suffer. Furthermore, the transductive transfer learning assumption of related domains applies to our method. The highway traffic data can be seen as samples generated from a high-dimensional distribution parameterized by highway vehicle composition, infrastructure, and entry-exit dynamics, among others. Given samples from a completely different distribution, the model will not be able to perform transfer learning with high accuracy. For example, a model that is trained on certain regions of California will not generalize to highway network traffic forecasting in completely different regions of the world such as China and India, where the highway vehicle composition, infrastructure, and traffic dynamics can be dramatically different.
5 Experimental results
Our dataset comprises data from 11,160 sensors for the entire year of 2018. Besides the time series data, PeMS captures spatial information such as the latitude and longitude of each station. We computed the pairwise driving distances between the sensors using the latitude and longitude and built the adjacency matrix using a thresholded Gaussian kernel as prescribed in [li2017diffusion].
To generate subgraphs for the TL-DCRNN training, we partitioned the whole California highway traffic network graph into 64 roughly equal-sized subgraphs using Metis's k-way partitioning method [metis]. Previously, we had shown that instead of training on the full California network, decomposing the large graph into 64 subgraphs and training 64 DCRNNs independently results in better accuracy [mallick2019graphpartitioningbased]. We divided the 64 subgraphs into two equal sets: source subgraphs (numbered from 1 to 32) and target subgraphs (numbered from 33 to 64). The 32 source subgraphs were selected systematically to ensure uniform coverage over the entire California network: the first 8 subgraphs were selected from eight districts or locations of California: North Central, Bay Area, South Central, Los Angeles, San Bernardino, Central, San Diego, and Orange County. We randomly selected 16 subgraphs from the remaining subgraphs without replacement. All the unselected subgraphs were grouped into the target subgraphs. From one year of time series data, we use 70% of the data (36 weeks) for training, 10% (5 weeks) for validation, and 20% (10 weeks) for testing. Consequently, we have six datasets. See Table 1 for the dataset summary and the nomenclature adopted.
Dataset  Description
Source subgraphs (1 to 32)
trainsrc  36 weeks of time series data (1 Jan. 2018 to 13 Sept. 2018)
valsrc  5 weeks of time series data (13 Sept. 2018 to 20 Oct. 2018)
testsrc  10 weeks of time series data (20 Oct. 2018 to 31 Dec. 2018)
Target subgraphs (33 to 64)
traintgt  36 weeks of time series data (1 Jan. 2018 to 13 Sept. 2018)
valtgt  5 weeks of time series data (13 Sept. 2018 to 20 Oct. 2018)
testtgt  10 weeks of time series data (20 Oct. 2018 to 31 Dec. 2018)
Because of the fixed dimensions of the input and adjacency matrices used in TL-DCRNN, all subgraphs should contain an equal number of nodes. We added rows and columns filled with zeros to make all the adjacency matrices exactly equal in size.
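The padding step can be sketched as follows (an illustrative helper we introduce here, not part of the released code):

```python
import numpy as np

def pad_to_size(adj, signal, n_max):
    """Zero-pad an (n x n) adjacency matrix and a (T x n) signal to n_max nodes.

    Padded nodes are disconnected (zero rows/columns in the adjacency
    matrix) and carry an all-zero signal, so they do not diffuse traffic.
    """
    n = adj.shape[0]
    adj_p = np.zeros((n_max, n_max))
    adj_p[:n, :n] = adj
    sig_p = np.zeros((signal.shape[0], n_max))
    sig_p[:, :n] = signal
    return adj_p, sig_p
```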
We refer the reader to the supplement page of the paper for hardware configurations, software environments, library versions, hyperparameter settings, data preparation, and data preprocessing methods. We note that the hyperparameter values were set to the default values of the open-source DCRNN implementation [li2017diffusiongit].
We used speed as the traffic forecasting metric. For training and inference, the input window and forecast horizon were both set to 60 minutes: the encoder receives 60 minutes of traffic data (time and speed; 12 observations, one every five minutes), and the decoder outputs the forecasts for the next 60 minutes (12 predictions, one every five minutes). MAE was used as the training loss and as the test accuracy metric for comparing the methods.
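Concretely, the encoder input / decoder target pairs can be sliced from a speed series as follows (illustrative sketch; 12 five-minute steps on each side):

```python
import numpy as np

def make_sequences(speed, in_len=12, out_len=12):
    """Slice a (T x N) speed series into (inputs, targets).

    inputs  : (B, in_len, N)  -- the past 60 minutes fed to the encoder
    targets : (B, out_len, N) -- the next 60 minutes the decoder must forecast
    """
    X, Y = [], []
    for t in range(len(speed) - in_len - out_len + 1):
        X.append(speed[t:t + in_len])
        Y.append(speed[t + in_len:t + in_len + out_len])
    return np.stack(X), np.stack(Y)
```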
To compare two methods, we computed pairwise (per-node) MAE difference values, which is possible because the MAE values of both methods are computed on the same set of nodes. We adopted the one-sided null hypothesis that the median difference between the two MAE distributions is greater than or equal to zero (the first method's MAE values are higher than or similar to those of the second). We computed the p-value from a paired one-sided Wilcoxon signed-rank test and rejected the null hypothesis when the p-value was less than 0.05 (a 5% significance level) in favor of the alternative that the median difference is less than zero (the first method's MAE values are lower than those of the second), in which case we concluded that the first method is better than the second.
5.1 Impact of number of subgraphs and epochs
We conducted an exploratory analysis to study the impact of the number of subgraphs and epochs on the forecast accuracy. We show in this section that TL-DCRNN benefits from the adoption of a larger number of subgraphs and epochs for training.
We used three values for the number of subgraphs—8 (1 to 8), 16 (1 to 16), and 32 (1 to 32)—and their corresponding time series data. All models were trained for 1 epoch on trainsrc. Approximately 6 minutes were needed to train each subgraph in each epoch. Therefore, for 8, 16, and 32 subgraphs, training took approximately 48, 96, and 192 minutes, respectively.
Using box-and-whisker plots, we show in Figure 2 the pairwise MAE differences between 32 and 8 subgraphs and between 32 and 16 subgraphs on the 5,579 nodes in valsrc from the three models. A difference value below zero indicates that the 32-subgraph MAE is lower than the other. We can see that the 32-subgraph MAE values are generally lower than those of 8 and 16 subgraphs. We also compared the MAE values obtained from 8 and 16 subgraphs with the one-sided paired Wilcoxon signed-rank test. The results show that the latter is better than the former (p-value below the 0.05 threshold). We found that 4,051 out of 5,579 nodes have lower MAE values when increasing the number of subgraphs from 8 to 16. Similarly, the one-sided Wilcoxon signed-rank test confirmed the improvement in MAE values when 32 subgraphs are adopted for training (p-value below the 0.05 threshold). We observed that 3,686 out of 5,579 nodes in valsrc have lower MAE values when increasing the number of subgraphs from 16 to 32. Therefore, we adopted all 32 subgraphs for training (5,579 nodes) for the rest of the experiments.
Next, we increased the number of training epochs of TL-DCRNN and analyzed its impact on forecasting accuracy. We trained TL-DCRNN on trainsrc for 1, 10, 20, and 30 epochs; the model training took approximately 192, 1,920, 3,840, and 5,760 minutes, respectively. We evaluated the trained models on valsrc. The distributions of the pairwise MAE differences between 30 and 1, 30 and 10, and 30 and 20 epochs for the 5,579 nodes are shown in Figure 3. The paired one-sided Wilcoxon test obtained p-values below the 0.05 threshold for the comparisons of 1 and 10, 10 and 20, and 20 and 30 epochs, confirming that the null hypothesis can be rejected at a 5% significance level. For the rest of the experiments, we therefore adopted 30 epochs for training.
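The paired one-sided Wilcoxon signed-rank test used throughout these comparisons can be reproduced with SciPy; the following sketch uses synthetic per-node MAE values, not our experimental data:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
mae_a = rng.normal(2.0, 0.3, size=500)           # per-node MAE of method A
mae_b = mae_a + rng.normal(0.1, 0.05, size=500)  # method B: slightly higher error

# H0: median(mae_a - mae_b) >= 0 (A is no better than B).
# Reject at the 0.05 level in favor of "A has lower MAE than B".
stat, p_value = wilcoxon(mae_a, mae_b, alternative="less")
print("A better than B:", p_value < 0.05)
```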
5.2 Transfer learning on California network
Here, we show that the proposed subgraph training strategy of TL-DCRNN for transfer learning is effective.
We trained and tested TL-DCRNN on the trainsrc and testtgt data, respectively. We compared this approach with DCRNN, where we trained a subgraph-specific model for each subgraph in trainsrc. This resulted in 32 subgraph-specific trained models. We refer to this method as SS-DCRNN. We adopted three test modalities for SS-DCRNN. First, we evaluated each of the 32 SS-DCRNN models on valtgt and selected the one with the lowest MAE for testing on testtgt (a single model for testing, referred to as SS-DCRNN-S). Second, for each subgraph in testtgt, we evaluated the 32 SS-DCRNN models on valtgt and selected the model with the lowest MAE for testing on testtgt (multiple models for testing, referred to as SS-DCRNN-M). Third, we adopted a bagging approach, where the forecast for each subgraph in testtgt is given by the average of the 32 model forecasts (bagged models for testing, referred to as SS-DCRNN-B). Note that SS-DCRNN-S and SS-DCRNN-M introduce a bias in favor of SS-DCRNN since the model selection is made on data from the target subgraphs.
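The three test modalities differ only in how the 32 trained models are selected or combined; as a sketch (synthetic validation-MAE table; the array names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_targets = 32, 32
# val_mae[m, g]: validation MAE of source-subgraph model m on target subgraph g
val_mae = rng.random((n_models, n_targets))

# SS-DCRNN-S: the single model with the lowest overall validation MAE
ss_s = int(np.argmin(val_mae.mean(axis=1)))

# SS-DCRNN-M: the best model selected per target subgraph
ss_m = np.argmin(val_mae, axis=0)

# SS-DCRNN-B: bagging -- average the forecasts of all 32 models
def ss_b(forecasts):
    """forecasts: (n_models, horizon, n_nodes) stacked model outputs."""
    return forecasts.mean(axis=0)
```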
Figure 4 shows the distributions of the pairwise MAE differences obtained on testtgt by the different methods. We observe that TL-DCRNN outperforms the SS-DCRNN variants. The results show that the MAE values obtained by TL-DCRNN are lower than those of SS-DCRNN-S, SS-DCRNN-M, and SS-DCRNN-B on 5,188, 4,577, and 3,416 out of 5,581 nodes, respectively. The observed differences are significant according to a paired one-sided Wilcoxon signed-rank test, which showed p-values below the 0.05 threshold for TL-DCRNN vs SS-DCRNN-S, TL-DCRNN vs SS-DCRNN-M, and TL-DCRNN vs SS-DCRNN-B.
Given that TL-DCRNN and SS-DCRNN differ only with respect to the model training, the superior performance can be attributed to the proposed subgraph-based training strategy. The comparison between SS-DCRNN-M and SS-DCRNN-S shows that the MAE values obtained by SS-DCRNN-M are lower than those of SS-DCRNN-S on 3,095 out of 5,581 nodes. This is because the subgraph-specific model selection results in a collection of models, which provides more robust forecasts than a single model. The MAE values obtained by SS-DCRNN-B are worse than those of TL-DCRNN, SS-DCRNN-S, and SS-DCRNN-M. Unlike SS-DCRNN-S and SS-DCRNN-M, the subgraph-specific models in SS-DCRNN-B do not have any advantage on unseen graphs. While bagging good models increases the accuracy, the same approach with poor models can significantly degrade the accuracy [opitz1999popular].
5.3 Direct learning on California network
Here, we show that when the same subgraphs are used for training and testing, SS-DCRNN achieves results that are better than those of TL-DCRNN.
We trained SS-DCRNN and TL-DCRNN on trainsrc and tested them on testsrc. The TL-DCRNN model does not learn subgraph-specific traffic dynamics, resulting in one model, whereas SS-DCRNN learns subgraph-specific traffic dynamics, resulting in 32 models.
Figure 5 shows the distributions of pairwise differences between the MAE values of TL-DCRNN and SS-DCRNN obtained on the 32 partitions in testsrc. Out of 32 subgraphs, the medians of the distributions are positive for 26 subgraphs. The results show that the MAE values obtained by SS-DCRNN are lower than those obtained by TL-DCRNN on 3,412 out of 5,579 nodes in the testsrc data. The observed differences are significant according to a paired one-sided Wilcoxon signed-rank test, which showed a p-value below the 0.05 threshold. The superior performance of the 32 SS-DCRNN models can be attributed to the fact that they take into account the location-specific information for learning the traffic dynamics. In single-model TL-DCRNN, the location-specific spatiotemporal learning is traded off to achieve transfer learning capability. In particular, the inductive bias due to location-specific training yields accuracy on seen graphs that is stronger than that of the transfer-learning TL-DCRNN under the adopted experimental settings.
5.4 Transfer learning between LA and SFO
In Section 5.2 we selected the subgraphs for training to have uniform coverage over the districts of California. Here, we compare TL-DCRNN and SS-DCRNN on the two major districts of California and demonstrate that a TL-DCRNN model trained on LA (SFO) can be used for forecasting on SFO (LA).
In the PeMS system, the LA and SFO districts have 2,716 and 2,382 sensor locations, respectively. We partitioned the highway traffic networks of LA and SFO into 15 and 13 subgraphs, respectively, to keep the number of nodes per partition as close as possible to that of the 64 partitions used in Section 5.2. We trained two TL-DCRNN models on trainsrc of LA and SFO and tested them on testtgt of SFO and LA, respectively. We used the same timeline for training, validation, and testing as shown in Table 1. We adopted the three test modalities for SS-DCRNN (as discussed in Section 5.2): SS-DCRNN-S, SS-DCRNN-M, and SS-DCRNN-B. Moreover, we included direct learning in the analysis, where SS-DCRNN was trained and tested on the same subgraphs of LA (SFO).
Figure 6 shows the distributions of the pairwise MAE differences between the SS-DCRNN methods and TL-DCRNN when the models were trained on SFO and tested on LA (except the direct-learning DCRNN, which was trained on LA). We observe that TL-DCRNN outperforms SS-DCRNN-S, SS-DCRNN-M, and SS-DCRNN-B. For LA, the MAE values obtained by TL-DCRNN are lower than those obtained by SS-DCRNN-S, SS-DCRNN-M, and SS-DCRNN-B on 1,467, 2,195, and 1,553 out of 2,717 nodes, respectively. The paired one-sided Wilcoxon signed-rank test shows p-values below the 0.05 threshold for TL-DCRNN vs SS-DCRNN-S, TL-DCRNN vs SS-DCRNN-M, and TL-DCRNN vs SS-DCRNN-B. The trend is similar for the results on SFO, shown in Figure 7. The MAE values obtained by TL-DCRNN are lower than those of SS-DCRNN-S, SS-DCRNN-M, and SS-DCRNN-B on 1,237, 1,546, and 1,297 out of 2,383 nodes, respectively. The observed differences are significant according to a paired one-sided Wilcoxon signed-rank test, which shows p-values below the 0.05 threshold for all three comparisons.
Similar to the results in Section 5.3, direct-learning SS-DCRNN obtains MAE values lower than those of TL-DCRNN.
To gain further insight into the observed TL-DCRNN errors, we computed, for each node, the coefficient of variation, given by the ratio of the standard deviation to the mean of the time series in testtgt. This measure can be used as a proxy for the traffic dynamics: smaller values indicate that the speed is stable (less dynamic), and larger values mean that a wide range of speed values has been observed (more dynamic). Figure 8 shows the distribution of pairwise MAE differences between TL-DCRNN and SS-DCRNN as a function of coefficient of variation intervals. On the LA data, we can see a clear trend in which an increase in the coefficient of variation increases the MAE difference values: SS-DCRNN forecasts become more accurate than those of TL-DCRNN. This is because SS-DCRNN, with 32 models trained and tested on the same graph, captures the location-specific traffic dynamics better than does the single TL-DCRNN model that was trained on the SFO dataset. We can observe a similar trend in the SFO dataset as well. The significant drop in the pairwise MAE differences for coefficients of variation larger than 0.4 can be attributed to the small number (2) of nodes in that interval. The number of nodes with a coefficient of variation larger than 0.3 is rather small when compared with the LA dataset. This shows that SFO traffic is less dynamic than LA traffic.
5.5 Comparison with other methods
In this section, we present our comparison of TL-DCRNN with other short-term traffic forecasting methods proposed in the literature. We show that despite not being trained on the same highway network graph and data, TL-DCRNN achieves an accuracy that is better than or comparable to that of the other traffic forecasting methods.
Specifically, we compared TL-DCRNN with the following methods: (1) autoregressive integrated moving average (ARIMA) [makridakis1997arma], which considers only the temporal relationships in the data; (2) support vector regression (SVR) [wu2004travel], a linear support vector machine for forecasting; (3) a feed-forward neural network (FNN) [raeesi2014traffic] with two hidden layers; (4) a fully connected LSTM (FC-LSTM) [sutskever2014sequence] with an encoder-decoder architecture; (5) a spatiotemporal graph convolutional network (STGCN) [yu2017spatio], which combines graph convolutions and gated temporal convolutions; (6) the diffusion convolutional recurrent neural network (DCRNN), as discussed in Section 3; (7) Graph WaveNet [wu2019graph], a CNN-based method with stacked dilated causal convolutions for handling temporal dependencies; and (8) a graph multi-attention network (GMAN) [zheng2019gman], an encoder-decoder architecture with multiple spatiotemporal attention mechanisms. For this comparison, we used the accuracy results reported in [zheng2019gman], where GMAN results are compared with those of the other methods.
All the methods were benchmarked on the PEMS-BAY dataset, with 325 sensors in the Bay Area and 6 months of time series data ranging from Jan. 1, 2017 to June 30, 2017. We used 70% of the data for training (Jan. 1, 2017, to May 7, 2017), 10% for validation (May 7, 2017, to May 25, 2017), and 20% for testing (May 25, 2017, to June 30, 2017). We trained TL-DCRNN on the LA dataset on the same timeline and tested it on the PEMS-BAY dataset.
In addition to MAE, we used the root mean square error (RMSE) and mean absolute percentage error (MAPE) metrics to compare the accuracy of these models. The comparison results are shown in Table 2. We observe that TL-DCRNN achieves an MAE of 2.13, which is better than ARIMA (3.38), SVR (3.28), FNN (2.46), FC-LSTM (2.37), and STGCN (2.49). The trend is similar for RMSE and MAPE. Although DCRNN, Graph WaveNet, and GMAN were trained on the PEMS-BAY dataset itself, their accuracy metrics were not significantly better than those obtained by TL-DCRNN trained on the LA dataset.
Method          MAE           RMSE          MAPE
Training and testing on PEMS-BAY:
ARIMA           3.38          6.50          8.30%
SVR             3.28          7.08          8.00%
FNN             2.46          4.98          5.89%
FC-LSTM         2.37          4.96          5.70%
STGCN           2.49          5.69          5.79%
DCRNN           2.07          4.74          4.90%
Graph WaveNet   1.95          4.52          4.63%
GMAN            1.86          4.32          4.31%
Training on LA and testing on PEMS-BAY:
TL-DCRNN        2.13 ± 1.09   5.23 ± 2.29   5.55% ± 4.34%
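For reference, the three accuracy metrics used in the table above can be computed as follows; this is a minimal NumPy sketch, and the function names are ours rather than taken from any of the compared implementations.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.
    Assumes no zero values in y_true (e.g., nonzero speeds)."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0
```

For example, with observed speeds [10, 20] and forecasts [12, 18], MAE and RMSE are both 2.0 and MAPE is 15%.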
6 Related work
Recently, graph-convolution-based forecasting models have shown significant improvements in traffic forecasting tasks over classical approaches such as ARIMA and Kalman filtering, which are not effective in capturing complex spatial and temporal correlations [williams2003modeling, chan2012neural, karlaftis2011statistical, castro2009online]. Cui et al. [cui2018traffic] developed a graph convolutional long short-term memory network, using the graph convolution operation inside the LSTM cell with regularization approaches. Yu et al. [yu2017spatio] integrated graph convolution and gated temporal convolution in a spatiotemporal convolutional block for traffic forecasting. Li et al. [li2017diffusion] proposed the DCRNN method, which models the traffic state as a diffusion process on a graph and uses it within a GRU cell. All these methods, however, cannot perform forecasting on unseen graphs, because they learn location-specific traffic dynamics and require the same highway network for both training and inference.
Prior work on transfer learning for short-term traffic forecasting is sparse. Wang et al. [wang2018cross] proposed an image-based convolutional LSTM network to perform transfer learning for crowd flow prediction from a data-rich city to a data-scarce city. The method first learns a matching function using Pearson correlation to find a similar source region for each target region; during training, it minimizes the distance between the hidden representations of the target region and its matched source region inside the loss function. This approach does not incorporate multiple nodes, however, and does not take into account the spatial graph dependency. Recently, Yao et al. [yao2019learning] proposed a meta-learning method for traffic volume and water quality prediction that captures knowledge from multiple nodes. It uses an image-based convolutional LSTM network to train on multiple source nodes and uses the trained weights for prediction on the target nodes. The method uses a spatial-temporal memory to store representations of different regions of the source cities. The regions are found by clustering the averaged 24-hour patterns of each region, and region-specific weights stored in the memory are utilized for prediction via an attention mechanism. Krishnakumari et al. [krishnakumari2018understanding]
developed a method that first clusters the feature vectors obtained from a pretrained image-based convolutional network and then uses the clusters to produce a one-step forecast for the similar target location using an ensemble of multiple models, such as a multilayer perceptron, random forest, K-nearest neighbors, support vector machine (SVM), and Gaussian process. Xu et al. [xu2016cross] and Lin [lin2018transfer] conducted preliminary studies of cross-city transfer learning for traffic prediction using SVM and dynamic time warping, respectively. Fouladgar et al. [fouladgar2017scalable] proposed a transfer learning method using an image-based convolutional network and LSTM for traffic forecasting in the case of congestion. None of these methods uses graph convolution to model the spatial dependencies, and they cannot be applied directly to short-term highway forecasting.
TL-DCRNN is inspired by Cluster-GCN [chiang2019cluster], which proposes a graph convolutional network training scheme for learning tasks on large-graph classification problems. In this approach, each batch of stochastic-gradient-descent-based training uses sampled subgraphs of the original graph. Cluster-GCN is built for node and link classification on graphs, however, and it cannot be used to model the graph diffusion and temporal characteristics of traffic data.
7 Conclusion and future work
We developed TL-DCRNN, a graph-partitioning-based transfer learning approach for the diffusion convolutional recurrent neural network to forecast short-term traffic on a highway network. TL-DCRNN partitions the source highway network into a number of regions and learns the spatiotemporal traffic dynamics as a function of the traffic state and the network connectivity by marginalizing the location-specific patterns. The trained model is then used to forecast traffic on unseen regions of the highway network. We demonstrated the efficacy of TL-DCRNN using one year of California traffic data obtained from the Caltrans Performance Measurement System. Moreover, we showed that a model trained with the TL-DCRNN approach can perform transfer learning between the LA and SFO regions. TL-DCRNN outperformed popular methods used for large-scale traffic forecasting (autoregressive integrated moving average, support vector regression, feedforward neural network, fully connected LSTM, and spatiotemporal graph convolutional network) despite being applied to a region unseen in training, whereas the other methods were both trained and applied on the same region. These results offer strong evidence that practitioners and researchers can begin applying state-of-the-art forecasting methods such as TL-DCRNN to their own regions even in the absence of significant amounts of historical data. Allowing practitioners to apply emerging data-driven methods trained on datasets collected elsewhere is a transformative capability, enabling a wide range of transportation system operations and functions to run more efficiently and sustainably through improved forecasting at reduced infrastructure development costs.
Our future work will include (1) developing deployment strategies for traffic management systems, which can vary across the country; (2) adding transfer learning capability for alternative data sources, such as mobile device data, to reduce the cost of installing infrastructure sensors; (3) developing meta-learning strategies for graph-based transfer learning on highway networks; and (4) studying the road-network structural implications of extending this approach beyond highways, which may include characterizing how graph constraints are codified in the DCRNN.
Acknowledgments
This material is based in part upon work supported by the U.S. Department of Energy, Office of Science, under contract DEAC0206CH11357. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility under contract DEAC0206CH11357. This report and the work described were sponsored by the U.S. Department of Energy (DOE) Vehicle Technologies Office (VTO) under the Big Data Solutions for Mobility Program, an initiative of the Energy Efficient Mobility Systems (EEMS) Program. David Anderson and Prasad Gupte, the DOE Office of Energy Efficiency and Renewable Energy (EERE) managers, played important roles in establishing the project concept, advancing implementation, and providing ongoing guidance.
References
Supplementary document
Data preprocessing
We downloaded the dataset from the official PeMS website [pems]. PeMS has been harvesting data from 18,000 sensors, including inductive loops, side-fire radar, and magnetometers. These sensors record the speed and flow of the traffic every 30 seconds; the recorded values are then aggregated at a 5-minute granularity. Based on PeMS website statistics, because of various types of failures, at any point in time only 69.59% of the 18,000 sensors are in working condition on average. The sensors report NULL values when they fail, and we excluded those failed sensors from the dataset. As a result, we had 11,160 sensors in the final data for the entire year of 2018. Even for the selected 11,160 sensors, there were missing values at several timestamps in the time series data. We found that 0.06% (698,162 out of 1,173,139,200) of the data points had missing speed values. We filled in each missing value with the mean of the values observed at the same time of day and day of the week over a period of time, treating holidays separately from normal working days.
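The imputation scheme described above can be sketched as follows. This is a simplified stand-in for the paper's preprocessing (the holiday handling mentioned in the text is omitted, and the function name is ours): missing readings are replaced by the mean of all observed readings at the same (day-of-week, time-of-day) slot.

```python
from collections import defaultdict
from datetime import datetime

def impute_missing(timestamps, speeds):
    """Fill None speed readings with the mean of observed readings at
    the same (day-of-week, time-of-day) slot. Sketch only: the paper's
    holiday handling is not reproduced here."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # First pass: accumulate per-slot statistics from observed values.
    for t, s in zip(timestamps, speeds):
        if s is not None:
            key = (t.weekday(), t.hour, t.minute)
            sums[key] += s
            counts[key] += 1
    # Second pass: replace missing values with the slot mean.
    filled = []
    for t, s in zip(timestamps, speeds):
        if s is None:
            key = (t.weekday(), t.hour, t.minute)
            s = sums[key] / counts[key] if counts[key] else 0.0
        filled.append(s)
    return filled
```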
The dataset is saved in HDF5 file format. Each column of the HDF5 file is the time series data for a sensor, and the indices are timestamps at 5-minute frequency, starting from 00:00:00, January 1, 2018, to 23:55:00, December 31, 2018. The data used for the experiments can be downloaded at the following link:
To fetch data efficiently for training, we created a TFRecord dataset from the HDF5 file. The input pipeline fetches the data for the next batch before finishing the current batch; we use the tf.data API for this purpose. The script used to convert the HDF5 file to the TFRecord dataset is given at the following link:
Besides the time series data, PeMS captures spatial information such as the postmile markers for each sensor and the latitude and longitude of each postmile marker. To get the latitude and longitude of each sensor, we matched the postmile markers; where an exact match cannot be found, the latitude and longitude are obtained by linear interpolation. The latitude and longitude are then used to get the driving distance between the sensors. We used the Open Source Routing Machine (OSRM) [osrm] Docker solution to compute the road network distance: OSRM takes the latitude and longitude of two sensor IDs and computes the shortest driving distance between them. We limited the OSRM queries to the 30 Euclidean nearest neighbors, which we precomputed for each sensor ID. The road network distances between the sensors are used to build the adjacency matrix for DCRNN. The computed adjacency matrix for 11,160 sensors is available at the following link:
The training and test datasets are normalized by using the standard scaler method in scikit-learn [scikitlearn]. The normalized feature is computed as x' = (x − μ)/σ, where x is the input feature, μ is the mean of the training samples, and σ is the standard deviation of the training samples. We applied the inverse of this transform before computing the MAE on the test dataset.
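The normalization and its inverse amount to the following minimal sketch, which mirrors the behavior of scikit-learn's StandardScaler for this use case (fit on training data only, then applied to train and test alike):

```python
import numpy as np

class StandardScaler:
    """Minimal z-score scaler: x' = (x - mean) / std, with the
    statistics estimated from the training data only."""
    def fit(self, x):
        self.mean = x.mean()
        self.std = x.std()
        return self

    def transform(self, x):
        return (x - self.mean) / self.std

    def inverse_transform(self, x):
        # Undo the scaling, e.g. on model predictions before computing MAE.
        return x * self.std + self.mean
```

Applying `inverse_transform` to the model outputs before computing MAE ensures the reported errors are in the original units (mph).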
Hardware
For the experimental evaluation, we used Cooley, a GPUbased cluster at the Argonne Leadership Computing Facility. It has 126 compute nodes, each node consisting of two 2.4 GHz Intel Haswell E52620 v3 processors (6 cores per CPU, 12 cores total), one NVIDIA Tesla K80 (two GPUs per node), 384 GB of RAM per node, and 24 GB GPU RAM per node (12 GB per GPU). The compute nodes are interconnected via an InfiniBand fabric.
Software settings
We used Python 3.6.0, TensorFlow 1.3.1, NumPy 1.16.3, Pandas 0.19.2, and HDF5 1.8.17. We used METIS 5.1.0 for graph partitioning, with the multilevel k-way partitioning algorithm, which creates roughly equal-sized partitions. It takes approximately 0.030 seconds to create 64 partitions on a graph of 11,160 nodes.
Hyperparameters
We used the same hyperparameter values for all the training runs. These values are the defaults of the open-source DCRNN implementation [li2017diffusiongit]: batch size, 64; filter type, random walk; maximum diffusion steps, 2; number of RNN layers, 2; number of RNN units per layer, 16; max_grad_norm threshold to clip the gradient norm and avoid the exploding gradient problem of RNNs [pascanu2013difficulty], 5; initial learning rate, 0.01; and learning rate decay, 0.1.
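For convenience, the settings above can be collected into a single configuration dictionary. The key names here are illustrative and not necessarily those used by the open-source DCRNN code:

```python
# Hyperparameters used for all TL-DCRNN training runs, as listed in the
# text above (defaults of the open-source DCRNN implementation).
# Key names are illustrative.
HPARAMS = {
    "batch_size": 64,
    "filter_type": "random_walk",
    "max_diffusion_step": 2,
    "num_rnn_layers": 2,
    "rnn_units": 16,
    "max_grad_norm": 5,      # gradient-norm clipping threshold
    "base_lr": 0.01,         # initial learning rate
    "lr_decay_ratio": 0.1,   # learning rate decay
}
```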
Source code
The code for TL-DCRNN, along with the script to run it, is available at the following URL: