Domain Adversarial Spatial-Temporal Network: A Transferable Framework for Short-term Traffic Forecasting across Cities

Accurate real-time traffic forecast is critical for intelligent transportation systems (ITS) and it serves as the cornerstone of various smart mobility applications. Though this research area is dominated by deep learning, recent studies indicate that the accuracy improvement by developing new model structures is becoming marginal. Instead, we envision that the improvement can be achieved by transferring the "forecasting-related knowledge" across cities with different data distributions and network topologies. To this end, this paper aims to propose a novel transferable traffic forecasting framework: Domain Adversarial Spatial-Temporal Network (DASTNet). DASTNet is pre-trained on multiple source networks and fine-tuned with the target network's traffic data. Specifically, we leverage the graph representation learning and adversarial domain adaptation techniques to learn the domain-invariant node embeddings, which are further incorporated to model the temporal traffic data. To the best of our knowledge, we are the first to employ adversarial multi-domain adaptation for network-wide traffic forecasting problems. DASTNet consistently outperforms all state-of-the-art baseline methods on three benchmark datasets. The trained DASTNet is applied to Hong Kong's new traffic detectors, and accurate traffic predictions can be delivered immediately (within one day) when the detector is available. Overall, this study suggests an alternative to enhance the traffic forecasting methods and provides practical implications for cities lacking historical traffic data.




1. Introduction

Figure 1. An overview of the transferable traffic forecasting and its applications.

Short-term traffic forecasting (jiang2021graph; bolshinsky2012traffic) has always been a challenging task due to the complex and dynamic spatial-temporal dependencies of the network-wide traffic states. When the spatial attributes and temporal patterns of traffic states are convoluted, their intrinsic interactions could make the traffic forecasting problem intractable. Many classical methods (williams2003modeling; drucker1997support) take temporal information into consideration but cannot effectively utilize spatial information. With the rise of deep learning and its application in intelligent transportation systems (ITS) (dai2020hybrid; barnes2020bustr; zhang2020curb), a number of deep learning components, such as convolutional neural networks (CNNs), graph neural networks (GNNs), and recurrent neural networks (RNNs) (fu2016using), are employed to model the spatial-temporal characteristics of the traffic data (song2020spatial; du2017traffic; guo2019attention; cai2020traffic; li2021dynamic). These deep learning based spatial-temporal models achieve impressive performances on traffic forecasting tasks.

However, recent studies indicate that the improvement of the forecasting accuracy induced by modifying neural network structures has become marginal (jiang2021graph), and hence there is a great need to seek alternative approaches to further boost the performance of deep learning-based traffic forecasting models. One key observation is that most existing traffic forecasting models are designed for a single city or network. Therefore, a natural idea is to train and apply the traffic forecasting models across multiple cities, with the hope that the “knowledge related to traffic forecasting” can be transferred among cities, as illustrated in Figure 1. The idea of transfer learning has achieved huge success in areas such as computer vision and language processing (pan2009survey; patel2015visual; liu2019survey), while the related studies for traffic forecasting are still premature (yin2021deep).

There are few traffic forecasting methods aiming to adopt transfer learning to improve model performances across cities (wang2018cross; yao2019learning; wang2021spatio; tian2021transfer). These methods partition a city into a grid map based on the longitude and latitude, and then rely on the transferability of CNN filters for the grids. However, the city-partitioning approaches overlook the topological relationship of the road network while modeling the actual traffic states on road networks has more practical value and significance. The complexity and variety of road networks’ topological structures could result in untransferable models for most deep learning-based forecasting models (mallick2021transfer). Specifically, we consider the road networks as graphs, and the challenge is to effectively map different road network structures to the same embedding space and reduce the discrepancies among the distribution of node embedding with powerful representation learning on graphs.

As a practical example, Hong Kong is determined to transform into a smart city. The Smart City Blueprint for Hong Kong 2.0 was released in December 2020, which outlines the future smart city applications in Hong Kong (HKsmart). Building an open-source traffic data analytic platform is one essential smart mobility application among them. Consequently, Hong Kong’s Transport Department has been gradually releasing traffic data since the middle of 2021 (HKdata). As the number of detectors is still increasing (as shown in Figure 2), the duration of historical traffic data from the new detectors can be less than one month, making it impractical to train an existing traffic forecasting model. This situation also occurs in many other cities such as Paris, Shenzhen, and Liverpool (vivacitylabs2022), as the concept of smart cities has only just entered the deployment phase globally. One can see that a successful transferable traffic forecasting framework could enable the smooth transition and early deployment of smart mobility applications.

Figure 2. Available detectors in Hong Kong in September 2021 (left) and January 2022 (right).

To summarize, it is both theoretically and practically essential to develop a network-wide deep transferable framework for traffic forecasting across cities. In view of this, we propose a novel framework called Domain Adversarial Spatial-Temporal Network (DastNet), which is designed for the transferable traffic forecasting problem. This framework maps the raw node features to node embeddings through a spatial encoder. The embedding is induced to be domain-invariant by a domain classifier and is fused with traffic data in the temporal forecaster for traffic forecasting across cities. Overall, the main contributions of our work are as follows:

  • We rigorously formulate a novel transferable traffic forecasting problem for general road networks across cities.

  • We develop the domain adversarial spatial-temporal network (DastNet), a transferable spatial-temporal traffic forecasting framework based on multi-domain adversarial adaptation. To the best of our knowledge, this is the first time that adversarial domain adaptation is used in traffic forecasting to effectively learn the transferable knowledge in multiple cities.

  • We conduct extensive experiments on three real-world datasets, and the experimental results show that our framework consistently outperforms state-of-the-art models.

  • The trained DastNet is applied to Hong Kong’s newly collected traffic flow data, and the results are encouraging and could provide implications for the actual deployment of Hong Kong’s traffic surveillance and control systems such as Speed Map Panels (SMP) and Journey Time Indication System (JTIS) (tam2011application).

The remainder of this paper is organized as follows. Section 2 reviews the related work on spatial-temporal traffic forecasting and transfer learning with deep domain adaptation. Section 3 formulates the transferable traffic forecasting problem. Section 4 introduces details of DastNet. In Section 5, we evaluate the performance of the proposed framework on three real-world datasets as well as the new traffic data in Hong Kong. We conclude the paper in Section 6.

2. Related Works

2.1. Spatial-Temporal Traffic Forecasting

The spatial-temporal traffic forecasting problem is an important research topic in spatial-temporal data mining and has been widely studied in recent years. Recently, researchers have utilized GNNs (kipf2016semi; velivckovic2017graph) to model spatial-temporal networked data, since GNNs are powerful for extracting spatial features from road networks. Most existing works use GNNs and RNNs to learn spatial and temporal features, respectively (zhao2019t). Stgcn (yu2017spatio) uses CNNs to model temporal dependencies. Astgcn (guo2019attention) utilizes an attention mechanism to capture the dynamics of spatial-temporal dependencies. Dcrnn (li2017diffusion) introduces diffusion graph convolution to describe the information diffusion process in spatial networks. Dmstgcn (han2021dynamic) is based on Stgcn and learns the posterior graph for one day through back-propagation. Gman (zheng2020gman) uses spatial and temporal self-attention to capture dynamic spatial-temporal dependencies. Stgode (fang2021spatial) is an ODE-based spatial-temporal graph neural network for traffic forecasting. Although impressive results have been achieved by the works mentioned above, few of them have discussed the transferability issue, and most cannot effectively utilize traffic data across cities. For example, (lu2021learning) presents a multi-task learning framework for city heatmap-based traffic forecasting. (mallick2021transfer) leverages a graph-partitioning method that decomposes a large highway network into smaller networks and uses a model trained on data-rich regions to predict traffic on unseen regions of the highway network.

2.2. Transfer Learning with Deep Domain Adaptation

The main challenge of transfer learning is to effectively reduce the discrepancy in data distributions across domains. Deep neural networks have the ability to extract transferable knowledge through representation learning methods (yosinski2014transferable). (long2015learning) and (long2018transferable) employ Maximum Mean Discrepancy (MMD) to improve the feature transferability and learn domain-invariant information. The conventional domain adaptation paradigm transfers knowledge from one source domain to one target domain. In contrast, multi-domain learning refers to a domain adaptation method in which multiple domains’ data are incorporated in the training process (nam2016learning; yang2014unified).

In recent years, adversarial learning has been explored for generative modeling in Generative Adversarial Networks (Gans) (goodfellow2014generative). For example, Generative Multi-Adversarial Networks (Gmans) (durugkar2016generative) extend Gans to multiple discriminators, including a formidable adversary and a forgiving teacher, which significantly eases model training and enhances distribution matching. In (ganin2015unsupervised), adversarial training is used to ensure that learned features in the shared space are indistinguishable to the discriminator and invariant to the shift between domains. (pei2018multi) extends existing adversarial domain adaptation methods to multi-domain learning scenarios and proposes a multi-adversarial domain adaptation (MADA) approach that captures multi-mode structures to enable fine-grained alignment of different data distributions based on multiple domain discriminators.

3. Preliminaries

In this section, we first present definitions relevant to our work and then rigorously formulate the transferable traffic forecasting problem.

Definition 1 (Road Network $G$).

A road network is represented as an undirected graph $G = (V, E, A)$ to describe its topological structure. $V$ is a set of nodes with $|V| = N$, $E$ is a set of edges, and $A \in \mathbb{R}^{N \times N}$ is the corresponding adjacency matrix of the road network. In particular, we consider multiple road networks consisting of $N_S$ source networks and one target network. $G_{S_i} = (V_{S_i}, E_{S_i}, A_{S_i})$ denotes the $i$th source road network ($i = 1, \dots, N_S$), and $G_T = (V_T, E_T, A_T)$ denotes the target road network.

Definition 2 (Graph Signals $X$).

Let $X \in \mathbb{R}^{N \times C}$ denote the traffic state observed on $G$ as a graph signal with node signals $X_v \in \mathbb{R}^{C}$ for $v \in V$, where $C$ represents the number of features of each node (e.g., flow, occupancy, speed). Specifically, we use $X^{t}$ to denote the observation on road network $G$ at time $t$, and $X_v^{t}$ denotes the observation of node $v$ at time $t$, $t \in \mathcal{T}$, where $\mathcal{T}$ is the study time period.

We now define the transferable traffic forecasting problem.

Definition 3 (Transferable traffic forecasting).

Given historical graph signals observed on both source and target domains as input, we can divide the transferable traffic forecasting problem into the pre-training and fine-tuning stages.

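In the pre-training stage, the forecasting task $\mathcal{T}_{S_i}$ maps $T'$ historical node (graph) signals to $T'$ future node (graph) signals on a source road network $G_{S_i}$, for $v \in V_{S_i}$:

$$\left[X_v^{t-T'+1}, \dots, X_v^{t};\, G_{S_i}\right] \xrightarrow{\;F(\cdot;\,\theta)\;} \left[X_v^{t+1}, \dots, X_v^{t+T'}\right], \tag{1}$$

where $\theta$ denotes the learned function parameters.

In the fine-tuning stage, to solve the forecasting task $\mathcal{T}_{T}$, the same function initialized with parameters shared from the pre-trained function is fine-tuned to predict graph signals on the target road network, for $v \in V_T$:

$$\left[X_v^{t-T'+1}, \dots, X_v^{t};\, G_{T}\right] \xrightarrow{\;F(\cdot;\,\theta^{*}(\theta))\;} \left[X_v^{t+1}, \dots, X_v^{t+T'}\right], \tag{2}$$

where $\theta^{*}(\theta)$ denotes the function parameters adjusted from $\theta$ to fit the target domain. In the following sections, we consider the study time period $\mathcal{T} = \{t-T'+1, \dots, t+T'\}$.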
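
To make the input/output pairs of Definition 3 concrete, the following minimal Python sketch slices one node's time series into historical and future windows. The helper name `make_windows`, its parameters, and the toy series are illustrative assumptions and are not taken from the original framework.

```python
from typing import List, Tuple


def make_windows(series: List[float], n_hist: int, n_future: int
                 ) -> List[Tuple[List[float], List[float]]]:
    """Split one node's series into (history, future) pairs with a stride of one step."""
    pairs = []
    # The last valid start index must leave room for both windows.
    for start in range(len(series) - n_hist - n_future + 1):
        history = series[start:start + n_hist]
        future = series[start + n_hist:start + n_hist + n_future]
        pairs.append((history, future))
    return pairs


# Hypothetical readings for a single detector (one value per time step).
toy_flow = [float(x) for x in range(10)]
pairs = make_windows(toy_flow, n_hist=4, n_future=2)
```

The same windows would be built for every node of a source network during pre-training and for every node of the target network during fine-tuning.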

4. Proposed Methodology

In this section, we propose the domain adversarial spatial-temporal network (DastNet) to solve the transferable traffic forecasting problem. As shown in Figure 3, DastNet is trained in two stages, and we use two source domains in the figure for illustration. We first perform pre-training on all the source domains in turn without revealing labels from the target domain. Then, the model is fine-tuned on the target domain. We explain the pre-training stage and the fine-tuning stage in detail below.

Figure 3. The proposed DastNet architecture.

4.1. Stage 1: Pre-training on Source Domains

In the pre-training stage, DastNet aims to learn domain-invariant knowledge that is helpful for forecasting tasks from multiple source domains. The learned knowledge can be transferred to improve the traffic forecasting tasks on the target domain. To this end, we design three major modules for DastNet: spatial encoder, temporal forecaster, and domain classifier.

The spatial encoder aims to consistently embed the spatial information of each node in different road networks. Mathematically, given a node $v$’s raw feature $e_v \in \mathbb{R}^{D_e}$, in which $D_e$ is the dimension of raw features for each node, the spatial encoder maps it to a $D_f$-dimensional node embedding $f_v \in \mathbb{R}^{D_f}$, i.e., $f_v = M_e(e_v; \theta_e)$, where the parameters in this mapping are denoted as $\theta_e$.

Given a learned node embedding $f_v$ for network $G_{S_i}$, the temporal forecaster fulfils the forecasting task presented in Equation 1 by mapping historical node (graph) signals to the future node (graph) signals, which can be summarized by a mapping $M_y(\cdot, f_v; \theta_y)$, and we denote the parameters of this mapping as $\theta_y$.

The domain classifier takes the node embedding $f_v$ as input and maps it to the probability vector $\hat{d}_v$ for domain labels, and we use the notation $d_v$ to denote the one-hot encoding of the actual domain label of $f_v$. Note that the domain labels include all the source domains and the target domain. This mapping is represented as $\hat{d}_v = M_d(f_v; \theta_d)$. We also want to make the node embedding $f_v$ domain-invariant. That is, under the guidance of the domain classifier, we expect the learned node embedding $f_v$ to be independent of the domain label $d_v$.

At the pre-training stage, we seek the parameters $(\theta_e, \theta_y)$ of mappings $(M_e, M_y)$ that minimize the loss of the temporal forecaster, while simultaneously seeking the parameters $\theta_e$ of mapping $M_e$ that maximize the loss of the domain classifier so that it cannot identify the original domains of node embeddings learned from spatial encoders. Note that the target domain’s node embedding is involved in the pre-training process to guide the target spatial encoder to learn domain-invariant node embeddings. Then we can define the loss function of the pre-training process as:

$$\mathcal{L}(\cdot; \theta_e, \theta_y, \theta_d) = \mathcal{L}_{prd}(\cdot; \theta_e, \theta_y) - \lambda\, \mathcal{L}_{adv}(\cdot; \theta_e, \theta_d), \tag{3}$$

where $\lambda$ trades off the two losses. $\mathcal{L}_{prd}$ represents the prediction error on source domains and $\mathcal{L}_{adv}$ is the adversarial loss for domain classification. Based on our objectives, we are seeking the parameters $(\hat{\theta}_e, \hat{\theta}_y, \hat{\theta}_d)$ that reach a saddle point of the loss function:

$$(\hat{\theta}_e, \hat{\theta}_y) = \arg\min_{\theta_e, \theta_y} \mathcal{L}(\cdot; \theta_e, \theta_y, \hat{\theta}_d), \qquad \hat{\theta}_d = \arg\max_{\theta_d} \mathcal{L}(\cdot; \hat{\theta}_e, \hat{\theta}_y, \theta_d). \tag{4}$$

We will introduce the details of the loss functions in the following sections.

4.1.1. Spatial Encoder

In traffic forecasting tasks, a successful transfer of trained GNN models requires adaptability to changes in graph topology between different road networks. To solve this issue, it is important to devise a graph embedding mechanism that can capture generalizable spatial information regardless of domains. To this end, we generate the raw feature $e_v$ for a node $v$ by node2vec (grover2016node2vec) as the input of the spatial encoder. Raw node features learned from node2vec can reconstruct the “similarity” extracted from random walks, since nodes are considered similar to each other if they tend to co-occur in these random walks. In addition to modeling the similarity between nodes, we also want to learn localized node features to identify the uniqueness of the local topology around nodes. (xu2018powerful) proves that the graph isomorphism network (Gin) layer is as powerful as the Weisfeiler-Lehman (WL) test (weisfeiler1968reduction) for distinguishing different graph structures. Thus, we adopt Gin layers with mean aggregators proposed in (xu2018powerful) as our spatial encoders. Mapping $f_v = M_e(e_v; \theta_e)$ can be specified by an $L$-layer Gin as follows:

$$f_v^{(l)} = \mathrm{MLP}^{(l)}\!\left(\left(1 + \epsilon^{(l)}\right) f_v^{(l-1)} + \frac{1}{|\mathcal{N}(v)|}\sum_{u \in \mathcal{N}(v)} f_u^{(l-1)}\right), \tag{5}$$

where $f_v^{(0)} = e_v$, $\mathcal{N}(v)$ denotes the neighborhoods of node $v$ and $\epsilon^{(l)}$ is a trainable parameter, $l = 1, \dots, L$, and $L$ is the total number of layers in Gin. The node $v$’s embedding can be obtained by $f_v = f_v^{(L)}$.
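
As a rough illustration of a Gin layer with a mean aggregator, the sketch below updates each node from a scaled copy of its own feature plus the mean of its neighbors' features. It is a simplified sketch under stated assumptions: the MLP is replaced by a single linear layer with ReLU, `eps` is fixed rather than trained, and the names `gin_mean_layer` and `linear_relu` as well as the toy graph are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def linear_relu(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """One-layer stand-in for the MLP (an assumption): ReLU(W x + b)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def gin_mean_layer(features: Dict[int, Vector], neighbors: Dict[int, List[int]],
                   eps: float, weight: List[Vector], bias: Vector) -> Dict[int, Vector]:
    """Update every node from (1 + eps) * own feature + mean of neighbor features."""
    updated = {}
    for v, own in features.items():
        nbrs = neighbors[v]
        dim = len(own)
        if nbrs:
            mean_nbr = [sum(features[u][k] for u in nbrs) / len(nbrs) for k in range(dim)]
        else:
            mean_nbr = [0.0] * dim  # isolated node: no neighbor message
        combined = [(1.0 + eps) * own[k] + mean_nbr[k] for k in range(dim)]
        updated[v] = linear_relu(combined, weight, bias)
    return updated


# Toy path graph 0 - 1 - 2 with 2-dimensional raw features (hypothetical values).
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
embeddings = gin_mean_layer(features, neighbors, eps=0.0, weight=identity, bias=[0.0, 0.0])
```

Because the update only uses a node's own feature and its neighborhood, the same layer weights can be applied to road networks with different numbers of nodes, which is what makes the encoder shareable across cities.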

4.1.2. Temporal Forecaster

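The learned node embedding $f_v$ will be involved in the mapping $M_y$ to predict future node signals. We now introduce our temporal forecaster, which aims to model the temporal dependencies of traffic data. To this end, we adapt the Gated Recurrent Unit (Gru), which is a powerful RNN variant (DBLP:journals/corr/ChungGCB14; fu2016using). In particular, we extend Gru by incorporating the learned node embedding into its updating process. To realize this, the learned node embedding $f_v$ is concatenated with the hidden state of Gru (we denote the hidden state for node $v$ at time $t$ as $h_v^{t}$). Details of the mapping $M_y$ are shown below:

$$u_v^{t} = \sigma\!\left(\Theta_u \left[X_v^{t}, h_v^{t-1}\right] + b_u\right), \tag{6}$$

$$r_v^{t} = \sigma\!\left(\Theta_r \left[X_v^{t}, h_v^{t-1}\right] + b_r\right), \tag{7}$$

$$c_v^{t} = \tanh\!\left(\Theta_c \left[X_v^{t}, \left(r_v^{t} \odot h_v^{t-1}\right)\right] + b_c\right), \tag{8}$$

$$h_v^{t} = \mathrm{MLP}\!\left(\mathrm{concat}\!\left(f_v,\; u_v^{t} \odot h_v^{t-1} + \left(1 - u_v^{t}\right) \odot c_v^{t}\right)\right), \tag{9}$$

where $u_v^{t}$, $r_v^{t}$, $c_v^{t}$ are the update gate, reset gate and current memory content, respectively. $\Theta_u$, $\Theta_r$, and $\Theta_c$ are parameter matrices, and $b_u$, $b_r$, and $b_c$ are bias terms.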
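
The gate computations can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the fusion step simply concatenates the updated hidden state with the node embedding (any learned layer applied afterwards is omitted), and the function names, the `params` dictionary keys, and the toy values are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def affine(weight: Matrix, x: Vector, bias: Vector) -> Vector:
    """Compute W x + b for a small dense layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def sigmoid(x: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-xi)) for xi in x]


def gru_step_with_embedding(x_t: Vector, h_prev: Vector, node_emb: Vector,
                            params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One GRU update, then concatenation of the new state with the node embedding."""
    xh = x_t + h_prev                                       # [x_t ; h_{t-1}]
    u = sigmoid(affine(params["W_u"], xh, params["b_u"]))   # update gate
    r = sigmoid(affine(params["W_r"], xh, params["b_r"]))   # reset gate
    x_rh = x_t + [ri * hi for ri, hi in zip(r, h_prev)]     # [x_t ; r * h_{t-1}]
    c = [math.tanh(z) for z in affine(params["W_c"], x_rh, params["b_c"])]
    h_new = [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(u, h_prev, c)]
    return h_new, h_new + node_emb                          # state, fused representation


# Toy setup: 1 input feature, 2 hidden units, all-zero weights (hypothetical values).
zeros = {"W_u": [[0.0] * 3] * 2, "b_u": [0.0, 0.0],
         "W_r": [[0.0] * 3] * 2, "b_r": [0.0, 0.0],
         "W_c": [[0.0] * 3] * 2, "b_c": [0.0, 0.0]}
h_new, fused = gru_step_with_embedding([5.0], [2.0, -4.0], [0.1, 0.2], zeros)
```

With all-zero weights both gates equal 0.5 and the candidate memory is zero, so the new state is half the previous one; this makes the example easy to check by hand.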

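The pre-training stage aims to minimize the error between the actual value and the predicted value. A single-layer perceptron is designed as the output layer to map the temporal forecaster’s output $h_v^{t}$ to the final prediction $\hat{X}_v^{\tau}$, $\tau = t+1, \dots, t+T'$. The source loss is represented by:

$$\mathcal{L}_{prd}(\cdot; \theta_e, \theta_y) = \frac{1}{T'} \sum_{\tau = t+1}^{t+T'} \frac{1}{|V_{S_i}|} \sum_{v \in V_{S_i}} \left\| \hat{X}_v^{\tau} - X_v^{\tau} \right\|_1. \tag{10}$$
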
4.1.3. Domain Classifier

The difference between domains is the main obstacle in transfer learning. In the traffic forecasting problem, the primary domain difference that leads to the model’s inability to conduct transfer learning between different domains is the spatial discrepancy. Thus, spatial encoders are involved in learning domain-invariant node embeddings for both source networks and the target network in the pre-training process.

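To achieve this goal, we involve a Gradient Reversal Layer (GRL) (ganin2015unsupervised) and a domain classifier trained to distinguish the original domain of each node embedding. The GRL has no parameters and acts as an identity transform during forward propagation. During backpropagation, the GRL takes the gradient from the subsequent layer and passes its negative value to the preceding layer. In the domain classifier, given an input node embedding $f_v$, $\theta_d$ is optimized to predict the correct domain label, and $\theta_e$ is trained to maximize the domain classification loss. Based on the mapping $\hat{d}_v = M_d(f_v; \theta_d) = \mathrm{Softmax}(\mathrm{MLP}(f_v))$, $\mathcal{L}_{adv}$ is defined as:

$$\mathcal{L}_{adv}(\cdot; \theta_e, \theta_d) = \sum_{G \in G_{all}} -\frac{1}{|V_G|} \sum_{v \in V_G} \left\langle d_v, \log \hat{d}_v \right\rangle, \tag{11}$$

where $G_{all}$ is the set of all source road networks together with the target road network, $V_G$ is the node set of $G$, and the output of $\mathrm{MLP}(f_v)$ is fed into the softmax, which computes the probability vector of node $v$ belonging to each domain.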
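
The behavior of the GRL can be sketched without any deep learning library. The class below is a hand-written illustration with explicit `forward` and `backward` methods; the class name and the scaling factor `lam` (which follows the common practice of multiplying the reversed gradient by a trade-off weight) are assumptions, and a real implementation would hook into the automatic differentiation of the chosen framework instead.

```python
from typing import List


class GradientReversal:
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""

    def __init__(self, lam: float = 1.0) -> None:
        self.lam = lam

    def forward(self, x: List[float]) -> List[float]:
        # No parameters and no change to the values.
        return list(x)

    def backward(self, grad_output: List[float]) -> List[float]:
        # Flip the sign so the layers before the GRL ascend the classifier loss.
        return [-self.lam * g for g in grad_output]


grl = GradientReversal(lam=0.5)
embedding = [0.2, -1.0, 3.0]
forward_out = grl.forward(embedding)
grad_to_encoder = grl.backward([1.0, -2.0, 4.0])
```

Placing such a layer between the spatial encoder and the domain classifier lets a single ordinary gradient step train the classifier to separate domains while pushing the encoder toward embeddings the classifier cannot separate.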

4.2. Stage 2: Fine-tuning on the Target Domain

The objective of the fine-tuning stage is to utilize the knowledge gained from the pre-training stage to further improve forecasting performance on the target domain. Specifically, we adopt the parameter sharing mechanism in (pan2009survey): the parameters of the target spatial encoder and the temporal forecaster in the fine-tuning stage are initialized with the parameters trained in the pre-training stage.

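Moreover, we involve a private spatial encoder combined with the pre-trained target spatial encoder to explore both domain-invariant and domain-specific node embeddings. Mathematically, given a raw node feature $e_v$, the private spatial encoder maps it to a domain-specific node embedding $\tilde{f}_v$. This process is represented as $\tilde{f}_v = \tilde{M}_e(e_v; \tilde{\theta}_e)$, where $\tilde{M}_e$ has the same structure as $M_e$ and the parameter $\tilde{\theta}_e$ is randomly initialized. The pre-trained target spatial encoder maps the raw node feature $e_v$ to a domain-invariant node embedding $f_v$, i.e., $f_v = M_e(e_v; \theta_e^{*}(\theta_e))$, where $\theta_e^{*}(\theta_e)$ means that $\theta_e^{*}$ is initialized with the parameter $\theta_e$ trained in the pre-training stage. Note that $\tilde{M}_e$ and $M_e$ are of the same structure, and the process to generate $\tilde{f}_v$ and $f_v$ is the same as in Equation 5.

Before being incorporated into the pre-trained temporal forecaster, $\tilde{f}_v$ and $f_v$ are combined by MLP layers to learn the combined node embedding $f_v^{tar}$ of the target domain:

$$f_v^{tar} = \mathrm{MLP}_{cmb}\!\left(\mathrm{MLP}_{pre}(f_v) + \mathrm{MLP}_{pri}(\tilde{f}_v)\right). \tag{12}$$

Then, given the node signal $X_v^{t}$ ($v \in V_T$) at time $t$ and $f_v^{tar}$ as input, $\hat{X}_v^{\tau}$ is computed based on Equations 6, 7, 8, and 9. We denote the target loss at the fine-tuning stage as:

$$\mathcal{L}_{prd}^{tar}(\cdot; \theta_e^{*}, \tilde{\theta}_e, \theta_y^{*}) = \frac{1}{T'} \sum_{\tau = t+1}^{t+T'} \frac{1}{|V_T|} \sum_{v \in V_T} \left\| \hat{X}_v^{\tau} - X_v^{\tau} \right\|_1. \tag{13}$$
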
5. Experiments

We first validate the performance of DastNet using benchmark datasets, and then DastNet is experimentally deployed with the newly collected data in Hong Kong.

5.1. Offline Validation with Benchmark Datasets

Figure 4. Within-day traffic flow distributions.

PEMS04

| Method | 15min MAE | 15min RMSE | 15min MAPE (%) | 30min MAE | 30min RMSE | 30min MAPE (%) | 60min MAE | 60min RMSE | 60min MAPE (%) |
|---|---|---|---|---|---|---|---|---|---|
| Ha | 28.36±0.00 | 40.55±0.00 | 20.14±0.00 | 31.75±0.00 | 45.14±0.00 | 22.84±0.00 | 38.52±0.00 | 54.45±0.00 | 28.48±0.00 |
| Svr | 21.21±0.05 | 29.68±0.07 | 16.05±0.14 | 23.90±0.04 | 33.51±0.02 | 18.74±0.40 | 29.24±0.14 | 41.14±0.10 | 23.46±0.73 |
| Gru | 20.96±0.29 | 31.08±0.20 | 14.78±1.86 | 22.71±0.21 | 33.77±0.19 | 16.54±1.73 | 26.25±0.28 | 38.87±0.32 | 18.66±1.95 |
| Gcn | 48.65±0.04 | 68.89±0.06 | 40.53±0.88 | 49.49±0.05 | 69.97±0.06 | 41.42±0.78 | 51.63±0.06 | 72.65±0.07 | 44.03±0.49 |
| Tgcn | 24.09±1.35 | 34.31±1.59 | 18.26±1.38 | 25.22±0.96 | 36.09±1.22 | 19.34±1.07 | 27.16±0.65 | 38.76±0.94 | 20.84±0.37 |
| Stgcn | 27.03±1.30 | 38.26±1.35 | 25.16±4.33 | 27.91±0.88 | 39.65±0.78 | 25.33±5.06 | 35.55±2.43 | 49.12±4.01 | 37.74±5.15 |
| Dcrnn | 23.73±0.62 | 34.27±0.71 | 18.84±0.75 | 26.68±0.94 | 37.63±1.00 | 21.39±1.90 | 33.79±1.77 | 46.70±1.91 | 29.68±1.76 |
| Trans Gru | 20.70±0.60 | 30.80±0.46 | 14.72±1.91 | 22.22±0.15 | 33.19±0.13 | 15.53±0.76 | 25.88±0.09 | 38.33±0.12 | 17.84±1.09 |
| Target Only | 19.81±0.06 | 29.77±0.03 | 13.95±0.47 | 21.55±0.09 | 32.26±0.13 | 14.83±0.21 | 24.59±0.13 | 36.31±0.15 | 17.45±0.39 |
| DastNet w/o Da | 19.65±0.11 | 29.52±0.14 | 13.53±0.35 | 21.57±0.41 | 32.26±0.76 | 15.09±0.54 | 23.84±0.10 | 35.21±0.14 | 17.03±0.44 |
| DastNet w/o Pri | 19.35±0.09 | 29.05±0.15 | 13.54±0.24 | 21.00±0.54 | 31.40±0.87 | 14.61±0.31 | 22.96±0.38 | 34.02±0.54 | 16.51±0.58 |
| DastNet | 19.25±0.03 | 28.91±0.05 | 13.30±0.22 | 20.67±0.07 | 30.78±0.04 | 14.56±0.31 | 22.82±0.08 | 33.77±0.13 | 16.10±0.18 |

PEMS07

| Method | 15min MAE | 15min RMSE | 15min MAPE (%) | 30min MAE | 30min RMSE | 30min MAPE (%) | 60min MAE | 60min RMSE | 60min MAPE (%) |
|---|---|---|---|---|---|---|---|---|---|
| Ha | 32.85±0.00 | 46.56±0.00 | 15.10±0.00 | 37.09±0.00 | 52.38±0.00 | 17.26±0.00 | 45.43±0.00 | 63.93±0.00 | 21.66±0.00 |
| Svr | 23.36±0.38 | 32.30±0.28 | 14.97±1.41 | 27.33±0.30 | 37.60±0.22 | 19.23±0.89 | 36.90±0.98 | 49.13±0.77 | 33.50±2.83 |
| Gru | 23.77±0.49 | 34.49±0.52 | 11.21±0.66 | 25.31±0.37 | 37.85±0.38 | 12.87±2.08 | 29.39±0.25 | 43.89±0.35 | 13.26±0.37 |
| Gcn | 50.81±0.56 | 71.67±0.50 | 36.47±1.57 | 51.94±0.24 | 73.18±0.30 | 39.10±1.26 | 55.09±0.07 | 77.15±0.10 | 41.46±0.42 |
| Tgcn | 30.18±0.41 | 42.11±0.56 | 15.74±0.99 | 30.84±2.77 | 43.58±3.37 | 15.19±1.59 | 33.25±1.45 | 47.24±1.82 | 16.58±1.04 |
| Stgcn | 34.14±6.13 | 48.58±7.32 | 19.67±6.38 | 39.50±2.76 | 43.58±3.37 | 15.09±1.59 | 43.45±2.50 | 60.67±3.23 | 27.57±1.36 |
| Dcrnn | 26.66±1.23 | 37.66±1.39 | 16.68±1.31 | 31.06±1.39 | 43.38±1.75 | 19.94±2.48 | 51.09±6.82 | 66.26±7.42 | 48.29±17.74 |
| Trans Gru | 23.11±0.54 | 34.07±0.38 | 10.97±1.25 | 24.70±0.20 | 37.13±0.22 | 10.98±0.58 | 28.55±0.18 | 42.72±0.22 | 12.67±0.17 |
| Target Only | 21.71±0.13 | 32.93±0.22 | 9.41±0.11 | 24.61±1.00 | 37.15±1.46 | 10.80±0.69 | 28.88±0.65 | 43.13±0.98 | 13.18±0.77 |
| DastNet w/o Da | 21.80±0.26 | 33.09±0.44 | 9.45±0.18 | 24.52±0.55 | 37.05±0.94 | 10.77±0.42 | 28.61±0.56 | 42.88±0.91 | 12.74±0.42 |
| DastNet w/o Pri | 21.23±0.14 | 32.28±0.24 | 9.20±0.15 | 23.85±0.47 | 36.10±0.71 | 10.51±0.22 | 28.37±1.06 | 42.51±1.64 | 12.74±0.50 |
| DastNet | 20.91±0.03 | 31.85±0.05 | 8.95±0.13 | 22.96±0.10 | 34.80±0.11 | 9.87±0.19 | 26.88±0.28 | 40.12±0.29 | 11.75±0.33 |

PEMS08

| Method | 15min MAE | 15min RMSE | 15min MAPE (%) | 30min MAE | 30min RMSE | 30min MAPE (%) | 60min MAE | 60min RMSE | 60min MAPE (%) |
|---|---|---|---|---|---|---|---|---|---|
| Ha | 23.12±0.00 | 33.03±0.00 | 14.61±0.00 | 26.12±0.00 | 37.16±0.00 | 16.55±0.00 | 32.15±0.00 | 45.41±0.00 | 20.60±0.00 |
| Svr | 37.63±2.42 | 46.59±2.56 | 20.79±1.47 | 45.79±2.59 | 56.16±2.70 | 24.29±1.02 | 66.91±3.82 | 79.72±4.07 | 33.20±1.86 |
| Gru | 16.69±0.40 | 24.72±0.41 | 11.05±0.93 | 18.89±0.67 | 28.14±0.65 | 13.45±3.18 | 20.94±0.24 | 31.32±0.19 | 15.20±0.94 |
| Gcn | 64.63±0.08 | 87.30±0.10 | 90.32±1.83 | 65.09±0.06 | 87.87±0.08 | 91.64±1.12 | 66.24±0.11 | 89.21±0.10 | 94.01±1.93 |
| Tgcn | 20.65±0.96 | 28.77±1.13 | 15.06±1.20 | 21.60±1.44 | 30.40±1.78 | 15.97±2.42 | 24.33±2.51 | 34.20±3.14 | 17.91±4.77 |
| Stgcn | 25.90±1.60 | 35.58±1.98 | 18.91±2.35 | 26.20±1.75 | 36.52±2.34 | 17.73±0.74 | 31.89±4.23 | 43.94±5.56 | 20.99±2.41 |
| Dcrnn | 20.61±0.97 | 29.03±1.08 | 20.36±1.62 | 23.23±1.24 | 32.76±1.44 | 24.53±2.77 | 39.14±7.12 | 51.97±8.41 | 47.62±19.08 |
| Trans Gru | 15.99±0.10 | 23.95±0.11 | 9.93±0.45 | 17.77±0.40 | 26.56±0.31 | 12.08±1.75 | 20.03±0.33 | 29.86±0.21 | 14.80±2.10 |
| Target Only | 16.50±0.12 | 24.58±0.12 | 11.07±0.16 | 17.95±1.04 | 26.63±1.24 | 11.90±2.09 | 19.69±0.33 | 29.37±0.40 | 12.48±0.37 |
| DastNet w/o Da | 16.51±0.30 | 24.50±0.38 | 10.55±0.99 | 17.58±0.81 | 26.31±1.21 | 11.22±0.76 | 19.37±0.46 | 28.87±0.54 | 11.95±0.36 |
| DastNet w/o Pri | 15.75±0.25 | 23.60±0.41 | 10.00±0.22 | 16.87±0.38 | 25.38±0.68 | 10.55±0.14 | 18.90±0.20 | 28.28±0.20 | 12.52±0.64 |
| DastNet | 15.26±0.18 | 22.70±0.17 | 9.64±0.37 | 16.41±0.34 | 24.57±0.39 | 10.46±0.31 | 18.84±0.12 | 28.06±0.17 | 11.72±0.29 |

Table 1. Performance comparison of different methods with 10 days’ training data (mean ± std).

We evaluate the performance of DastNet on three real-world datasets, PEMS04, PEMS07, and PEMS08, which are collected from the Caltrans Performance Measurement System (PEMS) (caltrans) every 30 seconds. There are three kinds of traffic measurements in the raw data: speed, flow, and occupancy. In this study, we forecast the traffic flow for evaluation purposes, and it is aggregated to 5-minute intervals, which means there are 12 time intervals for each hour and 288 time intervals for each day. The unit of traffic flow is veh/hour (vph). The within-day traffic flow distributions are shown in Figure 4. One can see that flow distributions vary significantly over the day for different datasets, and hence domain adaptation is necessary.

The road network for each dataset is constructed according to the actual road network, and we define the adjacency matrix based on connectivity. Mathematically, $A_{i,j} = 1$ if $v_i$ connects to $v_j$ and $A_{i,j} = 0$ otherwise, where $v_i$ denotes node $i$ in the road network. Moreover, we normalize the graph signals by the following formula: $X = \frac{X - \mathrm{mean}(X)}{\mathrm{std}(X)}$, where function $\mathrm{mean}$ and function $\mathrm{std}$ calculate the mean value and the standard deviation of historical traffic data, respectively.
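
The two preprocessing steps can be illustrated with a short sketch. It assumes an undirected edge list as input and uses the population standard deviation; the function names, the toy edges, and the toy flow values are hypothetical, and a constant series (zero standard deviation) would need separate handling.

```python
import math
from typing import List, Tuple


def adjacency_from_edges(n_nodes: int, edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Build a symmetric 0/1 adjacency matrix from a list of connected node pairs."""
    adj = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        adj[i][j] = 1
        adj[j][i] = 1  # undirected graph
    return adj


def z_score(values: List[float]) -> Tuple[List[float], float, float]:
    """Normalize with the mean and (population) standard deviation of the history."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values], mean, std


adj = adjacency_from_edges(3, [(0, 1), (1, 2)])
normalized, mean, std = z_score([10.0, 20.0, 30.0])
```

Keeping `mean` and `std` from the training history allows predictions to be mapped back to the original flow units before the error metrics are computed.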

5.1.1. Baseline Methods

  • Ha (liu2004summary): Historical Average method uses average value of historical traffic flow data as the prediction of future traffic flow.

  • Svr (smola2004tutorial): Support Vector Regression adopts support vector machines to solve regression tasks.

  • Gru (cho2014properties): Gated Recurrent Unit (Gru) is a well-known variant of RNN which is powerful at capturing temporal dependencies.

  • Gcn (kipf2016semi): Graph Convolutional Network can handle arbitrary graph-structured data and has been proved to be powerful at capturing spatial dependencies.

  • Tgcn (zhao2019t): Temporal Graph Convolutional Network performs stably well for short-term traffic forecasting tasks.

  • Stgcn (yu2017spatio): Spatial-Temporal Graph Convolutional Network uses ChebNet and 2D convolutions for traffic prediction.

  • Dcrnn (li2017diffusion): Diffusion Convolutional Recurrent Neural Network combines GNN and RNN with dual directional diffusion convolutions.

  • Trans Gru: This is a simple transfer learning method that applies the same Gru to each node separately.

To demonstrate the effectiveness of each key module in DastNet, we compare with some variants of DastNet as follows:

  • Target Only: DastNet without training at the pre-training stage. The comparison with this baseline method demonstrates the merits of training on other data sources.

  • DastNet w/o Da: DastNet without the adversarial domain adaptation (domain classifier). The comparison with this baseline method demonstrates the merits of domain-invariant features.

  • DastNet w/o Pri: DastNet without the private encoder at the fine-tuning stage.

The remaining settings of the above variants are the same as those of DastNet.

5.1.2. Experimental Settings

To simulate the lack of data, for each dataset, we randomly select ten consecutive days’ traffic flow data from the original training set as our training set, and the validation/testing sets are the same as in (li2021dynamic). We use one-hour historical traffic flow data for training and forecast the traffic flow in the next 15, 30, and 60 minutes (horizon=3, 6, and 12, respectively). For one dataset $D$, DastNet-related methods and Trans Gru are pre-trained on the other two datasets and fine-tuned on $D$. Other methods are trained on $D$. All experiments are repeated 5 times. Other hyper-parameters are determined based on the validation set. For the details of hyper-parameters and reproducibility, please refer to Appendix D.

5.1.3. Experiment Results

Table 1 shows the performance comparison of different methods for traffic flow forecasting. These methods are evaluated based on Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and Mean Absolute Percentage Error (MAPE). Details of these metrics are provided in Appendix B. It can be seen that DastNet achieves the state-of-the-art forecasting performance on the three datasets for all evaluation metrics and all prediction horizons. Traditional statistical methods like Ha and Svr are less powerful compared to deep learning methods such as Gru. The performance of Gcn is poor, as it overlooks the temporal patterns of the data. DastNet outperforms existing spatial-temporal models like Tgcn, Stgcn, and Dcrnn, and it achieves approximately 7.4%, 7.0% and 10.9% improvements compared to the best baseline method in MAE, RMSE, and MAPE, respectively. Table 4 and Figure 8 in Appendix A summarize the improvements of our methods.
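
For readers who want to recompute such figures, the sketch below implements the three metrics in their standard textbook form together with a relative-improvement helper. The exact definitions used in the paper are in Appendix B, so the masking of zero targets or any other detail may differ; the function names and the toy predictions are assumptions. The last line applies the helper to the PEMS04 15-minute MAE values of Trans Gru (20.70) and DastNet (19.25) from Table 1.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent; assumes no zero targets."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_gain(baseline: float, ours: float) -> float:
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - ours) / baseline


# Hypothetical flow values (veh/hour) and predictions.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 380.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))

# PEMS04, 15-minute horizon, MAE: Trans Gru vs. DastNet (values from Table 1).
gain = relative_gain(20.70, 19.25)
```

For this single table cell the reduction is about 7.0%; the percentages quoted in the text are averages over datasets and horizons.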

Ablation Study. From Table 1, the MAE, RMSE and MAPE of Target Only are reduced by approximately 4.7%, 7% and 10.6% compared to Gru (see Table 4), which demonstrates that the temporal forecaster outperforms Gru due to the incorporation of the learned node embedding. The accuracy of DastNet is superior to that of Target Only, DastNet w/o Da and DastNet w/o Pri, which shows the effectiveness of pre-training, adversarial domain adaptation and the private encoder. Interestingly, the difference between the results of DastNet and DastNet w/o Pri on PEMS07 is generally larger than that on PEMS04 and PEMS08. According to Figure 4, the data distributions of PEMS04 and PEMS08 are similar, while the data distribution of PEMS07 differs more from both. This further implies that our private encoder can capture domain-specific information and supplement the information learned from the domain adaptation.

Effects of Domain Adaptation. To demonstrate the effectiveness of the proposed adversarial domain adaptation module, we visualize the raw feature of the node (generated from node2vec) and the learned node embedding (generated from spatial encoders) in Figure 5 using t-SNE (van2013barnes). As illustrated, node2vec learns graph connectivity for each specific graph, and hence the raw features are separate in Figure 5. In contrast, the adversarial training process successfully guides the spatial encoder to learn more uniformly distributed node embeddings on different graphs.

Figure 5. Visualization of the raw node features and the learned node embeddings by t-SNE.

Sensitivity Analysis. To further demonstrate the robustness of DastNet, we conduct additional experiments with training sets of different sizes, varying the number of days of traffic flow data in the training set. Specifically, we use four training sets containing 1 day, 10 days, 30 days, and all available data, respectively. We compare DastNet with Stgcn and Tgcn, which achieve the most stable and best performance among the baseline methods, on PEMS04, PEMS07, and PEMS08. The performance of Dcrnn degrades drastically when the training set is small, so we do not include it in the comparison.

Figure 6. MAE under different sizes of training sets.
Figure 7. Traffic data and system workflow for the experimental deployment of DastNet in Hong Kong.

Experimental results of the sensitivity analysis are provided in Figure 6. In most cases, Stgcn and Tgcn underperform Ha when the training set is small. In contrast, DastNet consistently outperforms the other models at all prediction horizons on all datasets. Another observation is that the improvements over baseline methods are more significant in few-shot settings (small training sets). Specifically, the average MAE reductions with 1/10/30/all days of training data are approximately 42.1%/23.3%/14.7%/14.9% compared with Tgcn and 46.7%/35.7%/30.7%/34% compared with Stgcn.
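As a rough check on how such gains can be computed, the sketch below takes the 1-day MAE values of Tgcn and DastNet from Table 3 and averages the per-cell relative reductions. The exact averaging procedure behind the reported figures is not stated, so this is one plausible reading; it gives a value close to, but not identical with, the reported 42.1%.

```python
# 1-day MAE values from Table 3, ordered PEMS04, PEMS07, PEMS08,
# each at 15, 30 and 60 minutes.
tgcn_mae = [35.00, 36.64, 40.50, 39.73, 40.35, 43.89, 36.91, 38.69, 40.51]
dastnet_mae = [20.71, 23.12, 26.73, 22.40, 25.21, 29.92, 16.29, 18.10, 21.47]


def relative_reduction(baseline, ours):
    """Fraction by which `ours` lowers the error of `baseline`."""
    return (baseline - ours) / baseline


reductions = [relative_reduction(b, o) for b, o in zip(tgcn_mae, dastnet_mae)]

# Assumption: an unweighted mean over the nine dataset/horizon cells.
avg_reduction = sum(reductions) / len(reductions)
print(f"{100 * avg_reduction:.1f}%")
```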

Case Study. We randomly select six detectors and visualize the traffic flow sequences predicted by DastNet and Stgcn in Figure 11 in Appendix A. The ground-truth traffic flow sequence is also plotted for comparison. One can see that the predictions generated by DastNet are much closer to the ground truth than those by Stgcn. DastNet could accurately predict the peak traffic, which might be because DastNet learns the traffic trends from multiple datasets and ignores the small oscillations that only exist in a specific dataset.

5.2. Experimental Deployment in Hong Kong

By the end of 2022, we aim to deploy a reliable traffic information provision system in Hong Kong using traffic detector data on strategic routes from the Transport Department (transportdepartment). The new system could supplement the existing Speed Map Panels (SMP) and Journey Time Indication System (JTIS) by employing more reliable models and real-time traffic data. For both systems, flow data is essential: it is collected from traffic detectors at selected locations for automatic incident detection, as the JTIS and SMP use the flow data to simulate traffic propagation, especially after car crashes (tam2011application). Additionally, DastNet could be further extended to speed forecasting. As discussed in Section 1, the historical traffic data for the new detectors in Hong Kong are very limited. Figure 7 shows a) the spatial distribution of the detectors newly deployed in January 2022, and b) the corresponding traffic flow in Hong Kong. After the raw data are processed as presented in c), traffic flow on the new detectors can be predicted and fed into the downstream applications once the detector is available.

We use the traffic flow data from the three PEMS datasets for pre-training, and use Hong Kong’s traffic flow data on January 10, 2022 to fine-tune our model. All of Hong Kong’s traffic flow data on January 11, 2022 are used as the testing set. Details of the traffic flow data in Hong Kong are given in Appendix C. Ha and the best spatial-temporal baseline methods, Tgcn and Stgcn, are adopted for comparison. All experiments are repeated 5 times, and the average results are shown in Table 2. One can read from the table that, with DastNet pre-trained on other datasets, accurate traffic predictions can be delivered to travelers immediately (after one day) once the detector data is available.

HK         15min                     30min                     60min
           MAE     RMSE    MAPE      MAE     RMSE    MAPE      MAE     RMSE    MAPE
Ha         21.02   28.66   23.26%    23.43   31.71   25.17%    28.13   37.76   29.03%
Tgcn       22.39   30.50   27.54%    22.39   30.48   26.76%    25.95   35.61   27.98%
Stgcn      39.86   55.79   46.80%    39.34   55.34   45.62%    42.52   58.95   52.94%
DastNet    14.91   21.10   18.18%    16.74   23.74   19.84%    20.14   28.58   22.76%
Table 2. Performance comparison on the newly collected data in Hong Kong.

6. Conclusion

In this study, we formulated the transferable traffic forecasting problem and proposed an adversarial multi-domain adaptation framework named Domain Adversarial Spatial-Temporal Network (DastNet). To the best of our knowledge, this is the first attempt to apply adversarial domain adaptation to network-wide traffic forecasting tasks on general graph-based networks. Specifically, DastNet is pre-trained on multiple source datasets and then fine-tuned on the target dataset to improve the forecasting performance. The spatial encoder learns a uniform node embedding for all graphs, the domain classifier forces the node embedding to be domain-invariant, and the temporal forecaster generates the prediction. DastNet obtained significant and consistent improvements over baseline methods on benchmark datasets and will be deployed in Hong Kong to enable the smooth transition and early deployment of smart mobility applications.

We will further explore the following aspects for future work: (1) Possible ways to evaluate, reduce and eliminate discrepancies of time-series-based graph signal sequences across different domains. (2) The effectiveness of the private encoder does not conform to domain adaptation theory (ben2010theory), and it is interesting to derive theoretical guarantees for the necessity of the private encoder on target domains. In the experimental deployment, we observe that the performance of existing traffic forecasting methods degrades drastically when the traffic flow rate is low. However, this situation is barely covered in the PEMS datasets, which could potentially make the current evaluation of traffic forecasting methods biased.


This study was supported by the Research Impact Fund for “Reliability-based Intelligent Transportation Systems in Urban Road Network with Uncertainty” from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. PolyU R5029-18). The authors thank the Transport Department of the Government of the Hong Kong Special Administrative Region for providing the relevant traffic data and suggestions for the experimental deployment in Hong Kong.


Appendix A Full Results

Table 3 reports the full results of the sensitivity analysis, and Table 4 summarizes the comparisons between 1) the best baseline method and DastNet, and 2) Gru and Target Only. Figures 9 and 10 visualize the RMSE and MAPE results to supplement Figure 6.

PEMS04            15min                  30min                  60min
Training Method   MAE    RMSE   MAPE(%)  MAE    RMSE   MAPE(%)  MAE    RMSE   MAPE(%)
1 day    Ha       28.36  40.55  20.14    31.75  45.14  22.84    38.52  54.45  28.48
         Tgcn     35.00  48.43  29.12    36.64  50.22  33.04    40.50  55.83  29.54
         Stgcn    34.51  47.86  34.48    37.26  53.60  30.05    42.11  58.46  37.31
         DastNet  20.71  31.19  14.22    23.12  34.65  16.28    26.73  39.87  18.34
10 days  Ha       28.36  40.55  20.14    31.75  45.14  22.84    38.52  54.45  28.48
         Tgcn     24.09  34.31  18.26    25.22  36.09  19.34    27.16  38.76  20.84
         Stgcn    27.03  38.26  25.16    27.91  29.65  25.33    35.55  49.12  37.74
         DastNet  19.28  28.96  13.52    20.67  30.78  14.56    22.82  33.77  16.10
30 days  Ha       28.36  40.55  20.14    31.75  45.14  22.84    38.52  54.45  28.48
         Tgcn     25.14  35.67  20.39    23.01  33.34  17.27    24.85  35.74  17.55
         Stgcn    22.98  33.59  19.90    25.40  37.01  19.77    34.89  46.38  48.15
         DastNet  19.47  29.29  13.91    21.01  31.41  14.55    21.42  31.95  14.92
all      Ha       28.36  40.55  20.14    31.75  45.14  22.84    38.52  54.45  28.48
         Tgcn     21.19  30.84  15.89    22.31  32.47  17.04    24.67  35.69  17.29
         Stgcn    22.89  33.37  20.87    26.77  38.19  24.57    31.86  43.74  37.63
         DastNet  19.48  29.30  13.37    20.25  30.26  14.35    21.4   32.09  15.19
PEMS07            15min                  30min                  60min
Training Method   MAE    RMSE   MAPE(%)  MAE    RMSE   MAPE(%)  MAE    RMSE   MAPE(%)
1 day    Ha       32.85  46.52  15.10    37.09  52.38  17.26    45.43  63.93  21.66
         Tgcn     39.73  54.00  20.44    40.35  55.59  20.75    43.89  60.45  25.31
         Stgcn    41.73  58.78  27.23    43.71  61.13  26.73    53.82  73.24  31.07
         DastNet  22.40  33.56  11.47    25.21  37.33  13.38    29.92  44.23  16.73
10 days  Ha       32.85  46.52  15.10    37.09  52.38  17.26    45.43  63.93  21.66
         Tgcn     30.18  42.11  15.74    30.84  43.58  15.19    33.25  47.24  16.58
         Stgcn    34.14  48.58  19.67    39.50  43.58  15.09    43.45  60.67  27.57
         DastNet  20.91  31.85  8.95     22.96  34.80  9.87     27.16  40.41  12.08
30 days  Ha       32.85  46.52  15.10    37.09  52.38  17.26    45.43  63.93  21.66
         Tgcn     25.4   36.51  12.53    28.03  40.13  14.14    31.32  44.95  15.67
         Stgcn    28.70  40.93  15.05    42.68  54.61  32.99    40.50  55.30  26.18
         DastNet  21.60  32.85  9.22     23.92  35.88  10.41    27.65  41.2   12.13
all      Ha       32.85  46.52  15.10    37.09  52.38  17.26    45.43  63.93  21.66
         Tgcn     24.86  35.93  11.76    25.73  37.62  11.5     28.55  41.60  13.11
         Stgcn    34.01  45.15  24.38    44.13  55.73  36.44    39.12  52.87  30.68
         DastNet  19.80  30.52  8.63     21.24  32.58  9.21     23.92  36.18  10.79
PEMS08            15min                  30min                  60min
Training Method   MAE    RMSE   MAPE(%)  MAE    RMSE   MAPE(%)  MAE    RMSE   MAPE(%)
1 day    Ha       23.12  33.03  14.61    26.12  37.16  16.55    32.15  45.41  20.60
         Tgcn     36.91  48.18  41.74    38.69  51.09  38.98    40.51  53.68  40.87
         Stgcn    42.88  59.27  41.71    42.72  58.55  43.63    43.75  58.83  37.05
         DastNet  16.29  24.35  10.56    18.1   27.60  11.50    21.47  31.99  15.51
10 days  Ha       23.12  33.03  14.61    26.12  37.16  16.55    32.15  45.41  20.60
         Tgcn     20.65  28.77  15.06    26.20  36.52  17.73    24.33  34.20  17.91
         Stgcn    25.90  35.58  18.91    23.23  32.76  24.53    31.89  43.94  20.99
         DastNet  15.44  22.87  9.64     16.41  24.57  10.46    20.03  29.86  14.80
30 days  Ha       23.12  33.03  14.61    26.12  37.16  16.55    32.15  45.41  20.60
         Tgcn     18.29  25.99  12.33    19.77  28.11  13.60    21.30  30.43  13.81
         Stgcn    21.93  30.83  15.11    23.62  33.27  14.72    26.38  36.67  19.84
         DastNet  15.32  22.92  9.55     16.34  24.68  10.34    18.37  27.57  11.31
all      Ha       23.12  33.03  14.61    26.12  37.16  16.55    32.15  45.41  20.60
         Tgcn     17.81  25.37  11.46    20.40  28.82  12.72    20.22  29.28  13.11
         Stgcn    21.82  30.6   15.02    21.15  30.34  4.65     23.48  33.60  15.68
         DastNet  15.42  23.10  10.03    15.86  23.72  10.68    17.69  26.49  12.11
Table 3. Forecasting accuracy metrics on sensitivity analysis.
Impv.     15min                  30min                  60min
          MAE    RMSE   MAPE     MAE    RMSE   MAPE     MAE    RMSE   MAPE
PEMS04    5.4%   4.2%   6.2%     5.1%   4.4%   10%      6.3%   6.5%   6.5%
PEMS07    8.6%   4.5%   16%      2.8%   1.8%   16%      1.7%   1.7%   0.6%
PEMS08    1.1%   0.6%   -        5.0%   5.4%   11.5%    6.0%   6.2%   18.0%

Impv.     15min                  30min                  60min
          MAE    RMSE   MAPE     MAE    RMSE   MAPE     MAE    RMSE   MAPE
PEMS04    7.0%   6.1%   9.6%     7.0%   7.2%   6.2%     11.8%  11.9%  9.8%
PEMS07    9.5%   6.5%   18.4%    7.0%   6.3%   10.1%    5.9%   6.1%   7.3%
PEMS08    4.6%   5.2%   2.9%     7.7%   7.5%   13.4%    5.9%   6.0%   20.8%
Table 4. Comparison between 1) Gru and the Target Only method (Upper); 2) DastNet and the best baseline method (Lower), where "-" denotes no improvement.

Figure 8. Performance comparisons of different methods.
Figure 9. RMSE under different sizes of training sets.
Figure 10. MAPE under different sizes of training sets.
Figure 11. Visualization of the predicted flow by DastNet and Stgcn.

Appendix B Details for Evaluation Metrics

The metrics compare the predicted values with the ground truth over the set of training samples’ indices. The performance of all methods is evaluated by three metrics commonly used in traffic forecasting: (1) Mean Absolute Error (MAE), which is a fundamental metric that reflects the actual prediction accuracy; (2) Root Mean Squared Error (RMSE), which is more sensitive to abnormal values; and (3) Mean Absolute Percentage Error (MAPE).
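For concreteness, the three metrics can be sketched with their standard textbook definitions. The function names and the toy arrays are illustrative, and skipping zero ground-truth values in MAPE is an assumption made here to avoid division by zero, not something stated in this appendix.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root Mean Squared Error; squaring penalizes large errors more."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent.

    Assumption: entries with zero ground truth are skipped.
    """
    mask = y_true != 0
    ratio = np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])
    return float(100.0 * np.mean(ratio))


# Toy flows for three samples (hypothetical values).
y_true = np.array([10.0, 20.0, 40.0])
y_pred = np.array([12.0, 18.0, 44.0])
print(mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

Because MAPE divides by the ground truth, it grows quickly when flow is low, which is relevant to the low-flow observation made in the conclusion.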

Appendix C Deployment Details in Hong Kong

Figure 12. Traffic flows’ spatial distributions in Hong Kong, January 2022.

We used 614 traffic detectors (a combination of video detectors and automatic licence plate recognition detectors) to collect Hong Kong’s traffic flow data (as shown in Figure 12) for the deployment of our system, and the raw traffic flow data is aggregated to 5-minute intervals. We construct Hong Kong’s road network based on distances between traffic detectors and define the adjacency matrix through connectivity.
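A connectivity-based adjacency matrix of the kind mentioned above can be sketched as follows. The edge list, the function name, and the choice of an undirected binary matrix without self-loops are all assumptions for illustration; the actual construction used for the Hong Kong network is not detailed here.

```python
import numpy as np


def adjacency_from_edges(num_nodes, edges):
    """Build a symmetric binary adjacency matrix from connected detector pairs."""
    adj = np.zeros((num_nodes, num_nodes), dtype=np.int8)
    for i, j in edges:
        adj[i, j] = 1
        adj[j, i] = 1  # assumption: connectivity is treated as undirected
    return adj


# Hypothetical example: four detectors along a single corridor.
edges = [(0, 1), (1, 2), (2, 3)]
adj = adjacency_from_edges(4, edges)
print(adj)
```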

Appendix D Details for Reproducibility

In this section, we provide more details of the experimental settings for reproducibility. We implement our framework based on PyTorch (paszke2019pytorch) on a virtual workstation with two Nvidia GeForce RTX 2080Ti GPUs (11 GB memory each). To suppress noise from the domain classifier at the early stages of the pre-training procedure, instead of fixing the adversarial domain adaptation loss factor, we gradually change it from 0 to 1, with the constant in the schedule set to 10 in all experiments. We select the SGD optimizer, set the maximum number of epochs for the fine-tuning stage to 2000, set K of the Gin encoders to 1, and use 64 as the dimension of the node embedding. For all models, we set the batch size to 64. For node2vec, the number of walks conducted by each source node is set to 200, with a walk length of 8 and an embedding dimension of 64.
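The schedule for the adversarial loss factor is not reproduced in the text above. One widely used schedule that matches the description (rising from 0 to 1, with a constant of 10) is the one from the original domain-adversarial training work of Ganin and Lempitsky; the sketch below assumes that form, and the symbol names `gamma` and `progress` are assumptions rather than the paper's notation.

```python
import math


def adaptation_factor(step, total_steps, gamma=10.0):
    """Loss factor that grows smoothly from 0 towards 1 during training.

    Assumed form: 2 / (1 + exp(-gamma * progress)) - 1, where progress is
    the fraction of training completed.
    """
    progress = step / total_steps
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


# Factor at the start, one tenth, half-way, and the end of training.
schedule = [adaptation_factor(s, 100) for s in (0, 10, 50, 100)]
print([round(v, 4) for v in schedule])
```

With this form the factor is exactly 0 at the first step and rises steeply early on, so the domain classifier has little influence only during the earliest steps of pre-training.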

Model settings:

Methods          Gru    Gcn    Tgcn   Stgcn  Dcrnn  Trans Gru  DastNet
Learning rates   1e-3   1e-4   5e-4   1e-5   1e-5   1e-4       1e-4

MLP settings:

(inout features) (64,64) (64,64) (64,64) (64,64) (128,64)

Pre-training epochs & Training ratio settings:

Datasets                        PEMS04                     PEMS07                     PEMS08
Pre-train epochs                10                         30                         30
Training ratios (1/10/30/all)   2.4% / 24% / 72% / 100%    1.4% / 14% / 42% / 100%    2.3% / 23% / 69% / 100%

Source codes of DastNet are available at