MetaST
MetaST for WWW 2019
Spatial-temporal prediction is a fundamental problem in building smart cities, and it is useful for tasks such as traffic control, taxi dispatching, and environmental policy making. Because data collection mechanisms differ, collected data often have unbalanced spatial distributions. For example, some cities may release taxi data for multiple years while others only release a few days of data; some regions may have constant water quality data monitored by sensors whereas other regions only have a small collection of water samples. In this paper, we tackle the problem of spatial-temporal prediction for cities with only a short period of data collection. We aim to utilize the long-period data from other cities via transfer learning. Different from previous studies that transfer knowledge from one single source city to a target city, we are the first to leverage information from multiple cities to increase the stability of transfer. Specifically, our proposed model is designed as a spatial-temporal network with a meta-learning paradigm. The meta-learning paradigm learns a well-generalized initialization of the spatial-temporal network, which can be effectively adapted to target cities. In addition, a pattern-based spatial-temporal memory is designed to distill long-term temporal information (i.e., periodicity). We conduct extensive experiments on two tasks: traffic (taxi and bike) prediction and water quality prediction. The experiments demonstrate the effectiveness of our proposed model over several competitive baseline models.
Recently, the construction of smart cities has significantly changed urban management and services. One of the most fundamental techniques in building a smart city is accurate spatial-temporal prediction. For example, a traffic prediction system can help the city pre-allocate transportation resources and control traffic signals intelligently. An accurate environment prediction system can help the government develop environmental policy and thereby improve public health.
Traditionally, basic time series models (e.g., ARIMA, Kalman Filtering and their variants) (Shekhar and Williams, 2008; Moreira-Matias et al., 2013; Lippi et al., 2013), regression models with spatial-temporal regularizations (Idé and Sugiyama, 2011; Zheng and Ni, 2013) and external context features (Yi et al., 2018; Wu et al., 2016) are used for spatial-temporal prediction. Recently, advanced machine learning methods (e.g., deep neural network based models) have significantly improved the performance of spatial-temporal prediction (Yao et al., 2018; Yi et al., 2018; Zhang et al., 2017) by characterizing non-linear spatial-temporal correlations more accurately. Unfortunately, the superior performance of these models is conditioned on large-scale training data, which are often inaccessible in real-world applications. For example, there may exist only a few days of GPS traces for traffic prediction in some cities.
Transfer learning has been studied as an effective solution to address the data insufficiency problem, by leveraging knowledge from those cities with abundant data (e.g., GPS traces covering a few years). In (Wei et al., 2016), the authors proposed to transfer semantically-related dictionaries learned from a data-rich city, i.e., a source city, to predict the air quality category in a data-insufficient city, i.e., a target city. The method proposed in (Wang et al., 2018) aligns similar regions across source and target cities for finer-grained transfer. However, these methods transfer knowledge from only a single source city, which may lead to unstable results and a risk of negative transfer. If the underlying data distributions are significantly different between cities, the knowledge transfer will make no contribution or even hurt the performance.
To reduce the risk, in this paper, we focus on transferring knowledge from multiple source cities for the spatial-temporal prediction in a target city. Compared with a single city, the knowledge extracted from multiple cities covers more comprehensive spatial-temporal correlations of a city, e.g., temporal dependency, spatial closeness, and region functionality, and thus increases the stability of transfer. However, this problem faces two key challenges.
How to adapt the knowledge to meet various scenarios of spatial-temporal correlations in a target city? The spatial-temporal correlations in the limited data of a target city may vary considerably from city to city and even from time to time. For example, the knowledge to be transferred to New York is expected to differ from that to Boston. In addition, the knowledge to be transferred within the same city also differs from weekdays to weekends. Thus a sufficiently flexible algorithm capable of adapting the knowledge learned from multiple source cities to various scenarios is required.
How to capture and borrow long-period spatial-temporal patterns from source cities? It is difficult to capture long-term spatial-temporal patterns (e.g., periodicity) accurately in a target city with limited data due to the effects of special events (e.g. holiday) or missing values. These patterns as indicators of region functionality, however, are crucial for spatial-temporal prediction (Yao et al., 2019; Zonoozi et al., 2018). Take the traffic demand prediction as an instance. The traffic demand in residential areas is periodically high in the morning when people transit to work. Thus, it is promising but challenging to transfer such long-term patterns from source cities to a target city.
To address the challenges, we propose a novel framework for spatial-temporal prediction, namely MetaST. It is the first to incorporate the meta-learning paradigm into spatial-temporal networks (ST-net). The ST-net consists of a local CNN and an LSTM which jointly capture spatial-temporal features and correlations. With the meta-learning paradigm, we solve the first challenge by learning a well-generalized initialization of the ST-net from a large number of prediction tasks sampled from multiple source cities, which covers comprehensive spatial-temporal scenarios. Subsequently, the initialization can easily be adapted to a target city via fine-tuning, even when only few training samples are accessible. Second, we learn a global pattern-based spatial-temporal memory from all source cities, and transfer it to a target city to support long-term patterns. The memory, describing and storing long-term spatial-temporal patterns, is jointly trained with the ST-net in an end-to-end manner.
We evaluate the proposed framework on several datasets including taxi volume, bike volume, and water quality. The results show that our proposed MetaST consistently outperforms several baseline models. We summarize our contributions as follows.
To the best of our knowledge, we are the first to study the problem of transferring knowledge from multiple cities for the spatial-temporal prediction in a target city.
We propose a novel MetaST framework to solve the problem by combining a spatial-temporal network with the meta-learning paradigm. Moreover, we learn from all source cities a global memory encoding long-term spatial-temporal patterns, and transfer it to further improve the spatial-temporal prediction in a target city.
We empirically demonstrate the effectiveness of our proposed MetaST framework on three real-world spatial-temporal datasets.
The rest of this paper is organized as follows. We first review and discuss the related work in Section 2. Then, we define some notations and formulate the problem in Section 3. After that, we introduce details of the framework of MetaST in Section 4. We apply our model on three real-world datasets from two different domains and conduct extensive experiments in Section 5. Finally, we conclude our paper in Section 6.
In this section, we briefly review the works in two categories: some representative works for spatial-temporal prediction and knowledge transferring.
The earliest spatial-temporal prediction models are based on basic time series models (e.g., ARIMA, Kalman Filtering) (Shekhar and Williams, 2008; Moreira-Matias et al., 2013; Lippi et al., 2013). Recent studies further utilize external context data (e.g., weather condition, venue types, holiday, event information) (Yi et al., 2018; Wu et al., 2016; Wang et al., 2017b) to enhance the prediction accuracy. Also, spatial correlation is taken into consideration by designing regularization of spatial smoothness (Tong et al., 2017; Xu et al., 2016; Zheng and Ni, 2013).
Recently, various deep learning methods have been used to capture complex non-linear spatial-temporal correlations and predict spatial-temporal series, such as stacked fully connected networks (Wang et al., 2017a; Yi et al., 2018), convolutional neural networks (CNN) (Zhang et al., 2017; Wang et al., 2017c) and recurrent neural networks (RNN) (Yu et al., 2017a). Several hybrid models have been proposed to model both spatial and temporal information (Yao et al., 2018; Ke et al., 2017; Yao et al., 2019). These methods combine CNN and RNN, and achieve the state-of-the-art performance on spatial-temporal prediction. In addition, based on the road network structure, another type of hybrid model combines a graph aggregation mechanism and RNN for spatial-temporal prediction (Zhang et al., 2018; Li et al., 2018; Yu et al., 2017b). Different from previous studies of deep spatial-temporal prediction, which all rely on a large set of training samples, we aim to transfer learned knowledge from source cities to improve the performance of spatial-temporal prediction in a target city with limited data samples.
Transfer learning and its related fields utilize previously learned knowledge to facilitate learning in new tasks when labeled data is scarce (Naik and Mammone, 1992; Pan et al., 2010). Previous transfer learning methods transfer different information from a source domain to a target domain, such as parameters (Yang et al., 2007), instances (Dai et al., 2009), manifold structures (Gong et al., 2012; Gopalan et al., 2011), deep hidden feature representations (Tzeng et al., 2017; Yosinski et al., 2014). Recently, meta-learning (a.k.a., learning to learn) transfers shared knowledge from multiple training tasks to a new task for quick adaptation. These techniques include learning a widely generalizable initialization (Finn et al., 2017; Liu et al., 2018), optimization trace (Ravi and Larochelle, 2016), metric space (Snell et al., 2017).
However, only a few attempts have been made to transfer knowledge across space. (Wei et al., 2016) proposes a multi-modal transfer learning framework for predicting air quality category, which combines multi-modal data by learning a semantically coupled dictionary for multiple modalities in a source city. However, this method works on multi-modal features instead of the spatial-temporal sequences we focus on, and therefore cannot be applied to our problem. For predicting traffic flow, (Wang et al., 2018) leverages the similarity of check-in records/spatial-temporal sequences between a source city and a target city to construct a similarity regularization for knowledge transfer. Different from this method, which relies on a linear similarity measure, our model learns the correlation, which could be linear or non-linear. In addition, compared with both methods above, our model transfers the shared knowledge from multiple cities to a new city, which increases the stability of transfer and prediction.
In this section, we define some notations used in this paper and formally define the setup of our problem.
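Definition 1 (Region) Following previous works (Zhang et al., 2017; Yao et al., 2018), we spatially divide a city $c$ into an $I_c \times J_c$ grid map which contains $I_c$ rows and $J_c$ columns. We treat each grid as a region $r_c$, and define the set of all regions as $\mathcal{R}_c$.
Definition 2 (Spatial-Temporal Series) In city $c$, we denote the current/latest timestamp as $k_c$, and the time range as a set $\mathcal{K}_c = \{k_c - |\mathcal{K}_c| + 1, \ldots, k_c\}$ consisting of $|\mathcal{K}_c|$ evenly split non-overlapping time intervals. Then, the spatial-temporal series in city $c$ is represented as
$$\mathcal{Y}_c = \{\, y^{k}_{r_c} \mid r_c \in \mathcal{R}_c,\ k \in \mathcal{K}_c \,\}, \tag{1}$$
where $y^{k}_{r_c}$ is the spatial-temporal information to be predicted (e.g., traffic demand, air quality, climate value).
Problem Definition Suppose that we have a set of source cities $\mathcal{C}_s = \{c_1, \ldots, c_S\}$ and a target city $c_t$ with insufficient data (i.e., $\forall s \in \{1, \ldots, S\}$, $|\mathcal{K}_{c_s}| \gg |\mathcal{K}_{c_t}|$). Our goal is to predict the spatial-temporal information of the next timestamp $k_{c_t} + 1$ in the testing dataset of the target city $c_t$, i.e.,
$$\hat{y}^{\,k_{c_t}+1}_{r_{c_t}} = \arg\max_{y^{k_{c_t}+1}_{r_{c_t}}} p\big(y^{k_{c_t}+1}_{r_{c_t}} \mid \mathcal{Y}_{c_t}, f_{\theta_0}\big), \tag{2}$$
where $f$ represents the ST-net which serves as the base model to predict the spatial-temporal series. Detailed discussion about ST-net is given in Section 4.1. More importantly, in the meta-learning paradigm, $\theta_0$ denotes the initialization for all parameters of the ST-net, which is adapted from $\mathcal{Y}_{c_1}, \ldots, \mathcal{Y}_{c_S}$. Important notations used in this paper are defined in Table 1.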
Notation | Description | ||
set of source cities | |||
set of target cities | |||
#of non-overlapping time intervals in a target city | |||
|
|||
spatial-temporal memory | |||
base model: the spatial-temporal network | |||
the cluster index of each region | |||
|
|||
|
|||
pattern representation for region at time |
In this section, we elaborate our proposed method MetaST. In particular, we first introduce the ST-net as the base model for spatial-temporal prediction, and then present our proposed spatial-temporal meta-learning framework for knowledge transfer.
Recently, hybrid models combining convolutional neural networks (CNN) (Krizhevsky et al., 2012) and LSTM (Hochreiter and Schmidhuber, 1997) have achieved state-of-the-art performance on spatial-temporal prediction. Thus, following (Yao et al., 2018), we adopt a CNN to capture the spatial dependency between regions, and an LSTM to model the temporal evolution of each region.
In city $c$, for each region $r_c$ at time $k$, we regard the spatial-temporal value of this region and its surrounding neighbors as an $N \times N$ image with $D$ channels, $\mathbf{Y}^{k}_{r_c} \in \mathbb{R}^{N \times N \times D}$, where region $r_c$ is in the center of this image. For instance, when $N=3$, we consider a center cell together with its 8 adjacent grid cells, which form a $3 \times 3$ neighborhood. The number of channels depends on the specific task. For example, in taxi volume prediction, we jointly predict taxi pick-up volume and drop-off volume, so that the number of channels equals two, i.e., $D=2$. Taking $\mathbf{Y}^{k}_{r_c}$ as input $\mathbf{Y}^{k,0}_{r_c}$, a local CNN computes the $l$-th layer progressively:
$$\mathbf{Y}^{k,l}_{r_c} = \mathrm{ReLU}\big(\mathbf{W}^{l} * \mathbf{Y}^{k,l-1}_{r_c} + \mathbf{b}^{l}\big), \tag{3}$$
where $*$ represents the convolution operation, and $\mathbf{W}^{l}$ and $\mathbf{b}^{l}$ are learnable parameters. After $L$ convolutional layers, a flatten layer followed by a fully connected layer is used to infer the spatial representation of region $r_c$ as $\mathbf{s}^{k}_{r_c}$.
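The local neighbourhood that feeds the CNN can be illustrated with a minimal sketch. It assumes the city grid for one time interval is stored as nested lists `grid[row][col] = [channel values]` and that cells outside the city are zero-padded; the function name `local_patch` and the padding choice are illustrative assumptions, not details given in the text.

```python
def local_patch(grid, i, j, n=3, pad=0.0):
    """Return the n x n neighbourhood centred on cell (i, j).

    grid[row][col] is a list of per-channel values (e.g., [pick-up, drop-off])
    for a single time interval. n is assumed to be odd. Cells that fall
    outside the grid are filled with `pad` (an assumption of this sketch).
    """
    half = n // 2
    rows, cols = len(grid), len(grid[0])
    channels = len(grid[0][0])
    patch = []
    for di in range(-half, half + 1):
        patch_row = []
        for dj in range(-half, half + 1):
            r, c = i + di, j + dj
            if 0 <= r < rows and 0 <= c < cols:
                patch_row.append(list(grid[r][c]))
            else:
                patch_row.append([pad] * channels)
        patch.append(patch_row)
    return patch


# Toy 3 x 4 grid with two channels; channel 0 holds a running cell index.
grid = [[[float(r * 4 + c), 0.0] for c in range(4)] for r in range(3)]
centre_patch = local_patch(grid, 1, 1)   # fully inside the grid
corner_patch = local_patch(grid, 0, 0)   # top-left corner, partly padded
```

With `n=3` each patch holds the centre cell and its 8 neighbours, matching the 3*3 example in the text.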
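Then, for predicting $y^{k+1}_{r_c}$, we model the temporal unfolding of region $r_c$ by passing all the spatial representations along the time span $\{k-\tau+1, \ldots, k\}$ through an LSTM, which can be formulated as
$$\begin{aligned}
\mathbf{i}^{k}_{r_c} &= \sigma\big(\mathbf{W}_i[\mathbf{s}^{k}_{r_c}; \mathbf{e}^{k}_{r_c}] + \mathbf{U}_i \mathbf{h}^{k-1}_{r_c} + \mathbf{b}_i\big),\\
\mathbf{f}^{k}_{r_c} &= \sigma\big(\mathbf{W}_f[\mathbf{s}^{k}_{r_c}; \mathbf{e}^{k}_{r_c}] + \mathbf{U}_f \mathbf{h}^{k-1}_{r_c} + \mathbf{b}_f\big),\\
\mathbf{d}^{k}_{r_c} &= \sigma\big(\mathbf{W}_d[\mathbf{s}^{k}_{r_c}; \mathbf{e}^{k}_{r_c}] + \mathbf{U}_d \mathbf{h}^{k-1}_{r_c} + \mathbf{b}_d\big),\\
\hat{\mathbf{c}}^{k}_{r_c} &= \tanh\big(\mathbf{W}_c[\mathbf{s}^{k}_{r_c}; \mathbf{e}^{k}_{r_c}] + \mathbf{U}_c \mathbf{h}^{k-1}_{r_c} + \mathbf{b}_c\big),\\
\mathbf{c}^{k}_{r_c} &= \mathbf{f}^{k}_{r_c} \circ \mathbf{c}^{k-1}_{r_c} + \mathbf{i}^{k}_{r_c} \circ \hat{\mathbf{c}}^{k}_{r_c},\\
\mathbf{h}^{k}_{r_c} &= \tanh(\mathbf{c}^{k}_{r_c}) \circ \mathbf{d}^{k}_{r_c},
\end{aligned} \tag{4}$$
where $\circ$ denotes the Hadamard product, $\mathbf{W}_j$, $\mathbf{U}_j$, and $\mathbf{b}_j$ ($j \in \{i, f, d, c\}$) are learnable parameters, and $\mathbf{f}^{k}_{r_c}$, $\mathbf{i}^{k}_{r_c}$, and $\mathbf{d}^{k}_{r_c}$ are the forget gate vector, input gate vector, and output gate vector of the $k$-th context feature, respectively. $\mathbf{h}^{k}_{r_c}$ denotes the spatial-temporal representation of region $r_c$, and $\tau$ is the number of time steps we use to consider the temporal information. Note that $\mathbf{e}^{k}_{r_c}$ represents other external features (e.g., weather, holiday) that can be incorporated, if applicable. In this way, $\mathbf{h}^{k}_{r_c}$ encodes both the spatial and temporal information of region $r_c$. As a result, the value of the next time step of the spatial-temporal series, i.e., $\hat{y}^{k+1}_{r_c}$, can be predicted by
$$\hat{y}^{k+1}_{r_c} = \tanh\big(\mathbf{W}_o \mathbf{h}^{k}_{r_c} + \mathbf{b}_o\big), \tag{5}$$
where $\mathbf{W}_o$ and $\mathbf{b}_o$ are learnable parameters. The output is scaled to (-1,1) via a $\tanh$ function, to be consistent with the normalized spatial-temporal values. We later denormalize the prediction to get the actual demand values. We formulate the loss function of ST-net for each city $c$ as:
$$\mathcal{L}_{c} = \sum_{r_c} \sum_{k} \big(\hat{y}^{k+1}_{r_c} - y^{k+1}_{r_c}\big)^2, \tag{6}$$
so as to enforce the estimated spatial-temporal value $\hat{y}^{k+1}_{r_c}$ to be as close to the ground-truth $y^{k+1}_{r_c}$ as possible. As introduced previously, we denote all the parameters of the ST-net as $\theta$, and the ST-net parameterized by $\theta$ as $f_{\theta}$. For better illustration, we visualize the structure of the spatial-temporal network (ST-net) in Figure 1.
Next, we propose a meta-learning framework that enables our ST-net model to borrow knowledge from multiple cities. The framework consists of two parts: adapting the initialization and learning the spatial-temporal memory. We present the whole framework in Figure 2.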
Figure 2: The framework of the proposed MetaST. ST-net and ST-mem denote the spatial-temporal network and the spatial-temporal memory. S-train and S-test, T-train and T-test represent the training and testing sets of source tasks (i.e., source cities) and target tasks (i.e., target cities), respectively. In the process of knowledge transfer, the parameters of ST-net are updated with the training set of the target city (i.e., T-train), while the parameters of ST-mem are fixed.
As described before, we aim to increase the stability of prediction by transferring knowledge from multiple source cities. However, the spatial-temporal correlations in the limited data of a target city may vary considerably from city to city and even from time to time. For example, in traffic prediction, the knowledge to be transferred to Boston with limited weekend data is expected to differ from that to Chicago with limited weekday data. Thus, the knowledge extracted from multiple cities is expected to include comprehensive spatial-temporal correlations, such as spatial closeness and temporal dependency, so that we can adapt the knowledge to a target city with limited data under different scenarios.
In ST-net, the parameters $\theta$ are exactly the knowledge which encodes spatial-temporal correlations. To effectively adapt the parameters to a target city, as suggested in model-agnostic meta-learning (MAML) (Finn et al., 2017), we learn an initialization of $\theta$ from multiple source cities, i.e., $\theta_0$, so that the ST-net initialized by $\theta_0$ achieves the minimum of the average of generalization losses over all source cities, i.e.,
$$\theta_0 = \arg\min_{\theta_0} \sum_{c_s \in \mathcal{C}_s} \mathcal{L}'_{c_s}\big(f_{\theta_0 - \alpha \nabla_{\theta} \mathcal{L}_{c_s}(f_{\theta})}\big). \tag{7}$$
Here, $\mathcal{L}_{c_s}(f_{\theta})$ denotes the training loss on the training set of a city $c_s$ sampled from $\mathcal{C}_s$, i.e., $\mathcal{D}^{tr}_{c_s}$ (refer to S-train in Figure 2). We illustrate the iterative update of the parameters $\theta_{c_s}$ during the training process (shown as the yellow solid arrow in Figure 2) by showing one exemplar gradient descent step, i.e.,
$$\theta_{c_s} = \theta_0 - \alpha \nabla_{\theta} \mathcal{L}_{c_s}(f_{\theta}). \tag{8}$$
In practice, we can use several steps of gradient descent to update from the initialization $\theta_0$ to $\theta_{c_s}$. For each city $c_s$, the training process is repeated on batches of series sampled from $\mathcal{D}^{tr}_{c_s}$.
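The inner update of Eq. (8) and the outer objective of Eq. (7) can be sketched on a toy problem. The sketch below is not the ST-net: it assumes a hypothetical scalar model `y = w * x` with a mean-squared-error loss so that gradients can be written by hand, and it uses the first-order approximation of the MAML meta-gradient (second-order terms are dropped), which is a simplification of the full method. The names `inner_adapt`, `fomaml_step`, `alpha`, and `beta` are illustrative.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def inner_adapt(w0, xs, ys, alpha=0.1, steps=1):
    """Task-specific update: a few gradient steps from the initialization w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * mse_grad(w, xs, ys)
    return w


def fomaml_step(w0, tasks, alpha=0.1, beta=0.05, steps=1):
    """One first-order meta-update of the initialization.

    tasks: list of (train_xs, train_ys, test_xs, test_ys), one per source city.
    The gradient of the test loss is taken at the adapted weight and applied
    directly to w0 (first-order approximation; an assumption of this sketch).
    """
    meta_grad = 0.0
    for tr_x, tr_y, te_x, te_y in tasks:
        w_task = inner_adapt(w0, tr_x, tr_y, alpha, steps)
        meta_grad += mse_grad(w_task, te_x, te_y)
    return w0 - beta * meta_grad / len(tasks)


# Two toy "cities" whose true weights are 2 and 3.
xs = [1.0, 2.0]
tasks = [
    (xs, [2.0, 4.0], xs, [2.0, 4.0]),
    (xs, [3.0, 6.0], xs, [3.0, 6.0]),
]
w_init = 0.0
for _ in range(100):
    w_init = fomaml_step(w_init, tasks)
```

In this toy setting the learned initialization settles between the two task optima, so a single inner step moves it close to either task.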
Since Eq. (7) minimizes the generalization loss, $\mathcal{L}'_{c_s}$ evaluates the loss on the test set of the city $c_s$, i.e., $\mathcal{D}^{te}_{c_s}$ (refer to S-test in Figure 2). By optimizing the problem in Eq. (7) using stochastic gradient descent (shown as the purple solid arrow in Figure 2), we obtain an initialization $\theta_0$ which can generalize well on different source cities. Therefore, transferring the initialization to a target city is expected to bring good generalization performance as well, which we will detail at the end of this section.
In spatial-temporal prediction, long-term spatial-temporal patterns (e.g., periodic patterns) play an important role (Zonoozi et al., 2018; Yao et al., 2019). These patterns reflect the spatial functionality of each region and can be regarded as a global property shared by different cities. An example of spatial-temporal patterns and their corresponding regions is shown in Figure 3.
In this figure, one region around NYU-Tandon in New York City and another one around Georgetown University in Washington DC are selected. The 5-day averaged taxi demand distributions of both regions are daily periodic and similar, with higher values in the afternoon when students and faculty go home. The similarity of distributions between different cities supports our assumption, i.e., spatial functionality is globally shared. However, in a target city, these patterns are hard to capture with limited data due to missing values or the effects of special events (e.g., holidays). Thus, we propose a global memory, named ST-mem, to store long-term spatial-temporal patterns and transfer them to improve prediction accuracy in target cities. The framework of ST-mem is illustrated in Figure 4.
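Based on spatial-temporal patterns, we first cluster all the regions in source cities into $G$ categories. The categories of regions represent different region functionalities. For region $r_c$, the clustering results are denoted as $\mathbf{o}_{r_c} \in \mathbb{R}^{G}$. If region $r_c$ belongs to cluster $g$, $\mathbf{o}_{r_c}(g) = 1$, otherwise $\mathbf{o}_{r_c}(g) = 0$. Then, we construct a parameterized spatial-temporal memory $\mathcal{M} \in \mathbb{R}^{G \times d}$. Each row of the memory stores the pattern representation of a category, and the dimension of the pattern representation is $d$.
Next, we utilize the knowledge of patterns stored in memory $\mathcal{M}$, and distill this knowledge for prediction via an attention mechanism (Luong et al., 2015). Since we only have short-term data in a target city, we use the representation of short-term data to query ST-mem. In particular, we linearly project the spatial-temporal representation $\mathbf{h}^{k}_{r_c}$ of ST-net to get the query vector for the attention mechanism, which is formulated as:
$$\mathbf{v}^{k}_{r_c} = \mathbf{W}_v \mathbf{h}^{k}_{r_c} + \mathbf{b}_v. \tag{9}$$
Then, we use the query vector $\mathbf{v}^{k}_{r_c}$ to calculate the similarity score between it and the pattern representation for each category $g$. Formally, the similarity score is defined as
$$p^{k}_{r_c}(g) = \frac{\exp\big(\langle \mathbf{v}^{k}_{r_c}, \mathcal{M}(g) \rangle\big)}{\sum_{g'=1}^{G} \exp\big(\langle \mathbf{v}^{k}_{r_c}, \mathcal{M}(g') \rangle\big)}, \tag{10}$$
where $\mathcal{M}(g)$ means the $g$-th row of memory (i.e., for the $g$-th pattern category). We calculate the representation of the spatial-temporal pattern as:
$$\mathbf{z}^{k}_{r_c} = \sum_{g=1}^{G} p^{k}_{r_c}(g)\, \mathcal{M}(g). \tag{11}$$
Then, we concatenate the representation of spatial-temporal patterns $\mathbf{z}^{k}_{r_c}$ with the representation $\mathbf{h}^{k}_{r_c}$ of ST-net. The input of the final layer in Eq. (5) is replaced by this enhanced representation, i.e.,
$$\hat{y}^{k+1}_{r_c} = \tanh\big(\mathbf{W}'_o [\mathbf{h}^{k}_{r_c}; \mathbf{z}^{k}_{r_c}] + \mathbf{b}'_o\big), \tag{12}$$
where $\mathbf{W}'_o$ and $\mathbf{b}'_o$ are learnable parameters.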
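The memory read described above (project the representation to a query, score it against each memory row, take the attention-weighted sum, and concatenate) can be sketched in plain Python. This is a minimal sketch assuming dot-product similarity followed by a softmax; the names `read_memory`, `W_q`, and `b_q` are illustrative, and the toy dimensions are not the ones used in the experiments.

```python
import math


def read_memory(h, memory, W_q, b_q):
    """Query a pattern memory with a spatial-temporal representation.

    h      : representation of one region (list of floats)
    memory : one row per pattern category, each of length d
    W_q    : d x len(h) projection matrix; b_q : bias of length d
    Returns (attention scores, pattern representation, enhanced representation).
    """
    # Linear projection of h to the query vector.
    query = [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W_q, b_q)]
    # Dot-product similarity with every memory row, then a stable softmax.
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    # Attention-weighted sum of memory rows.
    d = len(memory[0])
    pattern = [sum(a * row[j] for a, row in zip(attention, memory)) for j in range(d)]
    # Concatenate with the original representation for the final layer.
    return attention, pattern, list(h) + pattern


memory = [[1.0, 0.0], [0.0, 1.0]]             # two categories, d = 2
W_q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # projects a 3-dim h to d = 2
b_q = [0.0, 0.0]
attention, pattern, enhanced = read_memory([2.0, 0.0, 0.0], memory, W_q, b_q)
```

Here the query aligns with the first memory row, so most of the attention mass falls on the first category and the enhanced vector has length 3 + 2.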
In order to learn the memory $\mathcal{M}$, we construct the clustering loss $\mathcal{L}^{clu}_{c}$ of city $c$, which enforces the attention scores to be consistent with the previous clustering results. The formulation of the clustering loss is as follows:
(13) |
Additionally, the memory $\mathcal{M}$ is also updated when we train the MetaST framework, together with the initialization $\theta_0$. Thus, we revise the loss function in Eq. (7) by adding the clustering loss of Eq. (13). Then, our final objective function is:
$$\theta_0, \mathcal{M} = \arg\min_{\theta_0, \mathcal{M}} \sum_{c_s \in \mathcal{C}_s} \Big[ \mathcal{L}'_{c_s}\big(f_{\theta_{c_s}}\big) + \gamma\, \mathcal{L}^{clu}_{c_s} \Big], \tag{14}$$
where $\gamma$ is a trade-off hyperparameter used to balance the effect of each part, and $\theta_{c_s}$ is defined in Eq. (8). Using the testing set of each city $c_s$, the memory $\mathcal{M}$ and the initialization $\theta_0$ are updated by gradient descent. Note that, as discussed before, the spatial-temporal patterns are a common property between cities, so we do not update the memory $\mathcal{M}$ when training a specific task (i.e., Eq. (8)). In Figure 2, the purple solid arrow indicates the process of meta-update. In Eq. (14), the initialization $\theta_0$ and the memory $\mathcal{M}$ are mutually constrained. Since the memory provides region-level relationships via spatial-temporal patterns, it can also help learn the initialization of ST-net.
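How the two loss terms combine can be sketched as follows. The exact form of the clustering loss is not recoverable from the text here, so the sketch assumes a squared-error penalty between attention scores and one-hot cluster indicators; a cross-entropy penalty would be another plausible choice. The names `clustering_loss`, `meta_objective`, and `gamma` are illustrative.

```python
def clustering_loss(attention, one_hot):
    """Assumed squared-error gap between attention scores and cluster labels."""
    return sum((a - o) ** 2 for a, o in zip(attention, one_hot))


def meta_objective(test_losses, attentions, cluster_labels, gamma):
    """Sum of per-city test losses plus gamma times the clustering loss.

    test_losses    : prediction losses measured on each source city's test set
    attentions     : attention-score vectors, one per region sample
    cluster_labels : matching one-hot cluster indicators
    """
    clu = sum(clustering_loss(a, o) for a, o in zip(attentions, cluster_labels))
    return sum(test_losses) + gamma * clu


# Two source cities and a single region sample whose attention is undecided.
objective = meta_objective(
    test_losses=[1.0, 2.0],
    attentions=[[0.5, 0.5]],
    cluster_labels=[[1, 0]],
    gamma=0.1,
)
```

A larger `gamma` pushes the attention scores toward the precomputed clusters at the cost of weighting the prediction loss relatively less.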
To improve the prediction in target cities, we transfer the ST-mem and the initialization of ST-net. The black dotted line in Figure 2 shows the process of knowledge transfer. For each new city $c_t$ sampled from $\mathcal{C}_t$, the memory $\mathcal{M}$ is fixed and the parameters $\theta_{c_t}$ are trained with the initialization $\theta_0$ and the training samples $\mathcal{D}^{tr}_{c_t}$ (i.e., T-train in Figure 2), which is defined as:
$$\theta_{c_t} = \theta_0 - \alpha \nabla_{\theta} \mathcal{L}_{c_t}(f_{\theta}), \tag{15}$$
where $\mathcal{L}_{c_t}$ is the loss function on the training set of target city $c_t$, formulated as:
$$\mathcal{L}_{c_t} = \sum_{r_{c_t}} \sum_{k} \big(\hat{y}^{k+1}_{r_{c_t}} - y^{k+1}_{r_{c_t}}\big)^2. \tag{16}$$
The predicted value $\hat{y}^{k+1}_{r_{c_t}}$ is calculated via Eq. (12). Hence, we distill knowledge of spatial-temporal patterns from source cities to the target city via ST-mem $\mathcal{M}$. Finally, we evaluate the model on the testing set (i.e., T-test in Figure 2) of city $c_t$ and obtain the prediction values for this city. The whole framework of MetaST is shown in Algorithm 1.
In this section, we conduct extensive experiments for two domain applications to answer the following research questions:
Can MetaST outperform baseline methods for inference tasks, i.e., traffic volume prediction and water quality (pH value) prediction?
How do various hyper-parameters, e.g., the dimensions of each cluster in ST-mem or the trade-off factor, impact the model's performance?
Can ST-mem detect distinct spatial-temporal patterns?
We first conduct experiments on two traffic prediction tasks, i.e., taxi volume prediction and bike volume prediction. Similar to the previous traffic prediction tasks in (Yao et al., 2019; Zhang et al., 2017), each individual trip departs from a region and then arrives at a destination region. Thus, our task is to predict the pick-up (start) and drop-off (end) volume of taxis (and bikes) at each time interval for each region. The time interval of the traffic prediction task is one hour. We use Root Mean Square Error (RMSE) to evaluate our model, which is defined as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \big(\hat{y}_i - y_i\big)^2}, \tag{17}$$
where $\hat{y}_i$ and $y_i$ represent the predicted value and the actual value, respectively, and $n$ is the number of all samples in the testing set.
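For reference, the RMSE metric can be computed with a few lines of Python. This is a generic sketch of the standard definition; the function name `rmse` is an illustrative choice.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between two equally long sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
off = rmse([0.0, 0.0], [3.0, 4.0])
```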
For taxi volume prediction, we collect five representative mobility datasets from five different cities to evaluate the performance of our proposed model, i.e., New York City (NYC, http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml), Washington DC (DC), Chicago (CHI), Porto (https://www.kaggle.com/c/pkdd-15-predict-taxi-service-trajectory-i/data), and Boston (BOS). We use NYC, DC, and Porto as the source cities and CHI and BOS as the target cities. Note that the Boston data does not have drop-off records, and thus we only report the results on predicting pick-up volume.
For bike volume prediction, we collect three datasets from three cities, i.e., NYC (https://www.citibikenyc.com/system-data), DC (https://www.capitalbikeshare.com/system-data), and CHI (https://www.divvybikes.com/system-data). NYC and DC are used as source cities, and CHI is used as the target city. All trips in the above datasets contain the time and geographic coordinates of pick-up and drop-off. For each city above, as discussed before, we spatially divide it into a grid map. The grid map sizes of NYC, DC, CHI, BOS, and Porto are 10×20, 16×16, 15×18, 18×15, and 20×10, respectively. Detailed statistics of these datasets are listed in Table 2.
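Turning trip records into per-region hourly volumes can be sketched as below. The sketch assumes each trip has already been reduced to integer hour indices plus latitude/longitude pairs, and that the city is covered by a simple axis-aligned bounding box split into equal cells; the tuple layout, `bounds`, and the function names are assumptions of this illustration rather than the preprocessing actually used.

```python
def cell_of(lat, lon, bounds, shape):
    """Map a coordinate to a (row, col) grid cell, or None if outside bounds."""
    lat_min, lat_max, lon_min, lon_max = bounds
    rows, cols = shape
    if not (lat_min <= lat < lat_max and lon_min <= lon < lon_max):
        return None
    i = int((lat - lat_min) / (lat_max - lat_min) * rows)
    j = int((lon - lon_min) / (lon_max - lon_min) * cols)
    return i, j


def trips_to_volume(trips, bounds, shape, n_hours):
    """Count pick-ups (channel 0) and drop-offs (channel 1) per hour and cell.

    trips: iterable of (pickup_hour, pickup_lat, pickup_lon,
                        dropoff_hour, dropoff_lat, dropoff_lon).
    Returns volume[hour][row][col] = [pick-ups, drop-offs].
    """
    rows, cols = shape
    volume = [[[[0, 0] for _ in range(cols)] for _ in range(rows)]
              for _ in range(n_hours)]
    for p_hour, p_lat, p_lon, d_hour, d_lat, d_lon in trips:
        events = ((p_hour, p_lat, p_lon), (d_hour, d_lat, d_lon))
        for channel, (hour, lat, lon) in enumerate(events):
            cell = cell_of(lat, lon, bounds, shape)
            if cell is not None and 0 <= hour < n_hours:
                i, j = cell
                volume[hour][i][j][channel] += 1
    return volume


toy_trips = [
    (0, 0.1, 0.1, 1, 0.9, 0.9),
    (0, 0.2, 0.3, 0, 0.6, 0.1),
]
volume = trips_to_volume(toy_trips, bounds=(0.0, 1.0, 0.0, 1.0), shape=(2, 2), n_hours=2)
```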
In addition, for each source city, we select 80% data for training and validation, and the rest for testing. For each target city, we select the 1-day, 3-day and 7-day data for training, and the rest for testing.
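The data split can be sketched as follows, assuming hourly intervals (24 per day) and a chronological split in which the earliest part of each series is used for training. The text does not state which days are selected for the target city, so taking the first days is an assumption, as are the function names.

```python
def split_source(series, train_fraction=0.8):
    """Chronological split for a source city (training+validation, testing)."""
    cut = int(round(len(series) * train_fraction))
    return series[:cut], series[cut:]


def split_target(series, train_days, intervals_per_day=24):
    """Use the first `train_days` days of a target city for training."""
    cut = train_days * intervals_per_day
    if cut >= len(series):
        raise ValueError("not enough data left for testing")
    return series[:cut], series[cut:]


hourly = list(range(240))                      # ten days of hourly values
src_train, src_test = split_source(hourly)
tgt_train, tgt_test = split_target(hourly, train_days=3)
```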
Data | City | Time Span (m/d/y) | Trips | Size |
Taxi | NYC | 1/1/15-7/1/15 | 6,748,857 | 10×20 |
Taxi | DC | 5/1/15-1/1/16 | 8,151,077 | 16×16 |
Taxi | Porto | 7/1/13-6/30/14 | 1,710,671 | 20×10 |
Taxi | CHI | 9/1/13-11/1/14 | 124,820 | 15×18 |
Taxi | BOS | 10/1/12-10/31/12 | 839,897 | 18×15 |
Bike | NYC | 1/1/17-12/31/17 | 16,364,475 | 10×20 |
Bike | DC | 1/1/17-12/31/17 | 3,732,263 | 16×16 |
Bike | CHI | 1/1/17-2/1/17 | 106,165 | 15×18 |
We compare our model with the following two categories of methods: non-transfer methods and transfer methods. Note that, for non-transfer baselines, we only use the training data of target cities (limited training data) to train the model and use the testing data to evaluate it. For transfer baselines, we transfer the learned knowledge from source cities to improve the prediction accuracy.
Non-transfer baselines:
HA: For each region, historical average (HA) predicts spatial-temporal value based on the average value of the previous relative time. For example, if we want to predict the pick-up volume at 9:00am-10:00am, HA is the average value of all time intervals from 9:00am to 10:00am in training data.
ARIMA: Auto-Regressive Integrated Moving Average (ARIMA) is a traditional time-series prediction model, which considers the temporal relationship of data.
ST-net: This method simply uses the spatial-temporal neural network defined in Section 4.1 to predict traffic volume. Both pick-up and drop-off volumes are predicted together.
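The HA baseline above can be sketched for a single region as follows. The sketch assumes the training series is hourly and starts at the first interval of a day, so the index modulo 24 identifies the time slot; the function names are illustrative.

```python
def historical_average(train, intervals_per_day=24):
    """Average the training values observed in each time-of-day slot."""
    sums = [0.0] * intervals_per_day
    counts = [0] * intervals_per_day
    for t, value in enumerate(train):
        slot = t % intervals_per_day
        sums[slot] += value
        counts[slot] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def predict_ha(profile, t):
    """Predict interval t with the average of the same slot in training data."""
    return profile[t % len(profile)]


# Two training days: the second day is uniformly 2 units busier.
train = [float(h) for h in range(24)] + [float(h + 2) for h in range(24)]
profile = historical_average(train)
nine_am = predict_ha(profile, 48 + 9)   # 9:00am-10:00am on the third day
```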
Transfer baselines:
Fine-tuning Methods: We include two types of fine-tuning methods: (1) Single-source fine-tuning (Single-FT): train ST-net in one source city (e.g., NYC, DC or Porto in taxi data) and fine-tune the model for target cities; and (2) multi-source fine-tune (Multi-FT): mix all samples from all source cities and fine-tune the model in target cities.
RegionTrans (Wang et al., 2018): RegionTrans transfers knowledge from one city to another city for traffic flow prediction. Since we do not have auxiliary data, we compare the S-Match of RegionTrans. For each region in target city, RegionTrans uses short period data to calculate the linear similarity value with each region in source city. Then, the similarity is used as regularization for fine-tuning. For fair comparison, the base model of RegionTrans (i.e., the ST-net) is same as MetaST.
MAML (Finn et al., 2017): a state-of-the-art meta-learning method, which learns a better initialization from multiple tasks to supervise the target task. MAML uses the same base model (i.e., the ST-net) as MetaST.
Hyperparameter Setting. All hyperparameters are set based on the performance on validation set. For ST-net, we set the number of filters in CNN as 64, the size of map in CNN as , the number of steps in LSTM as 8, and the dimension of hidden state of LSTM as 128. In the training process of taxi volume prediction, we set the learning rate of inner loop and outer loop as and respectively. For bike volume prediction, we set the learning rate of inner loop and outer loop as and respectively. The parameter is set as . The number of updates for each task is set as 5. All the models are trained by Adam. The training batch size for each meta-iteration is set as 128, and the maximum of iteration of meta-learning is set as 20000.
Spatial-temporal Clustering. Since the pattern of traffic volume usually repeats every day, we use the averaged 24-hour pattern of each region to represent its spatial-temporal pattern. We use K-means to cluster these patterns into 4 groups. Furthermore, we do not use other external features in this work, which means that the external-feature term in Eq. (4) is omitted. We set the size of the pattern representation in memory as 8.
Taxi Data | Source | CHI Pick-up 1-day | CHI Pick-up 3-day | CHI Pick-up 7-day | CHI Drop-off 1-day | CHI Drop-off 3-day | CHI Drop-off 7-day | BOS Pick-up 1-day | BOS Pick-up 3-day | BOS Pick-up 7-day |
HA | - | 2.83 | 2.36 | 2.18 | 2.67 | 2.28 | 2.13 | 11.07 | 9.13 | 7.71 |
ARIMA | - | 3.19 | 2.76 | 2.71 | 2.93 | 2.43 | 2.41 | 11.02 | 10.25 | 9.36 |
ST-net | - | 10.51 | 6.04 | 3.89 | 11.22 | 6.42 | 3.99 | 30.01 | 17.02 | 13.28 |
Single-FT | NYC | 2.72 | 2.06 | 1.76 | 2.84 | 2.75 | 1.89 | 12.86 | 9.50 | 8.11 |
Single-FT | DC | 3.90 | 2.62 | 2.05 | 4.17 | 2.19 | 2.15 | 15.88 | 10.70 | 10.16 |
Single-FT | Porto | 2.57 | 1.87 | 1.60 | 2.87 | 2.03 | 1.74 | 12.91 | 8.54 | 8.08 |
Multi-FT | - | 2.18 | 1.89 | 1.60 | 2.20 | 2.08 | 1.69 | 8.50 | 8.22 | 8.01 |
RegionTrans | NYC | 2.53 | 2.01 | 1.69 | 2.83 | 2.56 | 1.72 | 11.98 | 9.46 | 7.95 |
RegionTrans | DC | 3.87 | 2.51 | 2.04 | 3.95 | 2.16 | 2.03 | 14.76 | 9.23 | 10.12 |
RegionTrans | Porto | 2.45 | 1.83 | 1.60 | 2.85 | 1.98 | 1.73 | 8.43 | 8.09 | 8.07 |
MAML | - | 2.01 | 1.78 | 1.52 | 2.10 | 1.92 | 1.66 | 8.18 | 7.60 | 7.25 |
MetaST | - | 1.95** | 1.70** | 1.48** | 2.04** | 1.79** | 1.65* | 7.81** | 6.97** | 6.58** |
Relative Improvement | - | 3.0% | 4.5% | 2.6% | 2.9% | 6.8% | 0.6% | 4.5% | 8.3% | 9.2% |
** (*) means the result is significant according to Student’s T-test at level 0.01 (0.05) compared to MAML
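The spatial-temporal clustering step used in the experimental setup (averaged 24-hour patterns grouped by K-means) can be sketched in plain Python. The sketch uses a small Lloyd-style K-means with a deterministic initialization from the first `k` points, which is a simplification chosen for reproducibility; the actual initialization and implementation are not stated in the text, and all function names are illustrative.

```python
def daily_profile(series, intervals_per_day=24):
    """Average an hourly series over whole days to get one 24-value pattern."""
    days = len(series) // intervals_per_day
    if days == 0:
        raise ValueError("need at least one full day of data")
    return [
        sum(series[d * intervals_per_day + h] for d in range(days)) / days
        for h in range(intervals_per_day)
    ]


def kmeans(points, k, iters=20):
    """Tiny K-means; centres start at the first k points (an assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda g: sum((a - b) ** 2 for a, b in zip(p, centers[g])))
            for p in points
        ]
        for g in range(k):
            members = [p for p, label in zip(points, labels) if label == g]
            if members:
                centers[g] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def one_hot(label, k):
    """Cluster indicator vector for one region."""
    return [1 if g == label else 0 for g in range(k)]


flat_profile = daily_profile([1.0] * 24 + [3.0] * 24)
toy_points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
labels, centers = kmeans(toy_points, k=2)
```

In the paper's setting each point would be a region's 24-value profile and `k` would be 4; the resulting labels give the cluster indicators that the memory's attention scores are trained to match.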
Bike Data | Source | CHI Pick-up 1-day | CHI Pick-up 3-day | CHI Pick-up 7-day | CHI Drop-off 1-day | CHI Drop-off 3-day | CHI Drop-off 7-day |
HA | - | 4.97 | 3.69 | 3.64 | 4.96 | 3.67 | 3.63 |
ARIMA | - | 4.86 | 4.89 | 4.79 | 4.83 | 4.97 | 4.86 |
ST-net | - | 7.61 | 5.57 | 3.83 | 8.03 | 5.45 | 3.51 |
SFT | NYC | 2.76 | 2.15 | 1.67 | 2.58 | 2.03 | 1.52 |
SFT | DC | 2.03 | 1.76 | 1.37 | 1.93 | 1.80 | 1.37 |
Multi-FT | - | 2.29 | 1.77 | 1.35 | 2.15 | 1.73 | 1.35 |
RT | NYC | 2.53 | 2.00 | 1.59 | 2.56 | 1.90 | 1.46 |
RT | DC | 1.98 | 1.81 | 1.34 | 1.90 | 1.81 | 1.35 |
MAML | - | 1.79 | 1.68 | 1.27 | 1.84 | 1.73 | 1.26 |
MetaST | - | 1.76** | 1.61** | 1.15** | 1.72** | 1.58** | 1.14** |
RI | - | 1.7% | 4.2% | 9.4% | 6.5% | 8.7% | 9.5% |
Due to space limitations, in this table, SFT, RT and RI denote Single-FT, RegionTrans, and Relative Improvement, respectively.
** (*) means the result is significant according to Student’s T-test at level 0.01 (0.05) compared to MAML.
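The relative improvement row can be checked with a short calculation. The sketch below assumes the improvement is defined as the RMSE reduction of MetaST relative to MAML, expressed as a percentage; the text does not state this formula, but it reproduces the reported values for the Chicago bike data. The function name is ours.

```python
def relative_improvement(baseline_rmse, model_rmse):
    """Percentage RMSE reduction of the model relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# RMSE values copied from the Chicago bike table:
# pick-up (1-day, 3-day, 7-day) followed by drop-off (1-day, 3-day, 7-day).
maml_rmse = [1.79, 1.68, 1.27, 1.84, 1.73, 1.26]
metast_rmse = [1.76, 1.61, 1.15, 1.72, 1.58, 1.14]

improvements = [
    round(relative_improvement(b, m), 1) for b, m in zip(maml_rmse, metast_rmse)
]
print(improvements)  # [1.7, 4.2, 9.4, 6.5, 8.7, 9.5]
```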
We implement our model and compare it with the baselines on the taxi and bike-sharing datasets. We run each test 20 times and report the average values. The results are shown in Table 3 and Table 4, respectively. From these tables, we draw the following conclusions.
Compared with ST-net and some Single-FT models, traditional time-series prediction methods (i.e., HA and ARIMA) achieve competitive performance in some cases (e.g., with 1-day training data). The reason is that traffic data show a strong daily periodicity, so even with limited traffic data we can still use periodicity to predict traffic volume.
Compared with ST-net, all transfer learning models (i.e., the fine-tuning models Single-FT and Multi-FT, MAML, RegionTrans, and MetaST) significantly improve the performance (i.e., lower the RMSE values). The results indicate that (1) it is difficult to train a model from random initialization with limited data; (2) knowledge transfer between cities is effective for prediction.
In most cases, Multi-FT outperforms Single-FT. One possible reason is that Multi-FT increases the diversity of the source domain. In other cases (e.g., Chicago pick-up prediction with 3-day training data), the best Single-FT model outperforms Multi-FT. This indicates that if there exists a source city that is optimally transferable to the target city, simply mixing in other cities may hurt the performance.
RegionTrans models only slightly outperform their corresponding fine-tuning models (e.g., RegionTrans from NYC to Chicago vs. Single-FT from NYC to Chicago). The results suggest that regional linear similarity regularization may not capture the complex relationships between regions. In addition, since the data from different cities are collected at different times, regional similarity calculations may be inaccurate. Thus, regional similarity regularization is not an effective and flexible way to transfer knowledge.
MAML and MetaST achieve better performance than fine-tuning methods and RegionTrans. This is because fine-tuning methods and RegionTrans cannot be adapted to different scenarios, which decreases the stability of transfer. In contrast, MAML and MetaST not only learn the initialization based on multiple cities, but also achieve the best performance in every scenario sampled from these cities.
Our proposed MetaST achieves the best performance in all experimental settings. Compared with MAML, the average relative improvement is 5.5%. This is because our model can also capture and transfer long-term information, in addition to learning a better initialization. Moreover, the long-term pattern memory helps learn a further enhanced initialization, which further increases the stability of knowledge transfer.
Finally, we compare the results under different training data sizes in the target city (i.e., 1-day, 3-day, and 7-day data). For every learning model (except HA and ARIMA), the performance improves with more training data. Our proposed MetaST outperforms all baselines at every data size.
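To make the idea of a learned initialization in the conclusions above more concrete, the following is a minimal sketch of meta-learning an initialization from several source tasks. It is not the MetaST implementation: it uses a first-order approximation of MAML, a toy scalar model y = w * x in place of the spatial-temporal network, and hypothetical slopes standing in for source cities. All names and values are illustrative assumptions.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error for the scalar model y = w * x."""
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def first_order_maml(tasks, inner_lr=0.1, outer_lr=0.1, meta_steps=200):
    """Learn an initialization w0 that adapts well after one gradient step.

    First-order approximation: the query-set gradient at the adapted
    parameter is applied directly to w0 (second-order terms are ignored).
    """
    w0 = 0.0
    for _ in range(meta_steps):
        meta_grad = 0.0
        for support, query in tasks:
            # Inner loop: one adaptation step on the task's support set.
            w_task = w0 - inner_lr * mse_grad(w0, *support)
            # Outer loop: evaluate the adapted parameter on the query set.
            meta_grad += mse_grad(w_task, *query)
        w0 -= outer_lr * meta_grad / len(tasks)
    return w0


def make_task(slope):
    """Build a (support, query) pair for a hypothetical source task."""
    support_x, query_x = [-1.0, 0.5], [-0.5, 1.0]
    support = (support_x, [slope * x for x in support_x])
    query = (query_x, [slope * x for x in query_x])
    return support, query


# Three hypothetical source tasks with slightly different slopes.
source_tasks = [make_task(slope) for slope in (1.8, 2.0, 2.2)]
w_init = first_order_maml(source_tasks)
print(round(w_init, 4))
```

In this toy setting the learned initialization settles near the center of the source slopes, so a new task with a similar slope needs only a small adaptation step, whereas a model started from zero would need many.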
MetaST involves a number of hyperparameters. In this subsection, we evaluate how different hyperparameter selections affect our model's performance. Specifically, we analyze the impact of two key parameters of the spatial-temporal memory, i.e., the dimension of pattern representation and the trade-off factor of the two losses in the joint objective. All other hyperparameters are set as introduced in Section 5.1.4. We use the scenario of 3-day data for sensitivity analysis.
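Figure 4: Parameter sensitivity on Chicago taxi and bike volume prediction. (a) Taxi, dimension of pattern representation. (b) Taxi, trade-off factor. (c) Bike, dimension of pattern representation. (d) Bike, trade-off factor.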
For the dimension of pattern representation, we vary the dimension from 4 to 32 in the spatial-temporal memory. The performance on Chicago taxi and bike volume prediction is shown in Figure 4(a) and Figure 4(c), respectively. We find that the performance improves at first but degrades later. One potential reason is that the spatial-temporal memory provides too little information when the dimension is too small, while it can include too much irrelevant information when the dimension is too large. Both scenarios hurt the performance. The experiment on the trade-off factor shows a similar trend. We vary the trade-off factor in Eq. (14); a higher value means higher importance of the spatial-temporal memory. The results on Chicago taxi and bike volume prediction are shown in Figure 4(b) and Figure 4(d), respectively. We see a similar change in performance, improving at first but degrading later.
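Figure 5: Case study on Boston taxi pick-up volume prediction. (a) Similarity values of six regions with respect to each pattern category in the memory. (b)-(g) Actual volume patterns of the six regions.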
To intuitively demonstrate the efficacy of the spatial-temporal memory, we visualize patterns detected in Boston taxi pick-up volume prediction. We also use 3-day data for this case study. We randomly select six regions and show their similarity values with respect to each pattern category in the memory. The results are shown in Figure 5(a). The corresponding actual patterns of each region are shown in Figure 5(b), Figure 5(c), Figure 5(d), Figure 5(e), Figure 5(f), and Figure 5(g), respectively. We can see that the taxi pick-up volume of R1 and R6 is almost zero and their attention weights are also similar (Pattern 1 is activated). The volume distributions of R2, R3, and R5 each have one peak. The peak time of R2 is around 1:00am - 2:00am (Pattern 2 is activated). The patterns and attention weights of R3 and R5 are similar, and their peak time is around 8:00pm - 9:00pm (Pattern 3 is activated). In R4, the volume distribution has two peaks (Pattern 4 is activated): one around 9:00am - 10:00am and the other around 8:00pm - 9:00pm. The results indicate that the memory can distinguish regions with different patterns.
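The attention weights discussed in this case study can be illustrated with a small sketch. It assumes that the similarity between a region and each memory pattern is a dot product normalized by a softmax; this section does not specify the exact similarity function, so treat that choice, the function name, and the toy vectors as assumptions.

```python
import math


def attention_weights(region_repr, memory):
    """Softmax over dot-product similarities between a region and each memory pattern."""
    scores = [sum(r * m for r, m in zip(region_repr, pattern)) for pattern in memory]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical memory with three pattern vectors of dimension 2.
memory = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = attention_weights([2.0, 0.1], memory)
activated = max(range(len(weights)), key=weights.__getitem__)
print(activated, [round(w, 3) for w in weights])
```

The pattern with the largest weight plays the role of the "activated" pattern in the discussion above.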
pH Data | Source | Northeast 1-year | Northeast 3-year | Northeast 7-year | Southwest 1-year | Southwest 3-year | Southwest 7-year | South 1-year | South 3-year | South 7-year |
HA | - | 2.302 | 2.112 | 2.016 | 3.051 | 2.811 | 2.770 | 2.402 | 2.141 | 2.028 |
ARIMA | - | 2.309 | 2.328 | 2.212 | 3.153 | 3.178 | 3.103 | 2.372 | 2.363 | 2.209 |
ST-net | - | 4.536 | 3.806 | 1.850 | 2.694 | 2.094 | 1.008 | 4.237 | 3.951 | 1.662 |
Single-FT | West | 1.236 | 1.128 | 0.862 | 0.935 | 0.716 | 0.683 | 1.138 | 1.043 | 0.837 |
Single-FT | Midwest | 1.048 | 1.004 | 0.806 | 0.791 | 0.653 | 0.622 | 0.970 | 0.951 | 0.793 |
Single-FT | Pacific | 1.249 | 1.140 | 0.854 | 0.928 | 0.711 | 0.671 | 1.172 | 1.031 | 0.835 |
Multi-FT | - | 1.010 | 0.987 | 0.798 | 0.706 | 0.587 | 0.567 | 0.909 | 0.898 | 0.730 |
RegionTrans | West | 1.233 | 1.115 | 0.853 | 0.924 | 0.698 | 0.682 | 1.121 | 0.993 | 0.826 |
RegionTrans | Midwest | 1.047 | 0.988 | 0.796 | 0.783 | 0.651 | 0.619 | 0.965 | 0.938 | 0.769 |
RegionTrans | Pacific | 1.243 | 1.098 | 0.851 | 0.916 | 0.693 | 0.652 | 1.128 | 1.012 | 0.813 |
MAML | - | 0.997 | 0.955 | 0.784 | 0.701 | 0.579 | 0.559 | 0.907 | 0.897 | 0.710 |
MetaST | - | 0.903** | 0.898** | 0.758** | 0.649** | 0.541** | 0.514** | 0.820** | 0.803** | 0.650** |
Relative Improvement | - | 9.4% | 6.0% | 3.3% | 7.4% | 6.6% | 8.1% | 9.6% | 10.5% | 8.5% |
** (*) means the result is significant according to Student’s T-test at level 0.01 (0.05) compared to MAML
The second application studied in this work is water quality prediction. In this scenario, water quality is represented by the pH value of water, because pH is easier to measure than other chemical metrics of water quality. We aim to predict the pH value at a specific location for the next month (i.e., the time interval of water quality prediction is one month), as a change in pH indicates relative changes in the chemical composition of the water. RMSE is again used as the evaluation metric in this task.
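For reference, RMSE (root mean squared error) can be computed as in the short sketch below. The function name and the sample pH values are illustrative only and are not taken from the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical monthly pH observations and predictions for one region.
observed = [7.0, 7.5, 8.0]
predicted = [7.0, 7.0, 9.0]
error = rmse(observed, predicted)
print(round(error, 4))
```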
The data used in this experiment are collected from the Water Quality Portal (https://www.waterqualitydata.us/) of the National Water Quality Monitoring Council. The data span about 52 years, from 1966 to 2017. Each record represents one surface water sample with longitude, latitude, date, and pH value in a standard unit. The continental U.S. is roughly split into six areas: Pacific, West, Midwest, Northeast, Southwest, and South. Note that, in the water quality prediction task, the areas play the role of the cities in the previous descriptions.
In addition, due to the sparsity of sampling points, we split each area into a grid region map, with each grid cell covering a fixed extent in degrees of latitude and longitude. Thus, the map sizes of the six areas are 25×50, 30×25, 35×25, 30×25, 50×25, and 45×25, respectively. The pH value of each region is represented by the median value of all sampling points in this region. We select Pacific, West, and Midwest as source areas, and Northeast, Southwest, and South as target areas.
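The gridding step can be sketched as follows: each sample is assigned to a grid cell by its latitude and longitude, and each cell is represented by the median pH of its samples. The cell sizes, area origin, and sample values below are hypothetical, since the actual cell size is not recoverable from the text here.

```python
import math
from collections import defaultdict
from statistics import median


def grid_median(samples, lat_min, lon_min, cell_lat_deg, cell_lon_deg):
    """Map (lat, lon, pH) samples to grid cells and return the median pH per cell."""
    cells = defaultdict(list)
    for lat, lon, ph in samples:
        row = math.floor((lat - lat_min) / cell_lat_deg)
        col = math.floor((lon - lon_min) / cell_lon_deg)
        cells[(row, col)].append(ph)
    return {cell: median(values) for cell, values in cells.items()}


# Hypothetical samples: (latitude, longitude, pH).
samples = [
    (40.1, -75.2, 7.0),
    (40.3, -75.4, 7.4),
    (40.2, -75.1, 9.0),
    (41.6, -74.2, 6.5),
]
# Assumed 1-degree cells with the area origin at (40.0, -76.0).
region_ph = grid_median(samples, lat_min=40.0, lon_min=-76.0,
                        cell_lat_deg=1.0, cell_lon_deg=1.0)
print(region_ph)  # {(0, 0): 7.4, (1, 1): 6.5}
```

Using the median rather than the mean keeps a single outlying sample (9.0 in the first cell) from dominating the region value.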
We use the same baselines as in the traffic prediction experiments. Both non-transfer and transfer methods are used to evaluate our algorithm. Note that, when calculating HA, the relative time is monthly. For example, if we want to predict the water quality in May, HA is the average of all pH values in May in the training data.
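The monthly HA baseline described above amounts to a per-month average over the training data. A minimal sketch, with hypothetical records and a function name of our choosing:

```python
from collections import defaultdict


def monthly_historical_average(records):
    """Average the values of each calendar month over all training records.

    records: iterable of (month, value) pairs with month in 1..12.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for month, value in records:
        sums[month] += value
        counts[month] += 1
    return {month: sums[month] / counts[month] for month in sums}


# Hypothetical training records for one region: (month, pH).
training = [(5, 7.2), (5, 7.6), (6, 8.0), (5, 7.4)]
ha = monthly_historical_average(training)
may_prediction = ha[5]  # HA prediction for any future May
print(round(may_prediction, 2))
```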
Hyperparameter Setting. Similar to the traffic prediction application, we do have external features in the water quality prediction task. For the learning process of the spatial-temporal framework, we set the maximum number of iterations to 5000, the number of filters in the CNN to 32, the dimension of the LSTM hidden states to 64, and the size of the memory representation to 4. Other hyperparameters are the same as in traffic prediction.
Spatial-temporal Clustering. Our analysis of the data shows that pH in the current month is strongly correlated with pH in the same month of the previous year. Thus, we use the 12-month periodic pattern of each region. As in the traffic prediction task, K-means is used to cluster these patterns into 3 groups, with DTW as the distance measure.
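As an illustration of the distance used for clustering, the sketch below implements the classic dynamic-programming form of DTW and the assignment step that maps a pattern to its nearest cluster center. It covers only the distance and assignment; the center update of K-means is omitted, and the series are hypothetical short examples rather than real 12-month patterns.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as the local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] and b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def nearest_center(series, centers):
    """Index of the cluster center closest to the series under DTW."""
    return min(range(len(centers)), key=lambda k: dtw_distance(series, centers[k]))


# A time-shifted copy of a pattern has zero DTW distance to the original.
shifted = dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])

# Hypothetical cluster centers: a flat pattern and an oscillating one.
centers = [[7.0, 7.0, 7.0, 7.0], [7.0, 8.0, 7.0, 8.0]]
label = nearest_center([7.1, 7.0, 7.0, 6.9], centers)
print(shifted, label)  # 0.0 0
```

The zero distance for the shifted series shows why DTW suits periodic patterns whose peaks occur at slightly different times.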
We implement our model and compare it with the baselines on the water quality prediction task. We run 20 testing tasks and report the average values in Table 5. The last row indicates the relative improvement compared with MAML. Most experimental results and their explanations are similar to those for traffic prediction. In addition, from this table, we observe that:
Compared with the Multi-FT model, MAML improves performance only slightly in most cases. A potential reason is that the regions in the source areas may be homogeneous in short-term performance. Thus, compared with MAML, which learns an initialization, simply mixing all samples (i.e., Multi-FT) may not significantly hurt the performance.
MetaST achieves the best performance among all methods, with an average relative improvement of 7.7%. This is because MetaST provides more detailed long-term information through the spatial-temporal memory and distills this long-term information to the target city, which explicitly increases the diversity of regions in the source domain.
Following the same steps as in the traffic prediction task, we investigate the effect of the dimension of pattern representation and the trade-off factor of the two losses in the joint objective on MetaST performance. The performance on water quality prediction as the dimension varies from 2 to 32 and as the trade-off factor varies is shown in Figure 6(a) and Figure 6(b), respectively. Both figures are evaluated on Northeast water quality prediction with 3-year training data. MetaST achieves its best performance at a specific combination of the two values (see Figure 6). As in traffic prediction, the results indicate that a suitable selection of the dimension and the trade-off factor leads to the best performance.
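Figure 6: Parameter sensitivity on Northeast water quality prediction with 3-year training data. (a) Dimension of pattern representation. (b) Trade-off factor.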
In this paper, we study the problem of transferring knowledge from multiple cities for spatial-temporal prediction. We propose a novel MetaST model, which leverages knowledge learned from multiple cities to help with prediction in data-scarce target cities. Specifically, the proposed model learns a well-generalized initialization of the spatial-temporal prediction model for easier adaptation. In addition, MetaST learns a global pattern-based spatial-temporal memory from all source cities. We test our model on two spatial-temporal prediction problems from two different domains: traffic prediction and environment prediction. The results show the effectiveness of our proposed model.
For future work, we plan to pursue two directions: (1) We plan to further consider network structure (e.g., road structure) and combine it with our proposed model. For example, a simple extension is to use a graph convolutional network as our base model; (2) We plan to explain the black-box transfer learning framework and analyze which information is transferred (e.g., spatial correlation, region functionality).