Recurrent Multi-Graph Neural Networks for Travel Cost Prediction

11/13/2018 ∙ by Jilin Hu, et al. ∙ Aalborg University

Origin-destination (OD) matrices are often used in urban planning, where a city is partitioned into regions and an element (i, j) in an OD matrix records the cost (e.g., travel time, fuel consumption, or travel speed) from region i to region j. In this paper, we partition a day into multiple intervals, e.g., 96 15-min intervals, and each interval is associated with an OD matrix that represents the costs in the interval. We consider sparse and stochastic OD matrices, where the elements represent stochastic rather than deterministic costs and some elements are missing due to a lack of data between two regions. We solve the sparse, stochastic OD matrix forecasting problem: given a sequence of historical OD matrices that are sparse, we aim at predicting future OD matrices with no empty elements. We propose a generic learning framework that solves the problem by handling sparse matrices via matrix factorization and two graph convolutional neural networks and by capturing temporal dynamics via a recurrent neural network. Empirical studies using two taxi datasets from different countries verify the effectiveness of the proposed framework.






1. Introduction

Origin-destination (OD) matrices (Ekowicaksono et al., 2016; Hu et al., 2018)

are applied widely in location-based services and online map services (e.g., transportation-as-a-service), where OD matrices are used for scheduling trips, computing payments for completed trips, and estimating arrival times. For example, Google Maps and ESRI ArcGIS Online offer OD matrix services that help developers build location-based applications. Further, increased urbanization makes it increasingly relevant to capture and study city-wide traffic conditions. OD matrices may also be applied for this purpose.

To use OD-matrices, a city is partitioned into regions, and a day is partitioned into intervals. Each interval is assigned its own OD-matrix, and an element (i, j) in the matrix describes the attribute (e.g., travel speed, fuel consumption (Guo et al., 2015, 2012), or travel demand) of travel from region i to region j during the interval that the matrix represents. Different approaches can be applied to partition a road network (Ding et al., 2016; Yang et al., 2009), e.g., using a uniform grid or using major roads, as exemplified in Figure 1. In this paper, we focus on speed matrices. However, the proposed techniques can be applied to other travel attributes or costs, such as travel time, fuel consumption, and travel demand.

(a) Grid-based Partition
(b) Road-based Partition
Figure 1. Partition a City into Regions

As part of the increasing digitization of transportation, increasingly vast volumes of vehicle trajectory data are becoming available (Guo et al., 2014; Ding et al., 2015). We aim to exploit such data for composing OD matrices. Specifically, an element (i, j) of a speed matrix for a given time interval can be instantiated from the speeds observed in trajectories that went from region i to region j during that time interval.

We consider stochastic OD matrices, where the elements represent uncertain costs by means of cost distributions rather than deterministic, single-valued costs. The use of distributions models reality better and enables more reliable decision-making. For example, element (i, j) may have a speed histogram ⟨([10, 20), 0.5), ([20, 30), 0.3), ([30, 40], 0.2)⟩, meaning that the probability of traveling from region i to region j at a speed of 10-20 km/h is 0.5, at 20-30 km/h is 0.3, and at 30-40 km/h is 0.2. If a passenger needs to go from his home in region i to catch a flight at an airport in region j, and the shortest path from his home to the airport is 20 km, then we are able to derive a travel time (minutes) distribution: ⟨((60, 120], 0.5), ((40, 60], 0.3), ([30, 40], 0.2)⟩. Therefore, the passenger needs to reserve at least 120 minutes to avoid being late. In contrast, using only the average speed yields an average travel time of ca. 54 minutes, which puts the passenger at risk of missing the flight.
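The derivation in this example can be sketched in a few lines of Python; the bucket boundaries ([10, 20), [20, 30), [30, 40] km/h) and the 20 km distance are the (assumed) values of the example:

```python
# Sketch of the speed-histogram-to-travel-time example (assumed bucket bounds).
speed_buckets = [(10.0, 20.0), (20.0, 30.0), (30.0, 40.0)]  # km/h
probs = [0.5, 0.3, 0.2]
distance_km = 20.0

# Each speed range maps to a travel-time range (higher speed, lower time).
time_ranges_min = [(60.0 * distance_km / hi, 60.0 * distance_km / lo)
                   for lo, hi in speed_buckets]
# time_ranges_min == [(60.0, 120.0), (40.0, 60.0), (30.0, 40.0)]

# Worst case over buckets with non-zero probability: reserve 120 minutes.
reserve_min = max(hi for (_, hi), p in zip(time_ranges_min, probs) if p > 0)

# The average speed alone (using bucket midpoints) hides this risk.
avg_speed = sum(p * (lo + hi) / 2.0 for (lo, hi), p in zip(speed_buckets, probs))
avg_time_min = 60.0 * distance_km / avg_speed  # ca. 54.5 minutes
```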

We address the problem of stochastic origin-destination matrix forecasting—based on historical stochastic OD-matrices, we predict future OD-matrices. Figure 2 shows a specific example: given stochastic OD-matrices for 3 historical intervals I_{t−2}, I_{t−1}, and I_t, we aim at predicting the stochastic OD-matrices for the 3 future intervals I_{t+1}, I_{t+2}, and I_{t+3}.

Figure 2. Stochastic Origin-Destination Matrix Forecasting

Here, a stochastic OD-matrix is represented as a 3-dimensional tensor, where the first dimension represents source regions, the second dimension represents destination regions, and the third dimension represents cost ranges. For example, Figure 2(b) shows the stochastic OD-matrix for interval I_t, which is represented as an 8 × 8 × 3 tensor with 8 source regions, 8 destination regions, and 3 speed (km/h) ranges [10, 20), [20, 30), and [30, 40]. Element (8, 8) in the OD-matrix is the vector (0.3, 0.5, 0.2), meaning that, when traveling within region 8, the travel speed histogram is ⟨([10, 20), 0.3), ([20, 30), 0.5), ([30, 40], 0.2)⟩.

Solving the stochastic OD-matrix forecasting problem is non-trivial as it is necessary to contend with two difficult challenges.
(1) Data Sparseness.

To instantiate a stochastic OD-matrix in an interval using trajectories, we need sufficient trajectories for each region pair during the interval. However, even massive trajectory data sets are often spatially and temporally skewed (Wang et al., 2016; Jindal et al., 2017; Wang et al., 2014; Wang et al., 2018a; Guo et al., 2018), making it almost impossible to cover all region pairs for all intervals.

For example, the New York City taxi data set we use in experiments has more than 29 million trips from November and December 2013. Yet, this massive trip set only covers 65% of all “taxizone” pairs in Manhattan, the most densely traversed region in New York City. If we further split the data set along the temporal dimension, e.g., into 15-min intervals, the sparseness problem becomes even more severe.

The data sparseness in turn results in sparse historical stochastic OD-matrices, where some elements are empty (e.g., those elements with “?” in Figure 2(b)). Yet, decision making requires full OD-matrices. The challenge is how to use sparse historical OD-matrices to predict full future OD-matrices.

(2) Spatio-temporal Correlations. Traffic is often spatio-temporally correlated—if a region is congested during a time interval, its neighboring regions are also likely to be congested in subsequent intervals. Thus, to predict accurate OD-matrices, we need to account for such spatio-temporal correlations. However, the OD-matrices themselves do not necessarily capture spatial proximity. No matter which partition method is used, we cannot always guarantee that two geographically adjacent regions are represented by adjacent rows and columns in the matrix. For example, in Figure 1(a), regions 1 and 4 are geographically adjacent, but they are not adjacent in the OD matrices; in Figure 1(b), regions 4 and 7 are adjacent but they are again not adjacent in the OD matrices. This calls for a separate mechanism that is able to take into account the geographical proximity of regions.

We propose a data-driven, end-to-end deep learning framework to forecast stochastic OD matrices that aims to effectively address the challenges caused by data sparseness and spatio-temporal correlations. First, to address the data sparseness challenge, we factorize a sparse OD matrix into two small dense matrices with latent features of the source regions and the destination regions, respectively. Second, we model the spatial relationships among source regions and among destination regions using two graphs. Then, we employ two graph convolutional, recurrent neural networks (GRs) on the two dense matrices to capture the spatio-temporal correlations. Finally, the two GRs predict two small, dense matrices, which we multiply to obtain a full predicted OD-matrix.

To the best of our knowledge, this is the first study of stochastic OD matrix forecasting that contends with data sparseness and spatio-temporal correlations. The study makes four contributions. First, it formalizes the stochastic OD matrix forecasting problem. Second, it proposes a generic framework to solve the problem based on matrix factorization and recurrent neural networks. Third, it extends the framework by embedding spatial correlations using two graph convolutional neural networks. Fourth, it reports extensive experiments using two real-world taxi datasets that offer insight into the effectiveness of the framework.

The remainder of the paper is organized as follows. Section 2 covers related works. Section 3 defines the setting and formalizes the problem. Section 4 introduces a basic framework and Section 5 presents an advanced framework. Section 6 reports experiments and Section 7 concludes.

2. Related Work

2.1. Travel Cost Forecasting

We consider three types of travel cost forecasting methods in turn: segment-based methods (Rice and Van Zwet, 2001; Yuan et al., 2011; Yang et al., 2013; Wu et al., 2004; Yang et al., 2014a, 2015; Hu et al., 2017), path-based methods (Yang et al., 2018; Dai et al., 2016; Wang et al., 2014; Wang et al., 2018a; Wang et al., 2018b; Zhang et al., 2018), and OD-based methods (Wang et al., 2016; Li et al., 2018a; Jindal et al., 2017).

Segment-based methods focus on predicting the travel costs of individual road segments. For example, by modeling the travel costs of a road segment as a time series, techniques such as time-varying linear regression (Rice and Van Zwet, 2001), Markov models (Yuan et al., 2011; Yang et al., 2013), and support vector regression (Wu et al., 2004) can be applied to predict future travel costs. Most such models consider time series from different edges independently. As an exception, the spatio-temporal Hidden Markov model (Yang et al., 2013) takes into account the correlations among the costs of different edges. Some other studies focus on estimating high-resolution travel costs, such as uncertain costs (Yang et al., 2014a; Hu et al., 2017) and personalized costs (Yang et al., 2015; Dai et al., 2015). The data sparseness problem has also been studied; methods exist to estimate travel costs for segments without traffic data (Hu et al., 2019; Yang et al., 2014b).

Path-based methods focus on predicting the travel costs of paths. A naive approach is to predict the costs of the edges in a path and then aggregate the costs. However, this approach is inaccurate since it ignores the dependencies among the costs of different edges in paths (Dai et al., 2016; Wang et al., 2014). Other methods (Yang et al., 2018; Dai et al., 2016; Wang et al., 2014) use sub-trajectories to capture such dependencies and thus to provide more accurate travel costs for paths. In particular, the PAth-CEntric (PACE) model is proposed to utilize sub-trajectories that may overlap to achieve the optimal accuracy (Yang et al., 2018; Dai et al., 2016), whereas the other study only considers non-overlapping sub-trajectories (Wang et al., 2014). A few studies propose variations of deep neural networks (Wang et al., 2018a; Wang et al., 2018b; Zhang et al., 2018) to enable accurate travel-time prediction for paths.

Finally, OD-based methods aim at predicting the travel cost of given OD pairs. Our proposal falls into this category. A simple and efficient baseline (Wang et al., 2016) is to compute a weighted average over all historical trajectories that represent travel from the origin to the destination of an OD pair. However, it does not address data sparseness: if no data is available for a given OD pair, it cannot provide a prediction. In contrast, our proposal is able to predict full OD-matrices without empty elements based on historical, sparse OD-matrices. A recent study (Li et al., 2018a) utilizes deep learning and multi-task learning to predict OD travel times while taking into account the road network topology and the paths used in the historical trajectories. However, path information may not always be available; an example is the New York taxi data set that we use in the experiments. This reduces the applicability of the model. In contrast, our proposal does not require path information. Further, existing proposals support only deterministic costs, while our proposal also supports stochastic costs.

2.2. Graph Convolutional Neural Network

Convolutional Neural Networks (CNNs) have been used successfully in the contexts of images (Sermanet et al., 2012), videos (Le et al., 2011), speech (Hinton et al., 2012), time series (Cirstea et al., 2018; Kieu et al., 2018a), and trajectories (Kieu et al., 2018b), where the underlying data is represented as a matrix (Defferrard et al., 2016; Bruna et al., 2013). For example, when representing an image as a matrix, nearby elements, e.g., pixels, share local features, e.g., they represent parts of the same object. In contrast, in our setting, an OD-matrix may not satisfy the assumption that helps make CNNs work: two adjacent rows in an OD matrix may represent two geographically distant regions and may not share any features, while two separated rows may represent geographically close regions that share many features.

Graph convolutional neural networks (GCNNs) (Defferrard et al., 2016; Bruna et al., 2013) aim to address this challenge. In particular, the geographical relationships among regions can be modeled as a graph, and GCNNs then take the graph into account while learning. One study (Kipf and Welling, 2016) applies GCNNs to solve semi-supervised classification in the setting of citation networks and knowledge graphs. A follow-up study considers semi-supervised classification via dual graph convolutional networks (Zhuang and Ma, 2018). Another study (Yu et al., 2017) combines GCNNs with a Recurrent Neural Network (RNN) to forecast traffic, and one recent study (Hu et al., 2019) utilizes GCNNs to fill in travel times for edges without traffic data. All of the above studies consider a setting where only one dimension needs to be modeled as a graph. In contrast, in our study, both dimensions, i.e., the source region dimension and the destination region dimension, need to be modeled as graphs. An additional, recent study focuses on so-called geometric matrix completion, which considers a similar setting where two dimensions need to be modeled as two graphs. It uses multi-graph neural networks (Monti et al., 2017) with RNNs. However, the RNNs in that study are utilized to perform iterations that approximate the matrix completion, not to capture temporal dynamics as in our study. To the best of our knowledge, our study is the first to construct a learning framework that involves dual-graph convolution and employs RNNs to forecast the future.

3. Preliminaries

3.1. OD Stochastic Speed Tensor

A trip is defined as a tuple (o, d, t, l, c), where o and d denote an origin and a destination, t is a departure time, l represents the trip distance, and c is the travel time of the trip. Given l and c, we derive the average travel speed of the trip. We use T to denote a set of historical trips.

To capture the time-dependent traffic, we partition the time domain of interest, e.g., a day, into a number of time intervals, e.g., 96 15-min intervals. For each time interval I_k, we obtain the set of historical trips T_k ⊆ T whose departure times belong to time interval I_k.

We further partition a city into regions R = {r_1, …, r_N}. An Origin-Destination (OD) pair (r_i, r_j) is defined as a pair of regions, where r_i, r_j ∈ R.

Given a time interval I_k and two regions r_i and r_j, we obtain a trip set T_k^(i,j) ⊆ T_k, meaning that each trip in T_k^(i,j) starts from region r_i, at a time in interval I_k, and ends at region r_j.

Next, we construct an equi-width histogram to record the stochastic speed of trips in T_k^(i,j). In particular, an equi-width histogram is a set of bucket-probability pairs ⟨(b_s, p_s)⟩. A bucket b_s = [l_s, u_s) represents the speed range from l_s to u_s, and all buckets have the same range size. Probability p_s is the probability that the average speed of a trip falls into the range [l_s, u_s). For example, the speed histogram ⟨([10, 20), 0.5), ([20, 30), 0.3), ([30, 40], 0.2)⟩ for T_k^(i,j) means that the probabilities that the average speed (km/h) of a trip in T_k^(i,j) falls into [10, 20), [20, 30), and [30, 40] are 0.5, 0.3, and 0.2, respectively.
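A minimal sketch of constructing such an equi-width histogram from observed trip speeds; the bucket range (10-40 km/h) and bucket count are assumptions for illustration:

```python
import numpy as np

def equi_width_histogram(speeds, lo=10.0, hi=40.0, num_buckets=3):
    """Build an equi-width speed histogram: a list of (bucket, probability)
    pairs as in the definition above. Bucket boundaries are assumptions."""
    edges = np.linspace(lo, hi, num_buckets + 1)
    counts, _ = np.histogram(speeds, bins=edges)
    probs = counts / counts.sum()  # an empty trip set would yield 0/0 here
    return [((edges[s], edges[s + 1]), probs[s]) for s in range(num_buckets)]

# Average speeds (km/h) of trips in a hypothetical trip set T_k^(i,j).
hist = equi_width_histogram([12, 15, 18, 21, 25, 28, 31, 33, 19, 14])
```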

Definition 3.1 (OD Stochastic Speed Tensor).

Given a time interval I_k, an OD stochastic speed tensor is defined as a tensor M_k ∈ ℝ^(N × N′ × S), where the first and second dimensions range over the origin and destination regions, respectively, and the third dimension ranges over the speed buckets. For generality, the origin and destination regions can be the same or can be different; thus, the first and second dimensions have N and N′ instances, respectively. The third dimension comprises S speed buckets.

M_k[i, j, s] represents an element of tensor M_k and denotes the probability of trips in T_k^(i,j) traveling at an average speed that falls into the s-th bucket.

Following the example in Figure 2(b), given a time interval I_k, for origin region 7 and destination region 8, we obtain the stochastic speed of trips as a histogram, in which the first bucket records that the probability of trips, starting at region 7 during time interval I_k and ending at region 8, traveling at an average speed in [10, 20) km/h is 0.3.

As shown in Figure 2(b), not all cells have a histogram that captures the stochastic speed. Specifically, the cells with question marks have no histograms because no trip records are available for those cells, i.e., T_k^(i,j) = ∅. We refer to such a tensor as a sparse OD stochastic speed tensor.

Given a time interval , we refer to a tensor where each cell has a stochastic speed as a full OD stochastic speed tensor.

3.2. Problem Definition

Given sparse OD stochastic speed tensors M_{k−h+1}, …, M_k observed during the h historical time intervals I_{k−h+1}, …, I_k, we aim to predict the stochastic speeds for the next f time intervals I_{k+1}, …, I_{k+f} in the form of full OD stochastic speed tensors M̂_{k+1}, …, M̂_{k+f}, i.e., we learn a function F: (M_{k−h+1}, …, M_k) → (M̂_{k+1}, …, M̂_{k+f}).

4. Basic Stochastic Speed Forecasting

4.1. Framework and Intuition

Figure 3 shows the basic framework for forecasting stochastic speeds, which consists of three steps: Factorization, Forecasting, and Recovering.

For the historical time intervals I_{k−h+1}, …, I_k, we have sparse OD stochastic speed tensors M_{k−h+1}, …, M_k. We factorize each stochastic speed tensor M_t, where k−h+1 ≤ t ≤ k, into two smaller tensors O_t ∈ ℝ^(N × r × S) and D_t ∈ ℝ^(N′ × r × S), where r ≪ min(N, N′). The aim is to use O_t and D_t to approximate M_t. Here, O_t and D_t model the correlated features of stochastic speeds among origin regions and among destination regions, respectively. It is intuitive to assume that stochastic speeds among origin regions and among destination regions share correlated features, as traffic in a region affects the traffic in nearby regions.

The factorization is supported by the intuition underlying low-rank matrix approximation (Monti et al., 2017; Srebro et al., 2005; Srebro and Jaakkola, 2003; Marlin, 2004). Since M_t is a sparse tensor, we aim to find a low-rank tensor M̃_t to approximate M_t. When carrying out the approximation, we assume that the rank of M̃_t is at most r and that it can be factorized as M̃_t = O_t ⊗ D_t, where ⊗ denotes the per-bucket matrix product O_t[:, :, s] · D_t[:, :, s]^T, 1 ≤ s ≤ S. Then, the problem of using M̃_t to approximate M_t can be formulated as the problem of minimizing the following loss function:

J = ‖W_t ⊙ (M_t − O_t ⊗ D_t)‖_F²,   (1)

where ‖·‖_F denotes the Frobenius norm, W_t is an indication tensor with W_t[i, j, s] = 1 if element (i, j, s) of M_t is not empty and 0 otherwise, and ⊙ is element-wise tensor multiplication.
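The masked loss of Equation 1 can be sketched as follows, with the per-bucket product realized via an einsum; the toy shapes and random data are illustrative, not the paper's implementation:

```python
import numpy as np

def masked_factorization_loss(M, W, O, D):
    """Equation 1 (sketch): per speed bucket s, approximate the sparse slice
    M[:, :, s] by O[:, :, s] @ D[:, :, s].T, and measure the squared
    Frobenius error only on observed elements (W = 1)."""
    approx = np.einsum('irs,jrs->ijs', O, D)  # per-bucket O_s @ D_s^T
    return np.sum((W * (M - approx)) ** 2)

# Toy shapes: N = 4 origins, N' = 5 destinations, r = 2, S = 3 buckets.
rng = np.random.default_rng(0)
O = rng.normal(size=(4, 2, 3))
D = rng.normal(size=(5, 2, 3))
M = np.einsum('irs,jrs->ijs', O, D)            # an exactly low-rank tensor
W = (rng.random(M.shape) < 0.6).astype(float)  # ca. 60% of elements observed
loss = masked_factorization_loss(M, W, O, D)   # 0 for an exact factorization
```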

Next, we consider O_{k−h+1}, …, O_k as an input sequence that captures the temporal correlations among the origin regions. We feed this input sequence into a sequence-to-sequence RNN model (Sutskever et al., 2014) to forecast an output sequence O_{k+1}, …, O_{k+f} that represents the shared features among the origin regions in the future.

We apply a similar procedure to D_{k−h+1}, …, D_k to forecast an output sequence D_{k+1}, …, D_{k+f} that represents the shared features among the destination regions in the future.

Finally, we recover M̂_t as a full OD stochastic speed tensor from O_t and D_t, k+1 ≤ t ≤ k+f. Since we obtain the predictions O_t and D_t from the historical O and D tensors, the intuition of Equation 1 also applies when reconstructing M̂_t.

Figure 3. Framework Overview.

4.2. Factorization

Given an input sparse OD stochastic speed tensor M_t for an interval I_t, where k−h+1 ≤ t ≤ k, we proceed to describe the method for factorizing M_t into O_t and D_t, which capture the correlated features of stochastic speeds among origin and destination regions, respectively. We first flatten M_t into a vector m_t of length N · N′ · S, from which we generate two small factorization vectors, o_t and d_t, via a fully-connected neural network layer (FC layer):

o_t = relu(U_o · m_t + b_o) and d_t = relu(U_d · m_t + b_d).

Here, U_o and U_d are parameter matrices whose sizes depend on the hyper-parameter r to be set; b_o and b_d are bias vectors; and relu is the rectified linear unit activation function.

Next, we reorganize the factorization vectors o_t and d_t into factorization tensors O_t ∈ ℝ^(N × r × S) and D_t ∈ ℝ^(N′ × r × S), respectively.
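The factorization step can be sketched as follows; the weight matrices here are untrained, randomly initialized stand-ins for the learned FC parameters:

```python
import numpy as np

def fc_factorize(M, r, rng):
    """FC factorization sketch: flatten M (N x N' x S), apply two FC layers
    with relu, and reshape the outputs into O (N x r x S) and D (N' x r x S).
    The weights are untrained stand-ins for learned parameters."""
    N, Np, S = M.shape
    m = M.reshape(-1)                         # flatten to length N * N' * S
    U_o = rng.normal(scale=0.01, size=(N * r * S, m.size))
    U_d = rng.normal(scale=0.01, size=(Np * r * S, m.size))
    b_o, b_d = np.zeros(N * r * S), np.zeros(Np * r * S)
    o = np.maximum(U_o @ m + b_o, 0.0)        # relu
    d = np.maximum(U_d @ m + b_d, 0.0)
    return o.reshape(N, r, S), d.reshape(Np, r, S)

M = np.random.default_rng(1).random((4, 5, 3))  # toy sparse tensor stand-in
O, D = fc_factorize(M, r=2, rng=np.random.default_rng(2))
```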

4.3. Forecasting

Given the historical time intervals I_{k−h+1}, …, I_k, we learn the temporal correlations of the OD tensors from the temporal correlations among the origin-region tensors O_{k−h+1}, …, O_k and among the destination-region tensors D_{k−h+1}, …, D_k.

Based on O_{k−h+1}, …, O_k, we use a sequence-to-sequence RNN model (Sutskever et al., 2014) to forecast O_{k+1}, …, O_{k+f} for the future time intervals I_{k+1}, …, I_{k+f}. In particular, we apply Gated Recurrent Units (GRUs) in the RNN architecture, since they capture temporal correlations well using gate units and also offer high efficiency (Cho et al., 2014; Chung et al., 2014). Each GRU cell follows the standard formulation: an update gate u_t = σ(W_u · [x_t, h_{t−1}] + b_u), a reset gate r_t = σ(W_r · [x_t, h_{t−1}] + b_r), a candidate state c_t = tanh(W_c · [x_t, r_t ∘ h_{t−1}] + b_c), and the output h_t = u_t ∘ h_{t−1} + (1 − u_t) ∘ c_t.

A similar procedure is applied to obtain D_{k+1}, …, D_{k+f} from D_{k−h+1}, …, D_k.
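A minimal sketch of one standard GRU step on flattened factorization features, with untrained random weights standing in for learned parameters:

```python
import numpy as np

def gru_step(x, h, params):
    """One standard GRU step (sketch); in the framework, each flattened
    O_t / D_t tensor would serve as the step input x."""
    Wu, Wr, Wc, bu, br, bc = params
    xh = np.concatenate([x, h])
    u = 1.0 / (1.0 + np.exp(-(Wu @ xh + bu)))          # update gate
    r = 1.0 / (1.0 + np.exp(-(Wr @ xh + br)))          # reset gate
    c = np.tanh(Wc @ np.concatenate([x, r * h]) + bc)  # candidate state
    return u * h + (1.0 - u) * c

d_in, d_h = 6, 4
rng = np.random.default_rng(0)
params = (rng.normal(size=(d_h, d_in + d_h)),
          rng.normal(size=(d_h, d_in + d_h)),
          rng.normal(size=(d_h, d_in + d_h)),
          np.zeros(d_h), np.zeros(d_h), np.zeros(d_h))
h = np.zeros(d_h)
for x in rng.normal(size=(3, d_in)):  # encode a length-3 input sequence
    h = gru_step(x, h, params)
```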

4.4. Recovery

Given predicted tensors O_t and D_t for a future time interval I_t, with k+1 ≤ t ≤ k+f, we proceed to describe how to transform O_t and D_t into a full OD stochastic speed tensor M̂_t.

First, we slice each of O_t and D_t by the speed bucket dimension into S matrices. Specifically, we have O_t = (O_t^(1), …, O_t^(S)) and D_t = (D_t^(1), …, D_t^(S)), where O_t^(s) is an N × r matrix and D_t^(s) is an N′ × r matrix, 1 ≤ s ≤ S.

Next, we conduct a matrix multiplication:

Z_t^(s) = O_t^(s) · (D_t^(s))^T,

where Z_t^(s) is an N × N′ matrix, 1 ≤ s ≤ S.

Finally, we construct a tensor Z_t by combining the S matrices, i.e., Z_t = (Z_t^(1), …, Z_t^(S)). Now, Z_t is a full tensor where each element has a value.

A histogram must meet two requirements to be meaningful: (1) 0 ≤ M̂_t[i, j, s] ≤ 1, meaning that the probability of a speed falling into the s-th bucket for each OD pair must be between 0 and 1; and (2) Σ_s M̂_t[i, j, s] = 1, meaning that the probabilities of a speed falling into the buckets must sum to 1 for each OD pair.

To achieve this, we apply a softmax function over the bucket dimension to normalize the values in Z_t into M̂_t, which satisfies the histogram requirements:

M̂_t[i, j, s] = exp(Z_t[i, j, s]) / Σ_{s′} exp(Z_t[i, j, s′]).

Thus, we obtain meaningful full OD stochastic speed tensors for the future time intervals as the output of the recovery process: M̂_{k+1}, …, M̂_{k+f}.
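The recovery step, a per-bucket matrix product followed by a softmax over the bucket dimension, can be sketched as follows (toy shapes and random inputs for illustration):

```python
import numpy as np

def recover(O, D):
    """Recovery sketch: per-bucket matrix product, then a softmax over the
    bucket dimension so every element holds a valid histogram."""
    Z = np.einsum('irs,jrs->ijs', O, D)   # Z_s = O_s @ D_s^T for each bucket s
    Z = Z - Z.max(axis=2, keepdims=True)  # subtract max for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=2, keepdims=True)

# Toy shapes: N = 4 origins, N' = 6 destinations, r = 5, S = 2 buckets.
rng = np.random.default_rng(3)
M_hat = recover(rng.normal(size=(4, 5, 2)), rng.normal(size=(6, 5, 2)))
```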

4.5. Loss Function

The loss function is defined as the error between the recovered future tensors and the ground-truth future tensors:

L = Σ_{t=k+1}^{k+f} ‖W_t ⊙ (M̂_t − M_t)‖_F² + λ(‖Θ_O‖² + ‖Θ_D‖²),

where Θ_O and Θ_D represent the training parameters in the framework and λ is a regularization parameter. Further, W_t is an indication tensor, where W_t[i, j, s] = 1 if the OD pair (i, j) is not empty in the t-th future interval and 0 otherwise. Note that although we aim to predict full tensors, the ground-truth tensors are sparse, so we compute the errors taking into account only the non-empty elements in the ground-truth tensors. Next, ⊙ is element-wise multiplication, M̂_t and M_t are the predicted and ground-truth tensors, respectively, and ‖·‖_F is the Frobenius norm.

5. Forecast with Spatial Dependency

To improve forecast accuracy, we proceed to integrate spatial dependencies into our framework at two different stages. First, in the factorization step, we apply graph convolutional neural networks to perform feature encoding for the origin and destination dimensions, respectively. Second, in the forecasting step, we integrate graph convolution with RNNs to capture spatio-temporal correlations.

5.1. Spatial Factorization

As in Section 4.2, we aim to factorize the tensor M_t during interval I_t, k−h+1 ≤ t ≤ k, into two smaller tensors O_t and D_t. In Section 4.2, M_t is simply flattened and fed to a fully-connected layer to construct O_t and D_t. This process does not take into account the spatial correlations among the origin regions and among the destination regions, although such correlations are likely to exist. To accommodate them, we first capture the spatial correlations among origin and destination regions; then we use the captured correlations to conduct the factorization.

5.1.1. Spatial Correlation

We leverage the notion of a proximity matrix (Li et al., 2018b) to capture spatial correlations. We proceed to present the idea using origin regions as an example, which also applies to destination regions in a similar manner.

Given M_t, we have N origin regions, from which we build an adjacency matrix A to capture region connections. Specifically, A[i, j] = 1 means that regions r_i and r_j are adjacent; otherwise, A[i, j] = 0.

We construct a weighted proximity matrix P from A that describes the proximity between regions r_i and r_j and is parameterized by the number of adjacency hops K_hop and the standard deviation σ. Specifically, if r_j can be reached from r_i within K_hop adjacency hops using A, we set P[i, j] = exp(−dist(r_i, r_j)² / σ²), where dist(r_i, r_j) is the distance between the centroids of r_i and r_j; otherwise, P[i, j] = 0. In the experiments, we study the effect of K_hop and σ (see Section 6.2.4). The proximity matrix is symmetric and non-negative.
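A sketch of constructing the proximity matrix; the Gaussian kernel exp(−dist² / σ²) follows the description above, while the BFS-style hop computation is one possible realization:

```python
import numpy as np

def proximity_matrix(A, centroids, hops, sigma):
    """Proximity matrix P (sketch): Gaussian kernel of centroid distances for
    region pairs reachable within `hops` adjacency hops, 0 otherwise."""
    n = A.shape[0]
    reach = np.eye(n, dtype=bool)
    frontier = np.eye(n, dtype=bool)
    for _ in range(hops):  # expand reachability one hop at a time
        frontier = (frontier.astype(int) @ A) > 0
        reach |= frontier
    dist = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=2)
    return np.where(reach, np.exp(-(dist ** 2) / sigma ** 2), 0.0)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])  # a chain of 3 regions
P = proximity_matrix(A, np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]),
                     hops=1, sigma=1.0)
```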

The adjacency matrices for the source regions and destination regions may be the same or different. Consider two scenarios. First, we may use OD matrices to model the travel costs within a city. In this case, the source regions and the destination regions are the same, and thus the two adjacency matrices are the same. Second, we may use OD matrices to model the travel costs between two different cities. Then, the source regions and the destination regions are in different cities, and we need two different adjacency matrices. To avoid confusion, we use A_O and A_D to represent the adjacency matrices for source regions and destination regions, respectively.

5.1.2. Factorization

We proceed to show the factorization procedure. Specifically, we show how to obtain O_t from M_t. The same procedure can be applied to obtain D_t.

As shown in Figure 4(a), we first slice M_t along the origin region dimension into N matrices, i.e., M_t = (M_t^(1), …, M_t^(N)). A GCNN operation is then applied to each sliced matrix. Accordingly, we obtain the GCNN outputs O_t^(i), 1 ≤ i ≤ N, which we concatenate to obtain O_t.

Figure 4. Spatial Factorization for O_t

Figure 4(b) shows a GCNN operation on a sliced matrix M_t^(i), which transforms M_t^(i) into O_t^(i) via Filtering and Pooling.

Filtering: Given M_t^(i), we apply graph convolutional filters, which take into account the destination region adjacency matrix A_D, to generate features that capture the correlations among destination regions.

We first slice M_t^(i) into S vectors x_1, …, x_S, where vector x_s, 1 ≤ s ≤ S, represents the probabilities of speeds falling into the s-th speed bucket when traveling from origin region r_i to all destination regions.

Next, we use a specific graph convolutional filter, namely Cheby-Net (Defferrard et al., 2016), due to its high accuracy and efficiency, on each vector x_s. Specifically, before conducting the actual convolutions, we compute X̄_s = (x̄_s^(0), …, x̄_s^(K−1)) from x_s. Here, x̄_s^(0) = x_s, x̄_s^(1) = L̃ · x_s, and x̄_s^(k) = 2 · L̃ · x̄_s^(k−1) − x̄_s^(k−2) when k ≥ 2, where L̃ = 2L/λ_max − I is a scaled Laplacian matrix, L is the Laplacian matrix of A_D, and λ_max is the maximum eigenvalue of L. Here, we use the destination adjacency matrix A_D because x_s represents the speeds from source region r_i to all destination regions, and we use A_D to capture the spatial correlations among the destination regions. After the whole computation, we obtain X̄_s as the encoded features for the s-th bucket that take into account the spatial correlations among destination regions.

Then, we proceed to apply Q filters to X̄_1, …, X̄_S. Each filter is a vector θ_q of K coefficients, 1 ≤ q ≤ Q. We apply each filter to all X̄_s, 1 ≤ s ≤ S, and use the sum as the output of the filter:

y_q = ρ(Σ_{s=1}^{S} X̄_s ⋆ θ_q + b_q),

where ⋆ is the Cheby-Net graph convolution operation, b_q is a bias vector, and ρ is a non-linear activation function.

Finally, we arrange the results obtained from all Q filters as Y = (y_1, …, y_Q), with one column per filter.
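A sketch of the Chebyshev recurrence and filtering for a single bucket vector; the filter coefficients `theta` are untrained stand-ins for the learned θ_q:

```python
import numpy as np

def cheby_conv(x, A, theta):
    """Cheby-Net graph convolution of one bucket vector x (sketch): build the
    scaled Laplacian from adjacency A, form the Chebyshev terms via
    T_k = 2 * L~ * T_{k-1} - T_{k-2}, and combine them with coefficients theta."""
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                          # combinatorial Laplacian
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = 2.0 * L / lam_max - np.eye(A.shape[0])
    terms = [x, L_tilde @ x]                      # T_0(x), T_1(x)
    for _ in range(2, len(theta)):                # higher-order terms
        terms.append(2.0 * L_tilde @ terms[-1] - terms[-2])
    return sum(t * th for t, th in zip(terms, theta))

# A chain of 4 destination regions and a K = 3 filter.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
y = cheby_conv(np.array([1.0, 0.0, 0.0, 0.0]), A, theta=[0.5, 0.3, 0.2])
```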

Pooling: To further condense the features and to construct the final factorization, we apply graph pooling (Defferrard et al., 2016) to Y over the destination region dimension to obtain Ȳ, parameterized by the pooling size and the stride size:

Ȳ = pool(Y),

where pool is the pooling function, which can be either max pooling or average pooling.

Since the pooling operation requires meaningful neighborhood relationships, we identify spatial clusters of destination regions. For example, in Figure 1(b), if we use the order of ascending region ids, i.e., (1, 2, 3, 4, 5, 6, 7, 8) to conduct pooling with a pooling size of 2, then regions 3 and 4 are pooled together. However, regions 3 and 4 are not neighbors, so this procedure may yield inferior features that may in turn yield undesired results. Instead, if we identify clusters of regions, we are able to produce a new order, e.g., (6, 1, 2, 3, 5, 4, 7, 8). When again using a pooling size of 2, each pool contains neighbouring regions.

The GCNN process, including filtering and pooling, is repeated several times with different numbers of filters and different pooling and stride sizes. Eventually, we set the sizes of the last GCNN layer such that we obtain O_t^(i) with r · S values, which we reshape into an r × S matrix.

As shown in Figure 4(b), the last operation is concatenation. We slice M_t along the origin region dimension into N matrices and apply the GCNN to each of them to obtain O_t^(1), …, O_t^(N), where each O_t^(i) is an r × S matrix. We then concatenate the O_t^(i), 1 ≤ i ≤ N, to obtain O_t.

The same procedure can be applied to obtain D_t, where we need to change A_D to A_O when conducting the graph convolution.

5.2. Spatial Forecasting

To model temporal dynamics while preserving the spatial correlations in RNNs, we combine Cheby-Net based graph convolution with RNNs, yielding CNRNNs. Intuitively, we follow the structure of gated recurrent units while replacing the traditional fully connected layers by Cheby-Net based graph convolution layers. Separate CNRNNs are employed to process the O tensors and the D tensors.

Taking the source region dimension as an example, a CNRNN takes O_t as input at time interval I_t, and it predicts O_{t+1} for the future time interval I_{t+1}. This procedure is formulated as follows:

u_t = σ(Θ_u ⋆ [x_t, h_{t−1}] + b_u)
r_t = σ(Θ_r ⋆ [x_t, h_{t−1}] + b_r)
c_t = tanh(Θ_c ⋆ [x_t, r_t ∘ h_{t−1}] + b_c)
h_t = u_t ∘ h_{t−1} + (1 − u_t) ∘ c_t,

where Θ_u, Θ_r, and Θ_c are graph convolution filters; x_t and h_t are the input and output of a CNRNN cell at time interval I_t, respectively; r_t and u_t are the reset and update gates, respectively; ⋆ denotes the graph convolution defined in Equation 8, which here takes into account the source adjacency matrix A_O, since O_t captures features of source regions; ∘ denotes the Hadamard product between two tensors; and σ and tanh are non-linear activation functions.
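A sketch of one CNRNN step on a small chain graph; for brevity, the concatenation [x_t, h_{t−1}] is simplified to a sum, and the Chebyshev coefficients are untrained stand-ins for learned filters:

```python
import numpy as np

def graph_conv(X, L_tilde, theta):
    """Order-K Chebyshev graph convolution applied column-wise (sketch)."""
    terms = [X, L_tilde @ X]
    for _ in range(2, len(theta)):
        terms.append(2.0 * L_tilde @ terms[-1] - terms[-2])
    return sum(t * th for t, th in zip(terms, theta))

def cnrnn_step(x, h, L_tilde, thetas):
    """GRU-style CNRNN cell (sketch): the dense layers of a GRU are replaced
    by graph convolutions over the region dimension. The concatenation of
    [x, h] is simplified to a sum here."""
    th_u, th_r, th_c = thetas
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    u = sig(graph_conv(x + h, L_tilde, th_u))          # update gate
    r = sig(graph_conv(x + h, L_tilde, th_r))          # reset gate
    c = np.tanh(graph_conv(x + r * h, L_tilde, th_c))  # candidate state
    return u * h + (1.0 - u) * c

n = 4                                     # regions on a chain graph
A = np.diag(np.ones(n - 1), 1); A += A.T  # chain adjacency matrix
L = np.diag(A.sum(1)) - A
L_tilde = 2.0 * L / np.linalg.eigvalsh(L).max() - np.eye(n)
h = np.zeros((n, 2))                      # hidden features per region
for x in np.random.default_rng(0).normal(size=(3, n, 2)):
    h = cnrnn_step(x, h, L_tilde, ([0.4, 0.2], [0.4, 0.2], [0.4, 0.2]))
```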

When applying a CNRNN to predict the D tensors, we need to change A_O to A_D when conducting the graph convolution, as the D tensors capture features of destination regions.

Given the predicted factorization tensors O_{k+1}, …, O_{k+f} and D_{k+1}, …, D_{k+f}, we apply the same recovery operation introduced in Section 4.4 to obtain full OD stochastic speed tensors for the future time intervals as the recovery output: M̂_{k+1}, …, M̂_{k+f}.

5.3. Loss Function

Similar to the construction covered in Section 4.5, we present the loss function as follows:

L = Σ_{t=k+1}^{k+f} ( ‖W_t ⊙ (M̂_t − M_t)‖_F² + γ(‖O_t‖_P² + ‖D_t‖_P²) ),

where Θ_O and Θ_D represent the training parameters in the framework (in particular, graph convolutional filters and bias vectors), ‖·‖_P is the Dirichlet norm under the proximity matrix P, and γ is the regularization parameter for the Dirichlet norm. We use the Dirichlet norm because it takes the adjacency matrix into account: nearby regions should share similar features in the dense tensors O_t and D_t. Finally, W_t is an indication tensor, and M̂_t and M_t, k+1 ≤ t ≤ k+f, are the predicted and ground-truth tensors, respectively.

6. Experiments

We describe the experimental setup and then present the experiments and the findings.

6.1. Experimental Setup

6.1.1. Datasets

We conduct experiments on two taxi trip datasets to study the effectiveness of the proposal.

We represent a stochastic speed (m/s) as an equi-width histogram with 7 buckets, and we consider 15-min intervals, thus obtaining 96 intervals per day.

(a) NYC Regions
(b) CD Regions
(c) NYC Speed
(d) CD Speed
Figure 5. Region Representations of NYC and CD

Table 1. Statistics of the Data Sets.
                  NYC          CD
# Trips           14,165,748   3,636,845
# Regions         67           79
Average Speed     8.8 m/s      6.0 m/s

New York City Data Set (NYC): We use 14 million taxi trips collected from 2013-11-01 to 2013-12-31 in Manhattan, New York City. Each trip consists of a pickup time, a drop-off time, a pickup location, a drop-off location, and a total distance. Manhattan has 67 taxizones, each of which is used as a region. The regions are shown in Figure 5(a). The OD stochastic speeds for NYC are represented as a 67 × 67 × 7 tensor.

Chengdu Data Set (CD): CD contains 1.4 billion GPS records from 14,864 taxis collected from 2014-08-03 to 2014-08-30 in Chengdu, China. Each GPS record consists of a taxi ID, a latitude, a longitude, an indicator of whether the taxi is occupied, and a timestamp. We consider sequences of GPS records where taxis were occupied as trips. We use a total of 3,636,845 trips that occurred within the second ring road of Chengdu. We partition Chengdu within the second ring road into 79 regions according to the main roads; see Figure 5(b). The OD stochastic speeds for CD are represented as a 79 × 79 × 7 tensor.

Table 1 shows the statistics of the two datasets. Figures 5(c) and 5(d) show the speed distributions of the two datasets. We use 70% of the data for training, 10% for validation, and the remaining 20% for testing, for both NYC and CD.

6.1.2. Forecast Settings

We consider settings where h = 3 or h = 6 while varying f among 1, 2, and 3. This means that we use stochastic OD matrices from 3 or 6 historical intervals to predict stochastic OD matrices for up to 3 future intervals. An example for h = 6 and f = 3 is: given stochastic OD matrices in intervals [8:00, 8:15), [8:15, 8:30), [8:30, 8:45), [8:45, 9:00), [9:00, 9:15), and [9:15, 9:30), we predict stochastic OD matrices in intervals [9:30, 9:45), [9:45, 10:00), and [10:00, 10:15).

6.1.3. Baselines

To evaluate the effectiveness of the proposed base framework (BF) and the advanced framework (AF), we consider five baselines. (1) Naive Histograms (NH): for each OD pair, we use all travel speed records for the OD pair in the training data set to construct a histogram and use the histogram for predicting the future stochastic speeds. Next, we model the stochastic speeds for each OD pair as a time series of vectors, where each vector represents the stochastic speed of the OD pair in an interval. Based on this time series modeling, we consider three time series forecasting methods: (2) Support Vector Regression (SVR) (Basak et al., 2007), (3) Vector Autoregression (VAR) (Sims, 1980), and (4) Gaussian Process Regression (GP) (Rasmussen, 2004). (5) Fully Connected (FC): a variant of BF where we directly use a fully connected layer to obtain a single dense tensor, instead of performing factorization into two dense tensors, to replace the factorization step in BF.

6.1.4. Evaluation Metrics

To quantify the effectiveness of the proposed frameworks, we use three commonly used distance functions that work for distributions, i.e., Kullback-Leibler divergence (KL), Jensen-Shannon divergence (JS), and earth-mover’s distance (EMD), to measure the accuracy of forecasts.
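The three metrics can be sketched for a pair of speed histograms as follows; the 1-d EMD uses the standard cumulative-distribution formulation for equi-width buckets:

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two histograms,
    with a small epsilon to avoid log(0)."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture."""
    m = (np.asarray(p) + np.asarray(q)) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def emd_1d(p, q):
    """Earth-mover's distance for 1-d histograms over equi-width buckets:
    the L1 distance between the cumulative distributions."""
    return float(np.abs(np.cumsum(np.asarray(p) - np.asarray(q))).sum())

p, q = [0.5, 0.3, 0.2], [0.3, 0.5, 0.2]
scores = (kl(p, q), js(p, q), emd_1d(p, q))
```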

Specifically, the general dissimilarity metric is defined as follows.