RSODP
Origin-Destination Prediction for Ridesharing System
In recent years, ride-hailing services have become increasingly prevalent as they provide huge convenience for passengers. As a fundamental problem, the timely prediction of passenger demands in different regions is vital for effective traffic flow control and route planning. As both spatial and temporal patterns are indispensable for passenger demand prediction, relevant research has evolved from pure time series to graph-structured data for modeling historical passenger demand data, where a snapshot graph is constructed for each time slot by connecting region nodes via different relational edges (e.g., origin-destination relationship, geographical distance, etc.). Consequently, the spatiotemporal passenger demand records naturally carry dynamic patterns in the constructed graphs, where the edges also encode important information about the directions and volumes (i.e., weights) of passenger demands between two connected regions. However, existing graph-based solutions fail to simultaneously consider those three crucial aspects of dynamic, directed, and weighted (DDW) graphs, leading to limited expressiveness when learning graph representations for passenger demand prediction. Therefore, we propose a novel spatiotemporal graph attention network, namely Gallat (Graph prediction with all attention), as a solution. In Gallat, by comprehensively incorporating those three intrinsic properties of DDW graphs, we build three attention layers to fully capture the spatiotemporal dependencies among different regions across all historical time slots. Moreover, the model employs a subtask for pretraining so that it can obtain accurate results more quickly. We evaluate the proposed model on real-world datasets, and our experimental results demonstrate that Gallat outperforms the state-of-the-art approaches.
Transportation plays a very important role in our daily lives. In 2018, commuters in Beijing spent about 112 minutes behind the wheel on average every day. With the prominent development of technologies like GPS and the mobile Internet, various ride-hailing applications, such as Didi, Lyft and Uber, have emerged to provide drivers and passengers with more convenience. For all ride-hailing platforms, analyzing and predicting real-time passenger demand is the key to high-quality services, which has recently started to attract considerable research attention.
Initially, the majority of studies (Tong et al., 2017; Yao et al., 2018; Wang et al., 2019a) treat passenger demand prediction as a time series prediction problem, which predicts the number of passenger demands in an arbitrary location during a given time period. However, such a prediction paradigm only considers the origin of each passenger's trip and neglects the destination information. To account for this important aspect of passenger demand prediction, recent research (Deng et al., 2016; Hu et al., 2020) defines it as an Origin-Destination Matrix Prediction (ODMP) problem. In ODMP, each time slot has its own OD matrix, where the element indexed by describes the travel demand from region to region. In this regard, tensor factorization models (Gong et al., 2018; Hu et al., 2020) and convolutional neural networks (CNNs) (Li et al., 2018a; Liu et al., 2019) can be conveniently adopted to extract latent representations from the OD matrices to support prediction. Additionally, ODMP not only estimates the number of passenger demands within the target area, but also foresees where these demands go, making it easier for ride-hailing companies to coordinate travel resources (i.e., vehicles) to maximize customer satisfaction and business revenue.
More recently, research on passenger demand has introduced a new perspective by modelling the traffic data as graphs where different regions are viewed as nodes. Compared with OD matrices, graph-structured data is able to unify various heterogeneous information to help boost the prediction accuracy. For instance, two region nodes can be connected by different edges representing specific relationships, e.g., an origin-destination relationship in a time slot (Shi et al., 2020), geographical association (e.g., adjacent regions) (Wang et al., 2019b), and even functionality similarity obtained by comparing their point-of-interest distributions (Geng et al., 2019). With the recent advances in graph neural networks (GNNs) (Hamilton et al., 2017; Veličković et al., 2017; Kipf and Welling, 2017), GNN-based models have emerged and yielded state-of-the-art performance in a wide range of passenger demand prediction and traffic modelling tasks (Shi et al., 2020; Wang et al., 2019b; Geng et al., 2019; Cui et al., 2018). On one hand, GNN-based models are capable of mining the complex and heterogeneous relationships between regions, thus thoroughly capturing the latent properties of each region to allow for accurate demand prediction. On the other hand, GNN-based models are versatile as they generalize the convolutional operations in CNNs to non-Euclidean graph topologies without the need for conversion into OD matrices at each time slot.
When modelling traffic data as graphs, a common practice is to construct a snapshot graph for each time slot. Consequently, this results in three major intrinsic properties of such graph-structured data, i.e., the constructed graphs are dynamic, directed, and weighted. In our paper, this notion is referred to as dynamic, directed, and weighted (DDW) graphs, which contain both the spatial and temporal information across regions. From a temporal perspective, DDW graphs are time-sensitive due to complex real-life situations (e.g., peak hour traffic, special events, etc.), making it non-trivial to fully capture their dynamics. In every DDW snapshot graph, two regions are linked via an edge if there are observed trip orders between them, thus allowing GNN-based models to capture signals of spatial passenger flows. However, on top of that, in an origin-destination relationship, the edge between two region nodes should be directional. For example, a central business district tends to have substantially more inbound traffic flows than outbound flows during morning peak hours. Also, as the volume of passenger demand varies largely among different routes, those origin-destination edges should preserve such information as their weights. Unfortunately, existing methods tend to overlook these two important edge properties, leading to severe information loss. For instance, graph-structured data is used to extract temporal features of each region in (Geng et al., 2019), but neither the direction nor the volume of passenger flows is captured by the constructed graphs. Though (Wang et al., 2019b) considers the weights of edges when learning representations for each region node, it simply treats the passenger flows between two nodes from both directions equally without distinguishing their semantics.
Meanwhile, as a widely reported issue in passenger demand research (Xue et al., 2015; Wang et al., 2019b; Hu et al., 2020), the geographically imbalanced distribution of trip orders inevitably incurs sparsity issues within the constructed graphs. Due to varied locations and public resource allocations, the observable passenger flows from/to some regions (e.g., rural areas) are highly scarce. While most existing studies seek solutions by gathering side information from multiple data sources (e.g., points-of-interest (Geng et al., 2019; Wang et al., 2019a), weather (Liao et al., 2018; Wang et al., 2019b), real-time events (Tong et al., 2017), etc.), it is impractical to assume the constant availability and high quality of such auxiliary data. Hence, this further highlights the necessity of fully capturing the latent patterns and comprehensively modelling the information within DDW graphs constructed from historical trip orders.
To this end, we propose Gallat, namely Graph prediction with all attention, which is a novel spatiotemporal graph neural network for passenger demand prediction. Specifically, with self-attention as its main building block, Gallat consists of three main parts: the spatial attention layer, the temporal attention layer, and the transferring attention layer. In the spatial attention layer, we learn each region node's representation by discriminatively aggregating information from its three types of neighbor nodes with an attention mechanism. To be specific, we innovatively define forward and backward neighborhoods respectively for its outbound and inbound travel records to distinguish the different semantics of the two directions. A geographical neighborhood is also defined for each node to gather information from nearby regions and help alleviate data sparsity. In the temporal attention layer, we deploy a multi-channel self-attention mechanism to retrieve important contextual information from the history, which is then used to augment the learned node representations for demand prediction in the next time slot. With the generated representation of each node from the first two layers, we first predict the total number of passenger demands in each region, and then distribute it to all possible destination regions via a transferring probability generated in the final transferring attention layer. By taking advantage of the attention mechanism, our model Gallat is highly expressive, and is able to capture the heterogeneous connectivity among regions to yield optimal prediction performance.
The main contributions of this work are as follows:
We investigate passenger demand from the perspective of DDW graphs, with a comprehensive take on modelling the dynamic, directed, and weighted properties within the passenger flow data simultaneously.
We propose a novel spatiotemporal graph neural network named Gallat, which is an inductive solution to representation learning on DDW graphs for passenger demand prediction.
Extensive experiments are conducted on two realworld datasets, and the results demonstrate the superior effectiveness of our proposed model.
In this section, we provide key definitions and formally define the passenger mobility prediction problem.
Time Slot. We evenly partition the time into a sequence of slots, which are represented as . The interval between any two consecutive slots is constant. For example, we can divide a day into 24 one-hour time slots.
Node. The entire area of interest, such as a city, is divided into non-overlapping regions. Each region is regarded as a node, and the node set of the specific city can be denoted as . Following (Wang et al., 2019b; Hu et al., 2020; Yao et al., 2018), we determine regions by evenly dividing a whole area into grids according to their longitudes and latitudes. Then, we calculate the physical distance between each pair of nodes using their central coordinates, which is stored in an adjacency matrix R. Every element represents the geographical distance between node and node .
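As a concrete illustration, the distance matrix R can be precomputed from the grid centres with the haversine formula. The sketch below is our own (the function names and the use of numpy are not from the paper):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))  # Earth radius ~= 6371 km

def distance_matrix(centers):
    """centers: (n, 2) array of (lat, lon) grid centres -> symmetric (n, n) matrix R."""
    n = len(centers)
    R = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            R[i, j] = haversine_km(*centers[i], *centers[j])
    return R
```

Since the grids are fixed, R is computed once offline and reused across all time slots.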
Dynamic, Directed, and Weighted Graph. In a time slot, the passenger mobility in the region of interest can be modeled as interactions between nodes. Given a fixed region node set , we use to denote the directional edge from to , where is the weight of the edge. In our paper, each is directly defined as the number of passenger demands from region to in a specific time slot. If there are no trip orders from to , then denotes a non-existent edge. We use a sequence of adjacency matrices to represent the DDW graphs in all time slots, where .
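The sequence of DDW snapshot adjacency matrices can be built directly from raw trip orders by counting demands per directed region pair per slot. A minimal sketch, where the function name and the order format are illustrative assumptions:

```python
import numpy as np

def build_snapshots(orders, n_nodes, n_slots):
    """orders: iterable of (time_slot, origin, destination) trip records.
    Returns a list of (n_nodes, n_nodes) matrices where A[t][i, j] counts
    the passenger demands from region i to region j in slot t."""
    A = [np.zeros((n_nodes, n_nodes), dtype=int) for _ in range(n_slots)]
    for t, i, j in orders:
        A[t][i, j] += 1  # directed, weighted edge i -> j
    return A
```

A zero entry corresponds to a non-existent edge, so the matrices are typically sparse.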
Here, we formulate the passenger mobility prediction problem as follows:
Passenger Mobility Prediction. For a fixed region node set , given all DDW snapshot graphs in the past time slots and the geographical relationship R among nodes, we define passenger mobility prediction as a DDW graph prediction problem, which aims to predict the DDW snapshot graph in the next time slot.
Notation  Description 
a time slot  
the total number of time slots in a sequence  
a node set  
a node  
the matrix of all nodes’ feature vectors at 

the feature vector of node at time slot ,  
R  an adjacency matrix of nodes’ geographical relationship 
the geographical distance between node and ,  
an adjacency matrix of DDW graphs at time slot  
the passenger demand between node and ,  
the sets of node ’s forward, backward and geographical neighbors’ indexes at time slot  
the threshold of the distance to determine the size of ’s geographical neighborhood  
the matrix of all nodes’ representation vectors  
node ’s representation vector at time slot ,  
, and  attentive weights between three different neighbors of and 
, ,  pre-weights of different kinds of neighbors 
the number of historical time slots considered in each channel  
a channel’s sequence of ,  
the aggregated representation of the sequence  
the final representation of time slot  
the matrix that stores n nodes’ features at time slot  
,  the outbound passenger demands and the snapshot of DDW graph at time slot 
,  the predicted results of and 
,  the elements of and 
the transferring probability between node and  
loss functions  
the weights of loss functions  
the weights to learn  
the dimensions of feature vectors during , embedding layers and feature vectors at time slots  
the weight matrices to learn 
In this section, we present the details of our model Gallat, a spatiotemporal attention network for passenger demand prediction. Figure 1 depicts the overview of our proposed model. With the DDW graph sequence and the geographical relationship R as the model input, the feature extraction module generates the feature vector for node at the th time slot. Then, we define a spatial attention layer to learn a spatial representation (i.e., embedding) for every node at time by aggregating information from nodes within three distinct types of spatial neighborhoods. Afterwards, all nodes' representations are fed into a temporal attention layer, which updates each node's current embedding with the captured temporal dependencies among its historical embeddings. In our transferring attention layer, we first calculate the total number of outbound passenger demands departing from each region and the transferring probability between every pair of nodes with the current embeddings, and then use the resulting probabilities to map each region's total passenger demands to the corresponding destination regions, which together compose the information in the next DDW snapshot graph .

Given a region node , we first construct its feature vector by merging relevant information from multiple sources. Specifically, the feature vector for node at time is the concatenation of feature embeddings from all feature fields (e.g., weather, day of the week, etc.):
(1) 
where is the total number of feature fields, and denotes the concatenation operation. Note that can be either a dense embedding vector for categorical features (e.g., node ID) or a real-valued number for continuous features (e.g., temperature). As such, for each snapshot DDW graph , we can obtain the features of all nodes to support subsequent computations. In our experiments, a node 's feature vector is the concatenation of its row, column, out-degree and in-degree in the graph , as well as the embeddings of the node ID, the time slot and the corresponding day of the week.
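The field-wise concatenation in Eq.(1) can be sketched as follows. The embedding tables, dimensions, and field choices here are illustrative; in the real model the embedding tables are learned jointly with the network rather than fixed random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, d_emb = 400, 8
node_id_emb = rng.normal(size=(n_nodes, d_emb))  # stand-ins for learnable tables
slot_emb    = rng.normal(size=(24, d_emb))
dow_emb     = rng.normal(size=(7, d_emb))

def node_features(node_id, slot, dow, out_deg, in_deg, row, col):
    """Eq.(1): concatenate dense embeddings for categorical fields with
    real-valued features (degrees and grid coordinates)."""
    return np.concatenate([
        node_id_emb[node_id], slot_emb[slot], dow_emb[dow],
        [float(out_deg), float(in_deg), float(row), float(col)],
    ])
```

The resulting vector has a fixed dimensionality (here 3 * 8 + 4 = 28), so all nodes' features at a slot stack into one matrix.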
In this section, inspired by the inductive graph representation learning approach (Hamilton et al., 2017), we learn a latent representation for node at time by effectively aggregating the features of its neighbors. Different from (Hamilton et al., 2017), which only focuses on a single type of neighbors, we define three types of neighbors in DDW graphs, namely forward neighbors and backward neighbors based on passenger mobility, and geographical neighbors based on physical locations. We compute statistics on one month of real-world data from Beijing. As shown in Figure 2, we summarize the mean values for a non-peak time slot (3:00 am) and a peak one (9:00 am). It can be seen that the forward neighbors and backward neighbors exhibit different distributions in the same time slot, and the same type of neighbors also shows different distributions at different time slots. Hence, it is meaningful to distinguish the forward and backward neighbors. We first give the definitions of the forward and backward neighborhoods of each node as follows.
If there is at least one demand starting from region node and ending at , i.e., , then is a forward neighbor of . For node , its forward neighborhood at an arbitrary time slot is a set of node indexes defined as:
(2) 
Similarly, if there is at least one demand starting from region node and ending at , then is a backward neighbor of . For node , at an arbitrary time slot , we can obtain a set of its backward neighbors’ indexes via:
(3) 
According to Eq.(2) and Eq.(3), it is worth mentioning that the numbers of different nodes' forward and backward neighbors are asymmetrical and time-dependent. The rationale for defining forward and backward neighborhoods is that the characteristics of each region are not only determined by its intrinsic features, but also affected by its interactions with other regions. Intuitively, if more trip orders are observed between nodes and at time , they are more likely to possess a higher semantic affinity in that time slot. Hence, by propagating the representations of node 's neighbors to itself, the properties of closely related nodes can complement its original representation, thus producing an updated embedding of node with enriched contexts. Moreover, when modelling passenger demands as DDW graphs, the direction of an edge carries crucial information about different mobility patterns, and indicates varied functionalities of regions. For example, it is common for a residential area to have many people travelling to the central business district for work during the morning rush hours, while in the evening, when people return home from work, a large passenger flow may be observed in the reverse direction. Hence, for two linked nodes, unlike (Wang et al., 2019b), which treats the passenger demands from both directions equally, we define the forward and backward neighbors to distinguish their semantics and allow them to contribute differently to the resulting node embedding.
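Given a snapshot adjacency matrix, the forward and backward neighborhoods of Eq.(2) and Eq.(3) reduce to row and column lookups; a minimal sketch (the function names are ours):

```python
import numpy as np

def forward_neighbors(A_t, i):
    """Eq.(2): nodes j with at least one demand i -> j at slot t (row of A_t)."""
    return set(np.nonzero(A_t[i, :])[0])

def backward_neighbors(A_t, i):
    """Eq.(3): nodes j with at least one demand j -> i at slot t (column of A_t)."""
    return set(np.nonzero(A_t[:, i])[0])
```

Because the matrices are directed and time-varying, the two sets generally differ from each other and change from slot to slot.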
Apart from that, we also capture the information from geographically connected regions in the spatial attention layer. Geographical neighbors of region node are nodes that are physically close to it, which are constant across time slots.
As Definition 2.2 states, the nodes in our DDW graphs are manuallydivided and nonoverlapping grids with all pairwise geographical relationship R. For a node , its static geographical neighborhood is formulated as:
(4) 
where is a threshold of the distance that determines the size of 's neighborhood. A node's geographical neighbors are important for its embedding. Intuitively, being the geographical neighbors of , the nodes in are more likely to share similar intrinsic properties with , thus leading to close distributions of passenger demands. For example, if nodes and are adjacent and located in a sparsely populated suburb, then both of them are likely to have fewer demands. The geographical neighbors also help alleviate the data sparsity problem. For example, if a node has very few forward and backward neighbors at a specific time , the features of its geographical neighbors become a key supplementary information resource to ensure the learning of a discriminative node representation.
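The static geographical neighborhood in Eq.(4) is a simple thresholding of the precomputed distance matrix R; a sketch with an illustrative function name:

```python
def geo_neighbors(R, i, epsilon):
    """Eq.(4): indexes of nodes within distance threshold epsilon of node i,
    excluding i itself. R is the (n, n) pairwise distance matrix."""
    return {j for j in range(len(R)) if j != i and R[i][j] <= epsilon}
```

Unlike the forward and backward neighborhoods, this set is computed once and stays constant across time slots.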
It is worth mentioning that these three kinds of neighborhoods carry different information from both semantic and geographical perspectives. They need to be transferred to the next step of the model separately, so we utilize an attention-based aggregator to gather information within each kind of neighborhood and concatenate their aggregated results in the following steps.
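Such per-neighborhood attentive aggregation can be sketched as follows for a single neighborhood; this is a simplified stand-in for the full formulation detailed below (the pre-weights, LeakyReLU scoring and parameter shapes here are illustrative assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(x_i, neighbor_feats, pre_w, W, a):
    """One-neighborhood sketch: pre-weight each neighbor's projected feature,
    score it against the target node with a shared attention vector `a`
    (GAT-style, LeakyReLU), normalise the scores with softmax, and return
    the attention-weighted sum of neighbor values."""
    if not neighbor_feats:
        return np.zeros(W.shape[0])  # no observed neighbors at this slot
    h_i = W @ x_i
    scores, values = [], []
    for x_j, w_j in zip(neighbor_feats, pre_w):
        h_j = w_j * (W @ x_j)                  # pre-weighted neighbor embedding
        s = a @ np.concatenate([h_i, h_j])     # shared attention network att(.)
        scores.append(np.maximum(0.2 * s, s))  # LeakyReLU non-linearity
        values.append(h_j)
    alpha = softmax(np.array(scores))          # attentive weights per neighbor
    return alpha @ np.array(values)            # weighted sum over the neighborhood
```

In the model this is run once per neighborhood type (forward, backward, geographical) and the three results are concatenated.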
With three types of node neighborhoods defined, we devise an attentionbased aggregator to merge the respective node information within , and . Taking the feature vectors of all nodes in the neighbor sets, i.e., , and as the input, the attentionbased aggregator fuses them into a unified vector representation for each node at time . Before detailing our attentionbased aggregator for node embedding, we first briefly introduce the naive form of aggregator introduced in (Hamilton et al., 2017), which is originally designed for static graphs without any direction or weight information on edges:
(5) 
where represents the resulting embedding vector of node , W is the weight matrix to learn, and is the non-linear activation function. To aggregate the information passed from neighbor nodes using the aggregation function , GraphSAGE (Hamilton et al., 2017) creates a paradigm that allows the model to sample and aggregate a fixed number of neighbors . However, GraphSAGE uniformly samples a relatively small number of neighbors, which treats all neighbor nodes evenly and neglects their varied importance. Also, for a popular region, only sampling a small subset of its neighbors will lead to severe information loss. Consequently, the learned embedding of will likely fail to pay sufficient attention to closely related neighbors, and will be vulnerable to noise from semantically irrelevant nodes. In addition, as (Hamilton et al., 2017) simply assumes there is only one homogeneous type of neighborhood relationship in a graph (i.e., ), Eq.(5) is unable to simultaneously handle information within the heterogeneous neighborhoods we have defined for passenger demand prediction. In light of this, we propose an attention-based aggregation scheme to discriminatively select important nodes via learned attentive weights, which also extends Eq.(5) to our three heterogeneous neighborhood sets:
(6) 
where is a shared weight matrix that projects all feature vectors onto the same dimensional embedding space. Notably, the specific function we adopt is the weighted sum of node information within , , and , respectively. , and are the attentive weights between nodes and in the corresponding neighborhoods. In this attention layer, we focus on mining the fine-grained pairwise importance of each neighbor node to the target node. Hence, in Eq.(7) we employ the self-attention calculation of GAT (Veličković et al., 2017), which is designed for graph node representation learning and learns a more expressive representation for each node. To compute the attentive weights, we first define a shared attention network denoted by . produces a scalar that quantifies the semantic affinity between nodes and using their features. Taking an arbitrary node pair at time as an example, is calculated as follows:
(7) 
where represents the function that applies non-linearity, is the learnable weight matrix, and is the projection weight that maps the concatenated vector to a scalar output. Then, we compute , and by applying the softmax function (Mnih and Hinton, 2009) to normalize all the attention scores between and its forward, backward, and geographical neighbors:
(8) 
which enforces
and thus can be viewed as three probability distributions over the corresponding type of neighborhoods. The attention network is shared across the computations for all three neighborhoods. In Eq.(8), it is worth noting that, before being fed into the attention network, every neighbor node of is weighted by a factor , and for the forward, backward, and geographical neighborhood, respectively. Next, we explain the rationale of involving these weights in the computation of Eq.(8) and present the details of the three pre-weighted functions for generating the weights , and .

The core idea behind our pre-weighted functions is to help the model sense the sparsity of the data in a timely manner and to provide additional prior knowledge for the subsequent attention-based information aggregation. This is achieved by taking advantage of the observed weights on each DDW graph and the geographical relationship R. Given the target node , for any of its forward, backward or geographical neighbors in , or , we derive three statistics-driven pre-weighted functions to compute the corresponding weight for , i.e., , or :
(9) 
where is a small additive term in case the denominator is (i.e., or in highly sparse data). As suggested by Eq.(9), the weights and reflect 's intensity of passenger demands at time . Therefore, the attention weights and obtained in Eq.(8) are not only dependent on the semantic similarity between node features and , but also on the real-time popularity of neighbor region node at time . Besides, motivated by the harmonic mean, the geographical weighting factor essentially assigns larger weights to region nodes that are geographically closer to the target node . As such, by coupling the pre-weighted functions with our attention-based aggregator, the embedding generated with Eq.(6) is an expressive blend of a node's inner properties and the characteristics of its three distinct neighborhoods.

So far, we can obtain a set of embeddings for all regions in each snapshot DDW graph . Specifically, for each time slot , we use a feature matrix to vertically stack all node embeddings at time . Hence, for the DDW graph sequence , we can obtain time-varying feature matrices to represent the DDW graphs at corresponding time slots. For the current time slot , only carries the spatial information within the th DDW graph. To account for the dynamics within our DDW graphs, we develop a temporal attention layer that first captures the sequential dependencies among the learned representations in , and then generates a spatiotemporal representation for predicting the passenger demand in the next time slot. Obviously, a straightforward approach is to gather information from the most recent consecutive DDW graphs, i.e., . However, in real-life scenarios, for the DDW graph at time , only DDW graphs from time slots that are temporally close to will exhibit similar characteristics. In contrast, if there is a relatively big time gap between two DDW graphs, their characteristics will vary significantly, e.g., the traffic flow in the central business district will be much lower during midnight than in the morning. As a result, merely using consecutive time slots can introduce a large amount of noise when learning the spatiotemporal DDW graph representation .
To this end, as Figure 4 shows, we design a multi-channel structure to capture the temporal patterns among different DDW graphs. To enhance the capability of learning useful information from historical DDW graphs, we infuse the periodicity of passenger demands into the temporal attention layer. Specifically, apart from the DDW graph sequence , we derive three periodical sequences to augment the long-term temporal information about the DDW graph at time . First, we collect historical DDW graphs from the same time slot of each day. For example, if we divide each day into 24 one-hour slots, then we can collect DDW graphs from the same 8:00–8:59 am slot on consecutive days. Mathematically, we represent such a sequence as , where is the number of time slots in a day ( in our case), and . Similarly, to leverage the close contexts in directly adjacent time slots, we consider two periodical sequences for 's prior and subsequent time slots (i.e., and ), which results in and , respectively. The non-periodical sequence is also used to capture short-term passenger demand fluctuations.
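The four channel sequences can be indexed as in the following sketch; the exact offset conventions (e.g., whether the current slot itself is included in the recent channel) are our assumptions rather than the paper's specification:

```python
def channel_indices(t, q, delta=24):
    """Slot indices feeding the four channels when predicting slot t+1:
    the same daily slot over past days, the slots just before/after that
    daily slot, and the q most recent consecutive slots."""
    daily  = [t - k * delta for k in range(1, q + 1)]      # same slot, previous days
    before = [t - k * delta - 1 for k in range(1, q + 1)]  # prior adjacent slot
    after  = [t - k * delta + 1 for k in range(1, q + 1)]  # subsequent adjacent slot
    recent = [t - k for k in range(q)]                     # consecutive recent history
    return daily, before, after, recent
```

For hourly slots, delta = 24 so the periodic channels look exactly one day apart per step.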
Correspondingly, we build our temporal attention layer with four channels to attentively aggregate the information within sequences , , , and :
(10) 
where represents the row-wise softmax function. , and are the query, key and value weight matrices dedicated to each channel . is the feature matrix that stores nodes' features at time . Each feature is generated with the same process as in Eq.(1). Note that, because some features for time slot might be unavailable at the current time (e.g., the number of trip orders), and are hence excluded. Specifically, we formulate Eq.(10) with the notion of scaled dot-product attention (Vaswani et al., 2017), which is different from the self-attention in Eq.(7). In this attention layer, we are more concerned with capturing the graph-level associations in the time domain. To more efficiently and effectively compute an attentive feature matrix for the current graph, we adopt the scaled dot-product attention in Eq.(10). The rationale is that the row-wise softmax first produces an attention matrix A, where the th row is a probability distribution indicating the affinity between region at and each region at . Then, by multiplying A with the projected feature matrix , we can obtain an updated representation for each time slot by selectively focusing on regions that are more similar to the contexts of region at . By taking the sum of the representations for all time slots, Eq.(10) generates a temporal representation for each channel, denoted by , , and .
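A minimal sketch of one such channel with scaled dot-product attention (shapes and names are illustrative; the real layer uses per-channel learned parameters):

```python
import numpy as np

def channel_attention(H_seq, x_next, Wq, Wk, Wv):
    """Sketch of Eq.(10) for one channel: attend the next-slot feature matrix
    x_next (n, d_x) over each historical embedding matrix H in H_seq (each
    (n, d_h)), then sum the attended representations over time slots."""
    d = Wk.shape[1]
    out = np.zeros((x_next.shape[0], Wv.shape[1]))
    for H in H_seq:                                    # one matrix per time slot
        A = (x_next @ Wq) @ (H @ Wk).T / np.sqrt(d)    # scaled dot-product scores
        A = np.exp(A - A.max(axis=1, keepdims=True))
        A = A / A.sum(axis=1, keepdims=True)           # row-wise softmax
        out += A @ (H @ Wv)                            # attended values, summed over t
    return out
```

Each row of A is a probability distribution over regions at the historical slot, matching the row-wise normalization described above.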
After obtaining the channel-wise representations, we merge all information into a unified spatiotemporal representation by sharing another self-attention unit across all four channels:
(11) 
with weights and to learn. As such, the resulting matrix is the final representation learned from all DDW graphs in . Essentially, now encodes both the spatial and temporal information contained in DDW graphs up to time , which provides strong predictive signals for estimating the upcoming passenger demands.
With the spatiotemporal representation , we deploy a feed-forward layer to first compute an -dimensional vector , where each element represents the total amount of outbound passenger demands (i.e., the out-degree of node ) at the next time slot :
(12) 
with weight and bias to learn. Intuitively, we derive to first capture the general trend and intensity of trip demands in each region, and then distribute the total demands to different destinations in a fine-grained way. Meanwhile, it can also support traditional destination-unaware passenger demand prediction tasks (Tong et al., 2017; Wang et al., 2019a). We denote this task as the Demand task in this paper. In the following experiments, we conduct pretraining on this task before the formal training process so that we can obtain accurate results more quickly.
To map the total passenger demands from region node to all nodes, we calculate a transferring probability distribution to indicate the likelihood of observing a passenger trip from to each destination region at the next time slot . Specifically, as the th row is a row vector carrying the spatiotemporal representation of node , we calculate each probability via the following attention mechanism:
(13) 
where the has the same structure as in Eq.(8) but uses a different set of parameters and . Finally, we can estimate every element in the next DDW graph (this prediction is denoted as the OD task):
(14) 
Note that we only consider the start time of each passenger demand; that is, how many trip requests will be generated between two nodes in time slot , regardless of whether the trips are finished within .
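Combining Eq.(12) and Eq.(13), the OD prediction in Eq.(14) is an element-wise scaling of each region's predicted outbound demand by its transfer probabilities; a minimal sketch:

```python
import numpy as np

def predict_od(d_hat, P):
    """Eq.(14) sketch: distribute each region's predicted outbound demand
    d_hat (n,) across destinations via transfer probabilities P (n, n),
    whose rows each sum to 1. Returns the predicted (n, n) OD matrix."""
    return d_hat[:, None] * P
```

By construction, each row of the predicted OD matrix sums back to the region's total outbound demand, so the Demand task and the OD task stay consistent.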
We formulate the overall loss function as follows:
(15) 
where (Ross, 2015) is a variant of the mean-squared-error loss which uses a squared term if the absolute element-wise error falls below and an term otherwise; and are two hyperparameters balancing the importance of the two tasks. The motivation of defining is to push our Gallat model to generate accurate predictions on both the overall demands () and the origin-destination demands (). In addition, we first pretrain our model on the Demand task's loss function. Then, with a relatively accurate prediction of , we train the model for the further prediction of (the OD task) based on the pretrained model. All parameters are optimized with the Stochastic Gradient Descent (SGD) method. Specifically, we use Adam (Kingma and Ba, 2014), a variant of SGD, to optimize the parameters in our model.

In this section, we analyze both the time and space complexity of Gallat.
Setting aside the inexpensive concatenation and weighted-sum operations, the major computational cost of Gallat comes from the attention mechanisms used in our spatial, temporal, and transferring layers. For every node feature pair , it takes time to compute a scalar attention score in Eq.(7). As there are possible combinations for nodes, the total time complexity is for the computation in Eq.(8). At the same time, the self-attention modules in the temporal attention layer (i.e., Eq.(10) and Eq.(11)) take time to calculate. Similar to the spatial attention layer, Eq.(13) has a time complexity of . As , , and are typically small (see Section 4), the predominant factor in Gallat's time complexity is the total number of regions . Also, as , , and are fixed in our model, the time complexity of Gallat is linearly associated with the scale of the data.
The trainable parameters of the spatial, temporal, and transferring attention layers are , , and , respectively. This results in a total parameter size of . Hence, the dominating term in the parameter size of Gallat is , which has a space complexity of .
In this section, we conduct experiments on real-world datasets to showcase the advantages of Gallat in passenger demand prediction tasks. In particular, we aim to answer the following research questions via the experiments:
How effectively does Gallat work on passenger demand prediction tasks?
How does Gallat benefit from each component of the proposed model structure?
How do the major hyperparameters affect the prediction performance of Gallat?
How well does Gallat scale as the number of time slots/grids increases?
How effective is the pretraining process on the Demand task?
Can Gallat learn useful mobility patterns from real data?
dataset  Beijing  Shanghai 
time span  4 months  4 months 
total area  km  km 
grid granularity  km  km 
time slot granularity  1 hour  1 hour 
We conduct experiments on two real-world datasets generated by Didi, both of which are desensitized. Some similar datasets are publicly available at https://outreach.didichuxing.com/research/opendata/. Table 2 summarizes the characteristics of the two datasets. The first dataset is collected in Beijing, covering the area within the 6th Ring Road. The second dataset covers the urban area of Shanghai. Both datasets are collected from June to September 2019. We divide Beijing and Shanghai into 400 grids based on the granularities shown in Table 2, as the average time for a car to travel such a distance is 5 minutes, which is a reasonable waiting time for passengers (Wei et al., 2016). The DDW graphs on both datasets are constructed with a 1-hour granularity. As the prediction results for passenger demands are mainly used as a reference for vehicle dispatching, a one-hour granularity provides enough time for the dispatching strategy to be operated in advance.
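For illustration, mapping a GPS point to one of the 400 grid cells might look like the following sketch. All constants here (the 1.5 km cell size, the 20×20 layout, and the reference corner near Beijing) are hypothetical placeholders, not values from the paper, and the degree-to-kilometer conversion is the usual rough approximation.

```python
import math

def coord_to_grid(lat, lon, lat_min=39.75, lon_min=116.20,
                  cell_km=1.5, n_rows=20, n_cols=20):
    """Map a (lat, lon) point to a cell ID in an n_rows x n_cols grid.
    Uses ~111 km per degree of latitude; longitude degrees shrink
    by cos(latitude). All constants are illustrative only."""
    km_per_deg_lat = 111.0
    km_per_deg_lon = 111.0 * math.cos(math.radians(lat_min))
    row = int((lat - lat_min) * km_per_deg_lat // cell_km)
    col = int((lon - lon_min) * km_per_deg_lon // cell_km)
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        return None  # point falls outside the covered area
    return row * n_cols + col
```

Each trip record's pickup and drop-off points would then index the origin and destination nodes of the DDW graph for the corresponding 1-hour time slot.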
To evaluate the performance of our model Gallat on demand prediction, we compare it with the following baseline methods.
HA: We adopt History Average for passenger demand prediction, predicting the mean of the historical demands observed at the same time slot of the day and the same day of the week.
LSTNet: LSTNet (Lai et al., 2018) is a state-of-the-art time series prediction model, which combines both LSTM and CNN for spatiotemporal feature modelling.
GCRN: The recently proposed GCRN (Seo et al., 2018) combines GCN with RNN to jointly identify spatial correlations and dynamic patterns.
GEML: GEML (Wang et al., 2019b) employs graph embedding in the spatial perspective and an LSTM-based multi-task learning architecture in the temporal perspective to predict the passenger demands from one region to another.
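The HA baseline above can be sketched as follows (a minimal version; keying on (day-of-week, hour) pairs is our reading of "the same time slot of the day and the same day of the week"):

```python
from collections import defaultdict

def history_average(records):
    """History Average: predict the mean demand observed at the same
    (day_of_week, hour) slot across all historical records.
    `records` is an iterable of (day_of_week, hour, demand) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for dow, hour, demand in records:
        sums[(dow, hour)] += demand
        counts[(dow, hour)] += 1
    # one prediction per observed (day_of_week, hour) slot
    return {slot: sums[slot] / counts[slot] for slot in sums}
```

Despite its simplicity, HA is a standard sanity-check baseline: any learned model should at least beat the per-slot historical mean.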
We conduct experiments on two passenger demand prediction settings, i.e., origin-destination demand prediction (denoted by “OD”) and origin-only demand prediction (denoted by “Demand”), which correspond to our model outputs and , respectively. We measure the prediction accuracy with the Mean Absolute Percentage Error (MAPE) and the Mean Absolute Error (MAE), which have been widely used to evaluate model performance in regression tasks:
(16) 
where is the total number of instances, represents the predicted result, and represents the ground truth. In real-life applications, ride-hailing platforms are more concerned about the areas with more passenger demands. Regions with almost no passenger demand are less profitable and receive less attention. As a result, the platforms only calculate the metrics for those records whose values are above certain thresholds. In our experiments, we select three thresholds, i.e., 0, 3, and 5, to calculate the metrics, which are termed MAPE0, MAPE3, MAPE5 and MAE0, MAE3, MAE5, respectively. These metrics are also currently used to evaluate demand prediction on the Didi platform.
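Following Eq.(16), the thresholded variants of the two metrics can be computed as in this sketch (our reading of the protocol: records whose ground truth does not exceed the threshold are simply excluded before averaging):

```python
def thresholded_metrics(preds, truths, threshold=0):
    """MAPE and MAE restricted to records whose ground truth exceeds
    `threshold` (e.g. threshold=5 gives MAPE5/MAE5). Returns
    (mape, mae), or (None, None) if no record survives the filter."""
    pairs = [(p, t) for p, t in zip(preds, truths) if t > threshold]
    if not pairs:
        return None, None
    mape = sum(abs(p - t) / t for p, t in pairs) / len(pairs)
    mae = sum(abs(p - t) for p, t in pairs) / len(pairs)
    return mape, mae
```

Note that filtering with a strict `t > threshold` also sidesteps division by zero in MAPE0, since zero-demand records never enter the average.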
Task  Method  Beijing  
MAPE0  MAPE3  MAPE5  MAE0  MAE3  MAE5  
OD  HA  2.7454  3.0059  3.1332  13.3953  39.7657  54.0634 
LSTNet  2.9443  4.3750  6.7874  14.8374  42.4918  95.2971  
GCRN  0.8347  0.9549  0.9693  5.0278  15.7346  23.3819  
GEML  0.8736  0.9244  0.9832  5.4396  12.6831  24.9918  
Gallat  0.7283  0.8465  0.8896  2.6781  3.4139  6.2347  
Demand  HA  4.1315  3.7744  3.7077  512.7970  564.1135  593.6073 
LSTNet  12.5007  7.0283  5.3786  1329.3238  1589.1176  1608.4531  
GCRN  10.1060  4.1456  3.1136  127.3906  134.4935  148.2324  
GEML  0.8710  0.7671  0.7232  24.7852  29.3469  32.5521  
Gallat  0.6902  0.3904  0.3613  17.8332  20.0048  23.9004 
Task  Method  Shanghai  
MAPE0  MAPE3  MAPE5  MAE0  MAE3  MAE5  
OD  HA  2.6416  2.8522  2.9670  18.5351  53.3798  71.8696 
LSTNet  3.8422  4.6471  7.7677  28.1293  87.2351  112.3874  
GCRN  0.8714  0.9678  0.9783  4.1634  12.2219  16.4002  
GEML  0.8922  0.9210  0.9792  6.0167  12.3469  15.9975  
Gallat  0.6813  0.8752  0.9143  3.6138  5.3497  8.6622  
Demand  HA  3.7065  3.4005  3.3711  414.8688  489.3267  522.7588 
LSTNet  15.9938  5.0623  4.6083  625.9973  688.1231  745.8227  
GCRN  8.1710  3.3146  2.4972  143.7841  149.9170  153.6679  
GEML  1.0220  0.7297  0.6832  37.4469  41.0042  49.6711  
Gallat  0.6899  0.4109  0.3815  21.3526  24.7359  29.1875 
In the experiments, we leave out the last two weeks of each dataset as the test set and use the rest as the training set. The last 10% of the training set is used for validation. We implement Gallat with PyTorch 1.5.0 on Python 3.8. When training Gallat, we use the Demand task for pretraining first, which can relieve the data sparsity problem and speed up the training process. The default values of batch size, epochs, embedding dimension
, historical time slots , and loss weights are set as , , , and .

Table 3 and Table 4 show the results of the state-of-the-art methods and Gallat under MAPE and MAE above thresholds on the test set. From the comparison with the other models, we make the following observations:
The results of LSTNet and GCRN on the Demand task are even worse than HA, while they do better on the OD task. GEML and Gallat show stable performance on both tasks. This may be because LSTNet and GCRN focus on a single task in their model structures, whereas GEML and Gallat both involve the two tasks in their designs.
Methods tailored for graph-structured data (GCRN, GEML, Gallat) achieve better overall performance on the OD task. This suggests that modeling passenger mobility prediction as a graph-based problem is a better choice, as the complicated interactions between nodes can be fully captured. Among them, Gallat considers the graph representation more thoroughly, which tends to be the main reason that Gallat outperforms the other spatiotemporal models.
On the Demand task, as the demand threshold increases, the MAPE decreases while the MAE keeps increasing. One possible reason is that the scale of the prediction targets in the Demand task is larger than that in the OD task. Thus, when we enlarge the threshold, the ground truth, i.e., the denominator of MAPE in Eq.(16), increases more significantly than the absolute error in the numerator. Furthermore, Gallat's advantage over the baselines is larger on MAPE5, which demonstrates that our model is highly accurate in predicting the passenger flow for popular regions.
Task  Method  Beijing  
MAPE0  MAPE3  MAPE5  MAE0  MAE3  MAE5  
OD  GallatS1  0.7728  0.9082  0.9347  3.8855  11.6697  15.7926 
GallatS2  0.7497  0.8689  0.9050  3.4523  10.2289  13.8688  
GallatS3  0.9534  0.8689  0.9049  3.5206  10.4229  14.1817  
GallatS4  0.7421  0.8846  0.9135  3.4258  10.2555  13.8887  
GallatS5  0.7598  0.7712  0.9462  3.3208  8.7657  11.6541  
Gallat  0.7283  0.8465  0.8896  2.6781  3.4139  6.2347  
Demand  GallatS1  2.2931  1.1786  0.9723  52.5624  55.6956  57.5813 
GallatS2  0.7637  0.4291  0.3641  25.0391  26.4881  27.4193  
GallatS3  1.0001  0.5030  0.4109  26.5299  27.7005  28.4539  
GallatS4  1.0748  0.4858  0.3925  24.7973  25.8946  26.6563  
GallatS5  0.9714  0.4730  0.3810  24.3334  25.4282  26.1744  
Gallat  0.6902  0.3904  0.3613  17.8332  20.0048  23.9004 
Task  Method  Shanghai  
MAPE0  MAPE3  MAPE5  MAE0  MAE3  MAE5  
OD  GallatS1  0.8635  0.8945  0.9194  5.6163  16.0487  21.5877 
GallatS2  0.8458  0.8843  0.9093  5.1938  14.6175  19.5689  
GallatS3  0.8673  0.9164  1.1901  5.3672  15.0821  20.1670  
GallatS4  0.8450  0.8841  0.9091  5.1041  14.3365  19.2251  
GallatS5  0.8338  0.8780  0.9056  5.1784  12.8859  16.5776  
Gallat  0.6813  0.8752  0.9143  3.6138  5.3497  8.6622  
Demand  GallatS1  4.8831  1.5453  1.1307  64.4367  70.1194  72.9258 
GallatS2  1.0183  0.4603  0.3899  27.3659  30.2024  31.4659  
GallatS3  1.3682  0.5138  0.4076  27.7591  31.3426  32.9075  
GallatS4  1.0572  0.4402  0.3614  23.3665  26.2928  27.6677  
GallatS5  0.9371  0.4248  0.3605  23.3078  26.0122  27.2486  
Gallat  0.6899  0.4109  0.3815  21.3526  24.7359  29.1875 
To validate the performance gain from each component of our model, we conduct an ablation study in which we change one component of Gallat at a time to form a variant model. We implement the following variants of Gallat:
GallatS1: We use the existing method Graph Attention Networks (GATs) (Veličković et al., 2017) to replace the spatial attention layer; GAT does not distinguish forward from backward neighbors and ignores geographical neighbors.
GallatS2: We treat forward and backward neighbors as semantic neighbors like (Wang et al., 2019b) in the spatial attention layer.
GallatS3: We replace the attentionbased aggregator in the spatial attention layer with the default mean aggregator as used in (Hamilton et al., 2017).
GallatS4: We use a mean aggregator to replace the dot product attention in the temporal attention layer.
GallatS5: We replace the transferring attention layer with a simple dense layer.
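To make the contrast between GallatS3's mean aggregator and the full model's attention-based aggregator concrete, the two aggregation styles can be sketched as follows (heavily simplified: the real layer also applies learned projections and the pre-weighting functions, which we omit here, and dot-product scoring stands in for the learned attention):

```python
import math

def mean_aggregate(neighbor_feats):
    """GallatS3-style aggregation: every neighbor contributes equally."""
    n, d = len(neighbor_feats), len(neighbor_feats[0])
    return [sum(f[k] for f in neighbor_feats) / n for k in range(d)]

def attention_aggregate(center, neighbor_feats):
    """Attention-style aggregation: neighbors whose features are more
    similar to the center node (by dot-product score) receive larger
    softmax weights in the weighted sum."""
    scores = [sum(c * f for c, f in zip(center, feat))
              for feat in neighbor_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    d = len(center)
    return [sum(w * feat[k] for w, feat in zip(weights, neighbor_feats))
            for k in range(d)]
```

The mean aggregator discards which neighbors matter to a given node, while the attention aggregator lets each node weight its neighbors differently; the GallatS3 rows in Tables 5 and 6 quantify what that difference is worth.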
By comparing Gallat with its different variants, a few observations can be made from Table 5 and Table 6:
GallatS1 clearly has the worst overall performance, especially on the Demand task, which may indicate that the overall design of our spatial attention layer plays an important role in the performance of our model. The second worst is GallatS3, which shows that our attention-based aggregator in the spatial attention layer is significantly more effective than the simple mean aggregator.
The overall performance of GallatS2 is better than that of GallatS3, especially on the Shanghai dataset, indicating that separating the aggregation of forward and backward neighbors is necessary for providing more context for the prediction.
The performance of GallatS4 is similarly inferior to that of GallatS2. Hence, we can tell that the attention mechanisms in both the temporal attention layer and the transferring attention layer help Gallat selectively learn useful patterns for passenger demand prediction.
GallatS5 shows relatively good results among the variants, which may indicate that a traditional dense layer has some advantages in learning transferring probabilities. However, GallatS5 is very unstable compared with the other methods.
In this section, we discuss three important hyperparameters, namely the weights of the task losses, i.e., and , and the number of historical time slots in each channel. As hyperparameters are closely related to the performance of the model, we conduct experiments by varying their settings and record the resulting prediction performance. In what follows, we discuss the impact of these hyperparameters.
As shown in Figure 5, we adjust the values of the loss weight pair on both the OD and Demand tasks. From the figure, we can observe that:
Under the same loss weights, the results on the two datasets show different patterns. On the Beijing dataset, the model achieves better overall performance with . On the Shanghai dataset, weights around () show more advantages. We suppose this is related to intrinsic differences between the two datasets, such as the scale and sparsity of the data. For instance, the sparsity of the Beijing dataset is more severe than Shanghai's, so the accurate prediction of OD passenger demands relies more on the accuracy of the Demand task, and the model needs a relatively larger .
The Demand task is more sensitive to the loss weights than the OD task, especially on the Shanghai dataset, which is possibly caused by the different demand distributions of the two datasets.
Figure 6 depicts how the model's performance varies under different numbers of historical time slots considered in each channel. Our findings are as follows:
The results show that the performance of the model does not always improve as increases. Also, an obvious trend in Figure 6 is that when we set to a periodic value, such as the number of days in a week (i.e., , , etc.), the model performs better overall.
To verify the model's scalability and the effectiveness of pretraining, we conduct several groups of experiments on both datasets; the results are shown in Figure 7. In each group of experiments, all parameters except the varied one are set to the default values following Section 4.3.
Figure 7 and 7 describe the training time cost. In the first group of experiments, we set the number of grids to , while in the second one, we utilize months' data on both datasets. This part is summarized as follows:
As the number of time slots and grids increases, the time cost of model training grows almost linearly, especially with the growing number of time slots.
The time cost curve with an increasing number of grids is less strictly linear. A possible reason is that the number of grids affects not only the data size but also the number of neighbors of each node in the spatial part, which has a more complex impact on the training time than simply increasing the data size.
Figure 7 and 7 show the validation loss curves with and without pretraining. We can see from the figures:
The initial loss of the training process with pretraining is much lower than that without pretraining, which means that the pretraining process helps fit the model better on both tasks.
On both datasets, we can see that pretraining offers faster convergence and a lower final loss, indicating that it is beneficial for our model.
To better understand the latent patterns learned by Gallat, we use Figure 8 to visualize part of the passenger demand patterns of the most popular regions in Shanghai predicted by Gallat during three different time slots. The red circle marks a region node on the map with its ID. The red rectangles and the arrows illustrate how many passengers transfer from one node to another. The first value in each rectangle is the predicted result of our model, while the one in brackets is the corresponding ground truth. From Figure 8, we can draw the following observations:
Figure 8(a) depicts the transferring relationship centered on node at 8:00 am. As node is a major residential area, it is the starting location from which many people leave home and head to several business districts and entertainment places, such as nodes and , which cover a business district and the Shanghai Century Park, respectively.
Figure 8(b) demonstrates the passenger mobility centered on node , which is a famous business district called Shanghai Xin Tian Di. Around 6:00 pm, a large number of passengers leave this area for other residential or nightlife districts, such as node and node . There are also passengers going to node , which has a large train station and an airport.
Figure 8(c) illustrates the transferring relationship centered on node at 9:00 pm. Compared with Figures 8(a) and 8(b), it clearly shows that there are passengers both going to and leaving node . This may be because also contains many nightlife venues like bars. There are also passengers travelling to node to take night trains or flights.
This section introduces the state-of-the-art studies related to our problem. We classify them into three categories.
Basically, our problem is a sequence-based prediction problem. There are already many existing studies in this field (Tong et al., 2017; Wang et al., 2017, 2019a; Wei et al., 2016; Yao et al., 2018; Hulot et al., 2018; Liu et al., 2017; Lai et al., 2018; Xu et al., 2020; Xia et al., 2018; Guo et al., 2019; Chen et al., 2020; Su et al., 2020), which provide much inspiration for us. With the development of ride-hailing applications, we can collect more accurate passenger demand data, instead of traditional taxi trajectory data, for research. Some studies (Tong et al., 2017; Wang et al., 2017, 2019a; Wei et al., 2016; Yao et al., 2018; Hulot et al., 2018; Liu et al., 2017; Xu et al., 2020) have focused on passenger demand prediction using such data.
Tong et al. (2017) put forward a unified linear regression model based on a distributed framework with more than 200 million features extracted from multi-source data to predict the passenger demand for each POI.
Yao et al. (2018) propose a deep multi-view spatial-temporal network framework to model both spatial and temporal relations. Specifically, their model consists of three views: a temporal view (modelling correlations between future demand values and near time points via LSTM), a spatial view (modelling local spatial correlation via a local CNN), and a semantic view (modelling correlations among regions sharing similar temporal patterns). Wang and Wei et al. (Wang et al., 2019a; Wei et al., 2016) present a combined model to predict the passenger demand in a region at a future time slot; the model captures the temporal trend with a novel parameter and fuses the spatial and other related features via an ANN based on multi-source data. Lai et al. (2018) propose a novel framework, namely the Long- and Short-term Time-series Network (LSTNet), which uses a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) to extract short-term local dependency patterns among variables and to discover long-term patterns in time series trends. There are some other works
(Hulot et al., 2018; Liu et al., 2017) based on bike-sharing systems. Hulot et al. (2018) focus on predicting the hourly demand for bike rentals and returns at each station of a bike-sharing system, and then on determining the decision intervals that are often used by bike-sharing companies for their online rebalancing operations. Liu et al. (2017) develop a hierarchical station bike demand predictor which analyzes bike demands from the functional-zone level down to the station level. Moreover, Zhang et al. (2017, 2020) design different models to predict the inflow and outflow of people in a given area, but they still ignore the transferring relationships between different areas.

All the aforementioned methods have their own advantages, but they treat prediction problems as pure time series prediction and ignore the intrinsic connections and mobility between different areas, as well as traffic interactions. Meanwhile, most of them rely on sufficient multi-source data, which makes their models less general and harder to reproduce.
As traffic runs on networks consisting of lines and nodes, traffic prediction can be naturally modeled as a graph problem. Graph-based methods (Deng et al., 2016; Hu et al., 2020; Gong et al., 2018; Li et al., 2018a; Liu et al., 2019; Shi et al., 2020; Wang et al., 2019b; Geng et al., 2019; Seo et al., 2018; Jiang et al., 2018; Su et al., 2019; Yu et al., 2018; Li et al., 2018b; Jin et al., 2020; Liang and Zhang, 2020; Wang et al., 2021) can then be used to solve it.
Deng et al. (2016) define traffic prediction over the road network: given a series of road network snapshots, they propose a latent space model for road networks to learn the attributes of nodes in latent spaces, capturing both topological and temporal properties. Li et al. (2018a) propose to model traffic flow as a diffusion process on a directed graph and introduce a diffusion convolutional recurrent neural network for traffic forecasting that incorporates both spatial and temporal dependencies in the traffic flow. Wang et al. (2019b) design a grid-embedding-based multi-task learning model, where the grid embedding models the spatial relationships of different areas and the LSTM-based multi-task learning focuses on modelling temporal attributes and alleviating data sparsity through subtasks. Geng et al. (2019) present a spatial-temporal multi-graph convolution network, where they first encode the non-Euclidean pairwise correlations among regions into multiple graphs and then explicitly model these correlations using multi-graph convolution. To utilize global contextual information in modelling the temporal correlation, they further propose a contextual gated recurrent neural network, which augments the recurrent neural network with a context-aware gating mechanism to reweight different historical observations. Seo et al. (2018)
propose the Graph Convolutional Recurrent Network (GCRN), a deep learning model able to predict structured sequences of data. Precisely, GCRN is a generalization of classical recurrent neural networks (RNN) to data structured by an arbitrary graph.
Jiang et al. (2018) build an online system via RNNs to conduct short-term mobility predictions using currently observed human mobility data. Yu et al. (2018) propose a novel deep learning framework, spatio-temporal graph convolutional networks, to tackle the time series prediction problem in the traffic domain. Instead of applying regular convolutional and recurrent units, they formulate the problem on graphs and build the model with complete convolutional structures, which enables much faster training with fewer parameters.

These works have either of the following two problems. First, although they define traffic prediction as a graph-based problem, they fail to provide a good representation of the graph; in other words, they do not fully use the attributes of the graph, i.e., dynamics, direction, and weight. Second, most of them utilize traditional transductive graph convolution, which is not friendly to cold-start nodes, i.e., nodes that have no interactions with others.
Recently, the development of graph representation learning methods (Perozzi et al., 2014; Tang et al., 2015; Hamilton et al., 2017; Veličković et al., 2017; Yin et al., 2017; Trivedi et al., 2019; Zhou et al., 2018; Zhang et al., 2018; Goyal et al., 2018; Ma et al., 2018; Zhao et al., 2018; Yin et al., 2019; Chen et al., 2019; Li et al., 2020; Liu et al., 2020; Zheng et al., 2020) has offered new ideas for solving graph-based problems.
The well-known methods DeepWalk (Perozzi et al., 2014), LINE (Tang et al., 2015) and node2vec (Grover and Leskovec, 2016) are all transductive and focused on static graphs. Hamilton et al. (2017) propose an inductive method to learn representations on large graphs, which solves the cold-start problem but still focuses only on the spatial perspective and ignores the dynamic attribute of graphs. Veličković et al. (2017) employ an attention mechanism in the node embedding process, which gives us much inspiration, but it is also based on static graphs. Trivedi et al. (2019) build a novel modelling framework for dynamic graphs that posits representation learning as a latent mediation process bridging two observed processes, namely the dynamics of the network and the dynamics on the network. This model is further parameterized by a temporal-attentive representation network that encodes temporally evolving structural information into node representations, which in turn drives the nonlinear evolution of the observed graph dynamics. Zhang et al. (2018) propose a new network architecture, gated attention networks (GaAN). Unlike the traditional multi-head attention mechanism, which consumes all attention heads equally, GaAN uses a convolutional subnetwork to control each attention head's importance. The two studies above both propose novel frameworks over dynamic graphs, but they do not support the directed and weighted attributes.
Goyal et al. (2018) present an efficient algorithm, DynGEM, based on recent advances in deep autoencoders for graph embeddings. The major advantages of DynGEM include: the embedding is stable over time; it can handle growing dynamic graphs; and it has a better running time than applying static embedding methods to each snapshot of a dynamic graph. Ma et al. (2018) propose a Deeply Transformed High-order Laplacian Gaussian Process (DepthLGP) method to infer embeddings for out-of-sample nodes in dynamic graphs. DepthLGP combines the strengths of nonparametric probabilistic modelling and deep learning. However, these two works only focus on representation learning over dynamic and weighted graphs and neglect the directions of edges.
In this paper, we define the passenger demand prediction problem from a new perspective, based on dynamic, directed and weighted graphs. To tackle this problem, we design a spatial-temporal attention network, Gallat, which includes a spatial attention layer, a temporal attention layer and a transferring attention layer. From the spatial perspective, the spatial attention layer mimics the message passing process to formulate representation learning for each node by aggregating the information of all its neighbors. In this process, we add pre-weighted functions for three kinds of neighbors and utilize the attention mechanism to calculate the importance of different neighbors. From the temporal perspective, the temporal attention layer combines the learned representations of historical time slots and different channels via self-attention. Finally, we first predict the passenger demand within an area, and then learn a transferring probability via the attention mechanism to obtain the final passenger mobility in the future time slots. We conduct extensive experiments to evaluate our model, and the results demonstrate that it significantly outperforms all the baselines.