I Introduction
Intelligent Transportation Systems (ITS) aim to explore better transportation options for human beings and better relationships among users, vehicles and transportation infrastructure [6, 7]. Nowadays, with massive spatiotemporal data, artificial intelligence plays an increasingly important role in ITS by leveraging data-driven methods to analyze traffic patterns, and has obtained promising results in many ITS tasks [31, 26, 9].
Estimated Time of Arrival (ETA) is one of the most fundamental and challenging problems in ITS. It is defined as predicting the travel time from an origin location to a destination location along a given route. An ETA model enables the transportation system to efficiently schedule vehicles and thus mitigate growing urban traffic congestion [5]. Due to the rapid growth of ride-hailing apps such as Uber and DiDi, ETA has attracted increasing attention in recent years. An accurate ETA system can significantly improve the operating efficiency of ride-hailing platforms by informing route planning, navigation, carpooling, vehicle dispatching and scheduling. The left part of Fig. 1 shows a real case of ETA.
Existing ETA methods can be divided into two categories. The first one is the additive methods, which explicitly predict the travel time for each road segment and give the total travel time of a route by summing the segments' travel times. These methods have intuitive interpretability, but the prediction may be inaccurate when local errors accumulate. The other one is the overall methods, which directly predict the overall travel time of the route by formulating ETA as a regression problem. For example, the Wide-Deep-Recurrent (WDR) model [26] uses a neural network to predict the travel time based on a rich set of input features. This kind of method avoids local error accumulation but has relatively weak interpretability because of the black-box model.
We refer to road segments as links in the remainder of this paper. The embedding technique [1, 18, 19] is widely used, especially in deep learning ETA models, to capture the spatiotemporal patterns of links, as the link is one of the most fundamental elements in the road network. Each link is represented by an embedding vector which encodes the link's semantic information through sufficient iterations during the training process. Though ride-hailing platforms collect millions of trajectories per day, the embedding vectors still suffer from the data sparsity problem of the road network: many links are traversed by too few floating cars. For such cold links, which are covered by few trajectories, the training of the embedding vectors may end in an underfitting status. Thus, the travel time estimation may have large errors if a route goes through cold links.
To alleviate the data sparsity problem, we propose a novel ETA model named RNML-ETA. The model leverages multi-task learning [2] and consists of a main task predicting the travel time and an auxiliary task performing metric learning, in which the similarity between links is measured by their speed distributions. Via metric learning, similar links get close and dissimilar links get far away in the embedded space. Thus, the embedding vectors of cold links receive sufficient training, which significantly improves the ETA accuracy. Moreover, we propose a novel loss function for metric learning, the triangle loss, to take more interaction into consideration in one update. To achieve this, we switch the roles of links among the anchor, positive and negative samples. A conceptual demonstration of RNML-ETA is given in Fig. 1.
The main contributions of this paper are threefold:

To the best of our knowledge, RNML-ETA is the first deep learning method that effectively addresses the data sparsity problem of the road network.

We propose a novel metric learning framework to improve the quality of link embedding vectors. The similarity of links is measured using their speed distributions, which can be computed from existing ETA data and require no extra information. We also propose the novel triangle loss to improve the efficiency of metric learning.

We conducted a comprehensive evaluation of our method on large-scale real-world datasets containing over 100 million trajectories. The experimental results validate that RNML-ETA significantly improves the performance compared to a state-of-the-art deep learning method.
II Related Work
Estimated Time of Arrival.
As one of the fundamental problems in intelligent transportation systems, ETA has been studied extensively in both the academic and industrial communities. ETA models can be divided into two categories. The first category is the additive methods, which explicitly estimate the travel time for each link and give the prediction for a route by summing the links' travel times. Rule-based methods can be used to estimate the link travel time; for example, a simple rule dividing the link length by the link travel speed is widely used in industry. Learning-based methods, such as the dynamic Bayesian network [11], gradient boosted regression trees [29], least-square minimization [28] and pattern matching [3], are also used to mine traffic patterns and predict the link's travel time. The data sparsity problem of the road network is discussed in [25]: a part of the links are traversed by too few trajectories. To alleviate the data sparseness, the authors of [25] propose to represent the trips as a tensor and utilize tensor decomposition to complete the missing values. However, dealing with data sparsity is still a challenging problem for ETA.
The second category is the overall methods, which directly predict the overall travel time of the given route. Early methods such as TEMP [24] and the time-dependent landmark graph [27] use traditional machine learning to predict the travel time. Recently, due to the bloom of deep learning [16, 13, 15], neural network models for ETA have developed rapidly. MURAT [17] uses feedforward neural networks to predict the travel time from the origin to the destination without a given path. Multi-task learning and graph embedding are used in MURAT to narrow the accuracy gap to the path-based methods. DeepTTE [23] proposes a geo-convolution operation to encode the coordinate information and uses a recurrent neural network to learn the travel time along a GPS sequence. Since the GPS sequence cannot be acquired until the trip is finished, DeepTTE resamples the GPS points by uniform distance at the training stage and generates pseudo points according to a planned route at the inference stage. The WDR model [26] uses a wide linear part and a deep neural network to learn the trip-level information, and a recurrent neural network to learn the fine-grained sequential information in the route. The authors of [8, 14] transform the map information into image sequences, and adopt convolutional neural networks to mine spatial correlations for ETA. In these deep learning methods, the embedding of geographical elements, such as the link embedding in [17, 26] and the grid embedding in [30], plays an important role. The embedding technique suffers from the data sparsity problem as well, because insufficient data leaves the embedding vectors in an underfitting status.
Metric learning.
The goal of metric learning is to learn a representation function that maps objects into an embedded space. The distance in the embedded space should preserve the objects' similarity: similar objects get close and dissimilar objects get far away. Various loss functions have been developed for metric learning. For example, the contrastive loss [4] guides objects from the same class to be mapped to the same point, and objects from different classes to points whose distances are larger than a margin. The triplet loss [21] is also popular; it requires the distance between the anchor sample and the positive sample to be smaller than the distance between the anchor sample and the negative sample. The case with one positive sample and multiple negative samples is extended in [22]. Metric learning often suffers from slow convergence, partially because the loss captures only limited interaction in one update.
III Methodology
We describe the road network as a set of links $\{l_1, l_2, \ldots, l_N\}$, where $N$ is the total number of links in the map and the link ID ranges from 1 to $N$. We then give the definition of the ETA learning problem, which is essentially a regression task:
Definition III.1
ETA Learning. Suppose we have a collection of historical trips $\{(s_i, a_i, u_i, p_i)\}_{i=1}^{M}$, where $M$ is the total trip number, $s_i$ is the departure time, $a_i$ is the arrival time, $u_i$ is the driver ID and $p_i$ is the travel path of the $i$-th trip. Our goal is to fit a model that can predict the travel time given the departure time, the driver ID and the travel path. The ground-truth travel time can be computed as $y_i = a_i - s_i$. The travel path is represented as a sequence of links $p_i = (v_{i,1}, v_{i,2}, \ldots, v_{i,T_i})$, where $v_{i,j}$ is the ID of the $j$-th link in the $i$-th sequence and $T_i$ is the length of $p_i$.
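As a minimal illustration of Definition III.1, the label for one trip is just the difference between the arrival and departure timestamps (the record layout below is hypothetical, not the paper's data schema):

```python
from datetime import datetime

def travel_time_seconds(departure: datetime, arrival: datetime) -> float:
    # Ground-truth travel time y_i = a_i - s_i, in seconds.
    return (arrival - departure).total_seconds()

# Hypothetical trip record: (departure s_i, arrival a_i, driver ID u_i, link-ID path p_i).
trip = (datetime(2018, 6, 1, 8, 0, 0), datetime(2018, 6, 1, 8, 12, 30), "driver_42", [17, 3, 98])
y = travel_time_seconds(trip[0], trip[1])  # 750.0 seconds
```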
We introduce the overall framework of the proposed method in Section III-A, define the measurement of link similarity in Section III-B and introduce the details of our metric learning loss in Section III-C.
III-A Overall Framework
We first construct a rich feature set from the raw trip information. For example, from the departure time we can obtain the time slice in a day (every 5 minutes) and the day of the week. The features fall into two types: (1) the sequential features, which are extracted from the travel path $p_i$. For a link $v_{i,j}$, we denote its feature vector as $x_{i,j}$, and obtain a feature matrix $X_i$ for the $i$-th trip. Note that the sequential features have variable size; in other words, the number of columns of $X_i$ is decided by the path length; and (2) the non-sequential features, which are irrelevant to the travel path, e.g. the day of the week. They are represented as a feature vector $z_i$ with fixed size.
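For instance, the two non-sequential features mentioned above can be derived from the departure time as follows (a sketch; the exact encoding used by the model is an assumption):

```python
from datetime import datetime

def time_features(departure: datetime) -> dict:
    """Derive the 5-minute time slice of the day and the day of the week."""
    minutes = departure.hour * 60 + departure.minute
    return {
        "time_slice": minutes // 5,         # 0..287: 288 five-minute slices per day
        "day_of_week": departure.weekday(), # 0 = Monday .. 6 = Sunday
    }

feats = time_features(datetime(2018, 6, 1, 8, 12))  # 8:12 a.m. on a Friday
```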
The link embedding vector is an important component of the link feature vector $x_{i,j}$. For a link with ID $= v$, we look up an embedding table $E \in \mathbb{R}^{d \times N}$ and use its $v$-th column $e_v$ as a distributed representation of the link [1]. The table $E$ is randomly initialized and is updated during training by gradient descent to encode the semantic information of links. The link feature vector is a concatenation of $e_v$, the link length $c_{i,j}$ and the link's travel speed $s_{i,j}$:

$$x_{i,j} = \big[e_v^\top, c_{i,j}, s_{i,j}\big]^\top. \qquad (1)$$

The link's length is obtained by geographical survey, and the travel speed is the average speed of the floating cars that traversed the link within the latest time window (e.g. 10 minutes).
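Eq. (1) amounts to an embedding-table lookup followed by concatenation. A toy NumPy sketch (the table size, dimension and values are placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 8               # toy road network: N links, d-dimensional embeddings
E = rng.normal(size=(d, N))  # embedding table, one column per link; randomly initialized

def link_feature(v: int, length_m: float, speed_kmh: float) -> np.ndarray:
    # Eq. (1): the v-th embedding column concatenated with the link length
    # and the latest average travel speed.
    return np.concatenate([E[:, v], [length_m, speed_kmh]])

x = link_feature(v=17, length_m=120.0, speed_kmh=35.5)  # shape (d + 2,)
```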
The amount of data significantly affects the quality of embedding vectors. For example, in natural language processing, word2vec [19] cannot generate meaningful embedding vectors for rare words that occur in very few sentences. In ride-hailing platforms, the data coverage of the road network is still not satisfactory even though there are already millions of floating cars. A part of the links are traversed by only a few or even zero trajectories. We refer to the links traversed by plenty of trips as hot links, and those traversed by only a few or even zero trips as cold links. The hot links' embedding vectors can be well trained with sufficient iterations. However, the training of cold links' embedding vectors often ends in an underfitting status, which undermines the accuracy of ETA prediction.

To improve the embedding quality of cold links, we propose Road Network Metric Learning ETA (RNML-ETA), whose training process consists of two tasks. The main task is to predict the travel time, while the auxiliary task regularizes the link embedding vectors by transferring knowledge of road network patterns from hot links to cold links. The metric learning in the auxiliary task helps to place the embedding vector of a cold link in a proper position in the embedded space by reducing its distance to similar hot links. The loss function of RNML-ETA is:
$$L = L_{\text{main}} + w \cdot L_{\text{aux}}, \qquad (2)$$

where $w$ is a hyperparameter balancing the trade-off between the main task and the auxiliary task.
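Eq. (2) is a plain weighted sum of the two task losses; a one-line sketch using the pickup task weight reported in Section IV-C (the loss values themselves are made up):

```python
def total_loss(main_loss: float, aux_loss: float, w: float) -> float:
    # Eq. (2): ETA regression loss plus w times the metric-learning loss.
    return main_loss + w * aux_loss

L = total_loss(main_loss=0.19, aux_loss=0.40, w=0.52)  # w = 0.52 is the pickup setting
```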
We choose the Wide-Deep-Recurrent (WDR) model [26], a state-of-the-art ETA model, to accomplish the main task. The three components of the WDR model are: (1) a wide module that memorizes historical patterns in the data by constructing second-order cross products and an affine transformation of the non-sequential features $z_i$; (2) a deep module that improves the generalization ability by feeding $z_i$ into a Multi-Layer Perceptron (MLP), which is a stack of fully-connected layers with ReLU [13] activation functions; and (3) a recurrent module that provides fine-grained modeling of the sequential features $X_i$ via a Long Short-Term Memory network (LSTM) [10], which can capture the spatial and temporal dependency between links.

We denote the output of the wide module as $o_i^{\text{wide}}$, the output of the deep module as $o_i^{\text{deep}}$, and the last hidden state of the LSTM as $h_{i,T_i}$. The travel time prediction is given by a regressor, which is also an MLP, applied to the concatenation of these outputs:
$$\hat{y}_i = \text{MLP}\big(\big[o_i^{\text{wide}}; o_i^{\text{deep}}; h_{i,T_i}\big]\big). \qquad (3)$$
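A sketch of the regressor in Eq. (3): concatenate the three module outputs and feed them through a small ReLU MLP (the weights below are random placeholders, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp(x: np.ndarray, layers) -> np.ndarray:
    # Stack of fully-connected layers with ReLU; the last layer is linear.
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

h = 128  # hidden state size used throughout the model
o_wide, o_deep, h_last = (rng.normal(size=h) for _ in range(3))
joint = np.concatenate([o_wide, o_deep, h_last])  # Eq. (3): concatenated outputs
layers = [(0.01 * rng.normal(size=(h, 3 * h)), np.zeros(h)),
          (0.01 * rng.normal(size=(1, h)), np.zeros(1))]
y_hat = mlp(joint, layers)  # scalar travel time prediction
```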
The hidden state sizes in the deep module, the LSTM and the regressor MLP are all set to 128. The hidden state and memory cell of the LSTM are initialized as zeros. We choose the Mean Absolute Percentage Error (MAPE) as the loss function of the main task:

$$L_{\text{main}} = \frac{1}{M} \sum_{i=1}^{M} \frac{|\hat{y}_i - y_i|}{y_i}, \qquad (4)$$
where $y_i$ is the ground-truth travel time. The overall architecture of RNML-ETA and the main task workflow are visualized in Fig. 2. The details of the auxiliary task are introduced in the following sections.
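The MAPE objective of Eq. (4) translates directly to code:

```python
import numpy as np

def mape(y_hat: np.ndarray, y: np.ndarray) -> float:
    # Eq. (4): mean absolute percentage error over the trips.
    return float(np.mean(np.abs(y_hat - y) / y))

loss = mape(np.array([110.0, 95.0]), np.array([100.0, 100.0]))  # (0.10 + 0.05) / 2
```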
III-B Link Similarity
To apply metric learning to the link embedding vectors, a similarity measurement for links must be defined. Since a link's travel speed essentially reflects how long a car is expected to take to pass through the link, the speed distribution across different times can be used to depict the traffic characteristic of the link. We construct a series of time bins $B_1, \ldots, B_K$ for a day. These time bins are non-overlapping, $B_k \cap B_{k'} = \emptyset$ for $k \neq k'$, and their union covers the whole day. We then compute the average travel speed for link $v$ and time bin $B_k$:

$$\bar{s}_{v,k} = \frac{\sum_{i,j} s_{i,j} \cdot \mathbb{1}[v_{i,j} = v] \cdot \mathbb{1}[s_i \in B_k]}{\sum_{i,j} \mathbb{1}[v_{i,j} = v] \cdot \mathbb{1}[s_i \in B_k]}, \qquad (5)$$

where $s_{i,j}$ is the travel speed feature of the $j$-th link in the $i$-th trip, and $\mathbb{1}[\cdot]$ is an indicator that equals 1 if the condition is satisfied and 0 otherwise. Intuitively, we select the subset of link $v$'s travel speed features whose departure times belong to the time bin $B_k$, and then compute the average over the subset. In practice, we use a configuration of $K = 3$ time bins: 5 a.m. to 11 a.m. representing the morning peak, 4 p.m. to 10 p.m. representing the evening peak, and the remaining hours representing the off-peak time.
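The per-bin average of Eq. (5) can be sketched as below; the toy records and the helper `bin_of` follow the three-bin configuration just described (the record layout is hypothetical):

```python
import numpy as np

# Toy records: (link ID, departure hour, observed travel speed in km/h).
records = [(7, 8, 30.0), (7, 9, 34.0), (7, 18, 22.0), (7, 2, 55.0)]

def bin_of(hour: int) -> int:
    if 5 <= hour < 11:
        return 0  # morning peak, 5 a.m. to 11 a.m.
    if 16 <= hour < 22:
        return 1  # evening peak, 4 p.m. to 10 p.m.
    return 2      # off-peak, the remaining hours

def avg_speed(v: int, k: int) -> float:
    # Eq. (5): average the speeds of link v whose departures fall in bin k.
    speeds = [s for (link, hour, s) in records if link == v and bin_of(hour) == k]
    return float(np.mean(speeds)) if speeds else float("nan")

morning = avg_speed(7, 0)  # (30.0 + 34.0) / 2
```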
We further scale the speeds to lie within $[0, 1]$ by applying min-max normalization, $\tilde{s}_{v,k} = (\bar{s}_{v,k} - s_{\min}) / (s_{\max} - s_{\min})$, where $s_{\min}$ and $s_{\max}$ are the minimum and maximum of the $\bar{s}_{v,k}$. We finally obtain a normalized speed histogram of link $v$:

$$h_v = \big(\tilde{s}_{v,1}, \tilde{s}_{v,2}, \ldots, \tilde{s}_{v,K}\big). \qquad (6)$$
A difference matrix $D$ can then be computed as:

$$D_{u,v} = \| h_u - h_v \|_2, \qquad (7)$$

where $D_{u,v}$ is the element of $D$ measuring the difference between the links with ID $= u$ and ID $= v$. A smaller difference means a larger similarity. The similarity based on speed histograms has two advantages. First, ETA is mostly determined by the traffic condition and only partially influenced by personalized factors such as driving habits. The latest average speed is a good reflection of the traffic condition: if two links have similar speed distributions, they should also have similar impact on the ETA prediction. Second, the speed histogram does not rely on any extra information and can be computed directly from the data used in the main task, which simplifies the implementation.
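Eq. (7) is a pairwise Euclidean distance over the speed histograms, computable with one broadcast (the histogram values here are made up):

```python
import numpy as np

# Toy normalized speed histograms h_v, one row per link (K = 3 bins).
H = np.array([[0.2, 0.5, 0.3],
              [0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3]])

# Eq. (7): D[u, v] = || h_u - h_v ||_2 for every pair of links.
D = np.linalg.norm(H[:, None, :] - H[None, :, :], axis=-1)
```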
III-C Triangle Loss
Links with similar characteristics are expected to be closer in the embedded space, and those with dissimilar characteristics are expected to be farther apart. To this end, we propose a novel metric learning loss function, named the triangle loss. Suppose we have three links with IDs $a$, $b$ and $c$ and the corresponding differences $D_{a,b}$, $D_{b,c}$ and $D_{a,c}$. Without loss of generality, we assume:

$$D_{a,b} \leq D_{b,c} \leq D_{a,c}. \qquad (8)$$
We then compute the Euclidean distances between the embedding vectors of links $a$, $b$ and $c$. For example:

$$d_{a,b} = \| \hat{e}_a - \hat{e}_b \|_2, \qquad (9)$$

where $\hat{e}_v = e_v / \|e_v\|_2$ is the L2-normalized embedding vector. The three distances $d_{a,b}$, $d_{b,c}$ and $d_{a,c}$ form a triangle. We aim to restrict the lengths of the triangle edges to follow the same order as in Eq. 8, which derives three inequalities:
$$d_{a,b} + m_1 \leq d_{b,c}, \quad d_{b,c} + m_2 \leq d_{a,c}, \quad d_{a,b} + m_3 \leq d_{a,c}, \qquad (10)$$

where $m_1$, $m_2$ and $m_3$ are required margins. Unlike the triplet loss [21], which has only one restriction (the distance between the anchor and the positive sample should be smaller than the distance between the anchor and the negative sample), the links in our method take turns acting as the anchor. This enables more efficient metric learning in one update and thus accelerates convergence. Fig. 3 gives a visualized demonstration. The triangle loss takes the form:
$$L_{\text{aux}} = \frac{1}{|\mathcal{T}|} \sum_{(a,b,c) \in \mathcal{T}} \Big( w_1 (d_{a,b} - d_{b,c} + m_1)_+ + w_2 (d_{b,c} - d_{a,c} + m_2)_+ + w_3 (d_{a,b} - d_{a,c} + m_3)_+ \Big), \qquad (11)$$

where the operator $(x)_+ = \max(x, 0)$, $|\mathcal{T}|$ is the number of possible triangles in the training set, and $w_1$, $w_2$ and $w_3$ are hyperparameters adjusting the weights of the three distance restrictions. The auxiliary task and the main task are optimized simultaneously via gradient descent. For a mini-batch of trips, we first compute the loss of the main task, and then compute the auxiliary loss by randomly forming triangles from all the links in the trips.
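A minimal implementation of the triangle loss for a single triple, assuming the three links are already ordered as in Eq. 8 (the margin and weight values are illustrative, not the tuned ones):

```python
import numpy as np

def triangle_loss(e_a, e_b, e_c, margins=(0.1, 0.1, 0.2), weights=(1.0, 1.0, 1.0)):
    """Triangle loss (Eq. 11) for one triple (a, b, c) whose speed-histogram
    differences satisfy D_ab <= D_bc <= D_ac. Margins/weights are illustrative."""
    # L2-normalize the embedding vectors before measuring distances (Eq. 9).
    ea, eb, ec = (v / np.linalg.norm(v) for v in (e_a, e_b, e_c))
    d_ab = np.linalg.norm(ea - eb)
    d_bc = np.linalg.norm(eb - ec)
    d_ac = np.linalg.norm(ea - ec)
    m1, m2, m3 = margins
    w1, w2, w3 = weights
    # Hinge on each of the three orderings of Eq. (10); every link acts as anchor once.
    return (w1 * max(d_ab - d_bc + m1, 0.0)
          + w2 * max(d_bc - d_ac + m2, 0.0)
          + w3 * max(d_ab - d_ac + m3, 0.0))

loss = triangle_loss(np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0]))
```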
IV Experiment
The evaluation is conducted on large-scale real-world datasets collected on the DiDi platform. We introduce the datasets, the competing methods, the implementation details and the experimental results in sequence.
IV-A Dataset
We collected massive floating car trajectories of Beijing in 2018 on the DiDi platform. The trajectories are split into pickup and trip datasets according to the driver's working status. A pickup trajectory starts when a driver responds to a passenger's request and ends when he/she picks up the passenger. A trip trajectory starts when the passenger gets on board and ends upon arriving at the destination. For each dataset, we use 25 weeks of data as the training set and the following 2 weeks as the validation set and test set, respectively. We remove outliers with extremely short travel time (< 60 s) or extremely high average speed (> 120 km/h). The data statistics are summarized in Table I.

TABLE I: Data statistics.

                    size      pickup   trip
training set        25 weeks  111.0M   105.5M
validation set      1 week    4.0M     4.5M
test set            1 week    4.1M     3.9M
# traversed links             1.2M     1.3M
The links come from a wide range of roads, such as private community roads, local streets and urban freeways. As shown in Table I, the trip dataset covers more links than the pickup dataset. However, both datasets suffer from the road network sparsity problem: most of the links are short of data. To demonstrate this, we plot the histogram of link coverage frequency in Fig. 4. Even with over 0.1 billion trajectories, a significant number of cold links are traversed only a few times in about half a year (25 weeks). The median link coverage frequencies are 42 on pickup and 69 on trip.
IV-B Competing Methods
We compare the proposed RNML-ETA with the following competitors.
(1) RouteETA: a representative method in industrial applications. In this solution, the travel time estimation for each link is made by dividing the link length by the link travel speed. The waiting time at each intersection is mined from historical data. Given a route, the total travel time is predicted as the sum of each link's travel time and each intersection's waiting time. RouteETA has very fast inference speed, but its accuracy is often far from satisfactory compared to deep learning methods.
(2) WDR [26]: a deep learning method achieving state-of-the-art performance on the ETA problem. Since it is the model used in our main task, the comparison between WDR and RNML-ETA evaluates the benefit of the auxiliary task.
(3) WDR-no-link-emb: a variant of WDR that removes the link embedding technique. The main purpose of this model is to quantify the contribution of the link embedding vectors, whose quality RNML-ETA aims to improve.
Besides the Mean Absolute Percentage Error (MAPE), which is used as the objective function of the main task, we also take the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) as evaluation metrics:

$$\text{MAE} = \frac{1}{M} \sum_{i=1}^{M} |\hat{y}_i - y_i|, \qquad \text{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (\hat{y}_i - y_i)^2}. \qquad (12)$$
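The two extra metrics of Eq. (12) in code:

```python
import numpy as np

def mae(y_hat: np.ndarray, y: np.ndarray) -> float:
    # Mean absolute error, in seconds.
    return float(np.mean(np.abs(y_hat - y)))

def rmse(y_hat: np.ndarray, y: np.ndarray) -> float:
    # Root mean square error, in seconds.
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))

y_hat, y = np.array([110.0, 90.0]), np.array([100.0, 100.0])
m, r = mae(y_hat, y), rmse(y_hat, y)  # both 10.0 on this symmetric example
```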
IV-C Implementation Details
The neural networks in WDR, WDR-no-link-emb and RNML-ETA are implemented in PyTorch [20], and the training is accelerated on a single NVIDIA P40 GPU. We use a mini-batch size of 256 and set the maximal iteration number to 7 million. The hyperparameters of RNML-ETA are selected by the results on the validation set. The margins $m_1$, $m_2$, $m_3$ and the weights $w_1$, $w_2$, $w_3$ of the triangle loss are shared by the pickup and trip datasets. The task weight $w$ is 0.52 for pickup and 0.35 for trip. All the parameters, such as the MLP weights and the embedding vectors, are jointly trained using the Adam optimizer [12], a stochastic gradient descent method that adaptively adjusts the step size according to historical gradients and thus accelerates convergence. The learning rate is set to 0.0002.

IV-D Experimental Results
We list the results for the pickup data in Table II and the trip data in Table III, and mark the best scores in bold. The proposed RNML-ETA outperforms all competitors on both datasets: the metric learning component significantly improves the main task model's accuracy in predicting the travel time, reducing both the RMSE on pickup data and the MAPE on trip data compared to WDR. The importance of the link embedding technique is also validated, as it reduces the MAPE on both pickup and trip data (WDR-no-link-emb vs. WDR). Moreover, there is a large performance gap between the simple rule-based RouteETA and the deep learning models.
TABLE II: Results on the pickup data.

                  MAPE (%)  MAE (sec)  RMSE (sec)
RouteETA          –         69.008     106.966
WDR-no-link-emb   –         59.018     95.876
WDR               –         54.686     89.976
RNML-ETA          19.215    53.546     87.617
TABLE III: Results on the trip data.

                  MAPE (%)  MAE (sec)  RMSE (sec)
RouteETA          –         150.560    248.736
WDR-no-link-emb   –         117.337    197.652
WDR               –         108.919    186.083
RNML-ETA          11.597    108.519    185.897
The results in Table II and Table III show the overall accuracy over all links. Since RNML-ETA mainly aims to improve the embedding quality of cold links, its contribution needs a finer evaluation that reports the metrics at different link coverage levels. Thus, we select a series of subsets from each dataset by restricting the link coverage frequency in the trajectories. Specifically, we keep a trajectory if at least a fixed fraction of the contained links have coverage frequencies less than a threshold $\tau$, and drop the trajectory otherwise. By varying $\tau$ from 50 to 500 on pickup data and from 300 to 750 on trip data in steps of 50, we obtain 10 subsets for each dataset. In subsets with lower $\tau$, the trajectories contain more cold links. We then compute the metrics on these subsets and plot the curves in Fig. 5.
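The subset construction can be sketched as a simple filter; the fraction `rho` is an assumption, since the exact value did not survive extraction:

```python
def keep_trajectory(coverages: list, tau: int, rho: float = 0.5) -> bool:
    """Keep a trajectory if at least a fraction rho of its links have
    coverage frequency below the threshold tau (rho = 0.5 is assumed)."""
    cold = sum(1 for c in coverages if c < tau)
    return cold / len(coverages) >= rho

trajectories = [[10, 20, 900], [800, 900, 950]]  # per-link coverage counts
subset = [t for t in trajectories if keep_trajectory(t, tau=50)]  # keeps the first
```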
We take Fig. 5 (a) as an example (the trends in the other subfigures are similar). As the threshold $\tau$ increases, the subsets include more hot links and the MAPE of WDR gradually decreases, a large variation for the ETA problem. This phenomenon shows that links covered by more trajectories do enjoy better prediction accuracy, and supports the existence of the road network data sparsity problem. On the subset with the smallest threshold, our method RNML-ETA outperforms WDR by more than 2 percentage points in terms of MAPE, whereas the gain on the overall MAPE (Table II) is less than 0.2 percentage points. This comparison validates the effectiveness of RNML-ETA: it mainly improves the performance on cold links, and it achieves consistent MAPE improvements on both the pickup and trip data.
IV-E Influence of Hyperparameters
To explore the influence of the hyperparameters, we plot the performance curves on the pickup data in Fig. 6, varying the margin $m_3$ and the task weight $w$, which are two representative hyperparameters. The remaining hyperparameters are configured as in Section IV-C.
The margin $m_3$ is a bit more special than $m_1$ and $m_2$, because it controls the gap between the longest edge and the shortest edge in the triangle loss. If this restriction is broken, the model is far from the expected status and needs a stronger gradient to update the parameters; we therefore usually set $m_3$ larger than the other two margins, and a moderate value achieves the best performance according to the curve in Fig. 6 (a). Moreover, RNML-ETA performs better than WDR over a wide range of $m_3$, which demonstrates that the superiority of RNML-ETA is not sensitive to the margin hyperparameter.
The task weight $w$ balances the trade-off between the main task and the auxiliary task. In the extreme cases, RNML-ETA degenerates to WDR if $w = 0$ and to a pure metric learning model as $w \to \infty$. Fig. 6 (b) shows that the advantage of RNML-ETA over WDR is robust over a wide range of $w$, and that the best performance is achieved at an intermediate value.
V Conclusion
In this paper, we propose a novel metric learning framework for ETA, named RNML-ETA, to address the data sparsity problem of the road network. In the main task, we use the WDR model to predict the travel time. In the auxiliary task, we first construct a difference matrix by computing the Euclidean distances between the links' speed distributions, and then use metric learning to pull similar links close and push dissimilar links far away in the embedded space. The auxiliary task aims to improve the quality of the link embedding vectors. We conduct experiments on two large-scale real-world datasets collected on the DiDi platform. The results validate the effectiveness of RNML-ETA, showing that it outperforms the state-of-the-art WDR model on all evaluation metrics. A further experiment examines the gains for different types of links and finds that RNML-ETA significantly improves the accuracy for routes containing cold links.
References
 [1] (2003) A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155. Cited by: §I, §IIIA.
 [2] (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §I.

 [3] (2013) Dynamic travel time prediction using pattern recognition. In 20th World Congress on Intelligent Transportation Systems. Cited by: §II.
 [4] (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, Vol. 1, pp. 539–546. Cited by: §II.
 [5] (2016) Understanding congested travel in urban areas. Nature communications 7 (1), pp. 1–8. Cited by: §I.
 [6] (2010) Intelligent transportation systems. IEEE Vehicular Technology Magazine 5 (1), pp. 77–84. Cited by: §I.
 [7] (2001) Towards the development of intelligent transportation systems. In ITSC (Cat. No. 01TH8585), pp. 1206–1211. Cited by: §I.
 [8] (2019) DeepIST: deep image-based spatio-temporal network for travel time estimation. In ACM CIKM, pp. 69–78. Cited by: §II.
 [9] (2019) Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In AAAI, Vol. 33, pp. 922–929. Cited by: §I.
 [10] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §III-A.
 [11] (2012) Learning the dynamics of arterial traffic from probe data using a dynamic Bayesian network. IEEE Transactions on Intelligent Transportation Systems 13 (4), pp. 1679–1693. Cited by: §II.
 [12] (2015) Adam: a method for stochastic optimization. ICLR, San Diego. Cited by: §IVC.
 [13] (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §II, §IIIA.
 [14] (2019) Travel time estimation without road networks: an urban morphological layout representation approach. In IJCAI, pp. 1772–1778. Cited by: §II.
 [15] (2009) Exploring strategies for training deep neural networks. Journal of machine learning research 10 (Jan), pp. 1–40. Cited by: §II.
 [16] (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §II.
 [17] (2018) Multi-task representation learning for travel time estimation. In SIGKDD, pp. 1695–1704. Cited by: §II.
 [18] (2013) Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In Interspeech, pp. 3771–3775. Cited by: §I.
 [19] (2013) Distributed representations of words and phrases and their compositionality. In NeurIPS, pp. 3111–3119. Cited by: §I, §IIIA.
 [20] (2019) PyTorch: an imperative style, high-performance deep learning library. In NeurIPS, pp. 8024–8035. Cited by: §IV-C.

 [21] (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823. Cited by: §II, §III-C.
 [22] (2016) Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS, pp. 1857–1865. Cited by: §II.
 [23] (2018) When will you arrive? estimating travel time based on deep neural networks. In AAAI, Cited by: §II.
 [24] (2016) A simple baseline for travel time estimation using large-scale trip data. In SIGSPATIAL GIS, pp. 61. Cited by: §II.
 [25] (2014) Travel time estimation of a path using sparse trajectories. In SIGKDD, pp. 25–34. Cited by: §II.
 [26] (2018) Learning to estimate the travel time. In SIGKDD, pp. 858–866. Cited by: §I, §I, §II, §IIIA, §IVB.
 [27] (2011) T-Drive: enhancing driving directions with taxi drivers' intelligence. IEEE Transactions on Knowledge and Data Engineering 25 (1), pp. 220–232. Cited by: §II.
 [28] (2013) Urban link travel time estimation using large-scale taxi data with partial information. Transportation Research Part C: Emerging Technologies 33, pp. 37–49. Cited by: §II.
 [29] (2016) Urban link travel time prediction based on a gradient boosting method considering spatiotemporal correlations. ISPRS International Journal of GeoInformation 5 (11), pp. 201. Cited by: §II.
 [30] (2018) DeepTravel: a neural network based travel time estimation model with auxiliary supervision. In IJCAI, pp. 3655–3661. Cited by: §II.
 [31] (2011) Data-driven intelligent transportation systems: a survey. IEEE Transactions on Intelligent Transportation Systems 12 (4), pp. 1624–1639. Cited by: §I.