
FMA-ETA: Estimating Travel Time Entirely Based on FFN With Attention

Estimated time of arrival (ETA) is one of the most important services in intelligent transportation systems and has become a challenging spatial-temporal (ST) data mining task in recent years. Deep learning based methods, specifically those based on recurrent neural networks (RNNs), have been adapted to model ST patterns from massive data for ETA and have become the state of the art. However, RNNs suffer from slow training and inference because their structure is unfriendly to parallel computing. To solve this problem, we propose a novel, concise and effective framework for ETA based mainly on feed-forward networks (FFN): FFN with Multi-factor self-Attention (FMA-ETA). The novel Multi-factor self-attention mechanism is proposed to deal with features of different categories and to aggregate their information purposefully. Extensive experimental results on a real-world vehicle travel dataset show that FMA-ETA is competitive with state-of-the-art methods in prediction accuracy while offering significantly better inference speed.



1 Introduction

Estimated time of arrival (ETA), or travel time prediction, is generally defined as estimating the travel time along a given route between an origin and a destination Wang et al. (2018b). As an essential component of artificial intelligence for transportation, ETA influences route planning, navigation and vehicle dispatching, which are fundamental for ride-hailing platforms such as DiDi and Uber Wang et al. (2018b, a). ETA is a representative and challenging sequence learning and data mining task that has attracted a lot of attention Wang et al. (2014); Hofleitner et al. (2012); Chen et al. (2013); Zhang et al. (2016); Wang et al. (2018b, a); Li et al. (2018).

Since 2018, deep learning LeCun et al. (2015) based methods Wang et al. (2018a); Li et al. (2018); Wang et al. (2018b), which significantly outperform non-deep-learning methods Wang et al. (2014); Hofleitner et al. (2012); Chen et al. (2013); Zhang et al. (2016), have mined spatial-temporal correlations concurrently and effectively from large-scale data and become the state of the art. The general sequential semantic information extractor in these methods, such as WDR Wang et al. (2018b), DeepTTE Wang et al. (2018a) and DeepTravel Zhang et al. (2018), is mainly one variant of the recurrent neural network (RNN) Hopfield (1982); Jordan (1997); Elman (1990), namely the Long Short-Term Memory network (LSTM) Hochreiter and Schmidhuber (1997). An RNN adopts a recurrent structure to model sequences and extract semantic information, which also limits its inference speed because the computation cannot be parallelized across time steps.

In this paper, we explore the possibility of relying mainly on FFN to mine spatial-temporal information from massive sequential data for ETA, as illustrated in Fig. 1. FFN is parallelizable and therefore naturally suited to fast ETA inference without giving up accuracy, which is an industry pain point for ride-hailing platforms. However, a model that depends entirely on FFN can hardly capture the dependencies between links.

Figure 1: The conceptual demonstration of ETA and two kinds of candidate sequence feature extractors based on deep learning. ETA refers to estimating the travel time along the given route between the origin and destination. In real application scenarios, the RNN, such as the current state-of-the-art LSTM, is complex and slow. FFN with our proposed Multi-factor Attention (FMA-ETA) is promising for the future of ETA due to its simplicity, high speed and effectiveness.

Is there a novel structure that helps FFN analyze sequence semantic information effectively while keeping its clear advantage in inference speed for ETA? Following this line, we present a novel Multi-factor Attention that is specially designed for ETA, a sequential learning task affected by various factors. FMA-ETA, which is mainly based on FFN with Multi-factor Attention, is proposed as a better sequence feature extractor than the RNN, which has been the state of the art since 2018.

The main contributions in this work are as follows:

  • We propose a novel deep learning based ETA framework, FMA-ETA, which is, to the best of our knowledge, the first deep learning framework for ETA entirely based on FFN with attention.

  • We propose a novel Multi-factor Attention mechanism for effectively learning the time dependency and semantic information between time steps of the sequence. Through extensive experiments, we find that for ETA, Multi-factor Attention is better than Multi-head attention Vaswani et al. (2017), which is well known in natural language processing. Besides, Multi-factor attention can be adopted for, and may also be promising in, other sequence learning tasks affected by various factors.

  • We evaluate FMA-ETA on a massive real-world dataset containing over 500 million trajectories from a well-known ride-sharing platform. The experimental results demonstrate that the estimation precision of FMA-ETA is comparable with that of the state-of-the-art RNN based method, WDR. In addition, FMA-ETA significantly improves inference speed over WDR.

We organize the paper as follows. Section 2 briefly summarizes the backgrounds of ETA, sequence learning and attention mechanism. Section 3 introduces the overall framework of FMA-ETA, followed by the description of the general Multi-factor attention in detail. In Section 4, we elaborate the reason why we propose Multi-factor attention. In Section 5, experimental result comparisons on the large-scale real-world dataset are presented to show the excellent accuracy and inference speed of FMA-ETA. Finally, this paper is concluded and the possibility of further work is analyzed in Section 6.

2 Background

In this section, we briefly review the background of our work, including estimated time of arrival and the attention mechanism.

2.1 Estimated time of arrival

Estimated time of arrival (ETA) is a challenging problem in the field of intelligent transportation systems. There are two representative families of methods for solving ETA: route-based methods and data-driven methods. The route-based method formulates the travel time of a given route as the sum of the time spent on each road segment and at each intersection. Traditional machine learning methods such as the dynamic Bayesian network Hofleitner et al. (2012), least-square minimization Zhan et al. (2013) and pattern matching Chen et al. (2013) are typical approaches to capturing the spatial-temporal features in the route-based method. However, dividing the original trajectory results in the accumulation of local errors. The data-driven method has shifted from traditional methods such as TEMP Wang et al. (2019) and the time-dependent landmark graph Yuan et al. (2011) to deep learning based methods Wang et al. (2018b); Fu et al. (2020). MURAT Li et al. (2018) uses multi-task learning and graph convolutional networks to assist a residual block in predicting the travel time from the departure point to the destination without a given trajectory. In recent years, researchers have further explored deep learning methods for ETA problems, such as DeepTravel Zhang et al. (2018), DeepTTE Wang et al. (2018a), DeepI2T Lan et al. (2019) and WDR Wang et al. (2018b). These methods model spatial information in different ways, but they all use LSTM Hochreiter and Schmidhuber (1997) to extract features from time series. However, the inference speed of a model with LSTM is too slow for many practical scenarios. In this work, we propose FMA-ETA to address this problem.

2.2 Attention mechanism

Attention is a very effective mechanism in natural language processing Bahdanau et al. (2014), image captioning Xu et al. (2015) and other research areas Ren et al. (2019). The attention mechanism has an outstanding ability to capture semantic dependencies. Common attention mechanisms include local attention, global attention Luong et al. (2015) and self-attention Vaswani et al. (2017). The Transformer Vaswani et al. (2017) is a sequence-to-sequence network entirely based on FFN with Multi-head self-attention. It achieves promising results in translation at a faster speed than RNN-based models, and self-attention has since become a hot topic in neural network attention research. Self-attention is calculated by:


$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

where $Q$, $K$ and $V$ are the query, key and value matrices, $d_k$ is the dimension of the key and query matrices, and in self-attention the key and query matrices are usually computed from the same input. Self-attention has proved useful in a wide variety of tasks including sequential recommendation Kang and McAuley (2018), reading comprehension Yu et al. (2018), speech recognition Salazar et al. (2019) and traffic flow prediction Zhu et al. (2018).
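As a minimal sketch of the scaled dot-product self-attention described above, the following pure-Python example uses the same toy sequence as query, key and value. The helper names (`matmul`, `softmax`, `scaled_dot_product_attention`) and the input values are illustrative assumptions, not code from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def scaled_dot_product_attention(q, k, v):
    """Return softmax(Q K^T / sqrt(d_k)) V together with the attention weights."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


# Toy sequence of 3 time steps with 2 features each (hypothetical values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = scaled_dot_product_attention(x, x, x)
```

Every row of `weights` sums to one, so each output step is a convex combination of all value rows. No step depends on the output of a previous step, which is what makes the computation parallelizable.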

Deep learning based ETA models are mostly RNN-based. A plain RNN has difficulty dealing with long-range dependencies. LSTM alleviates the problem to some extent, but in practice it still struggles with long-range dependencies. In addition, the inference time of LSTM-based models is too long for practical application, so it is important to introduce the attention mechanism to the ETA problem.

3 Model Architecture

We first give a precise mathematical definition of estimated time of arrival, following Wang et al. (2018b).

Definition 3.1 (Estimated time of arrival).

Given a collection of historical trips $D = \{(s_i, e_i, d_i, p_i)\}_{i=1}^{N}$, where $s_i$ is the departure time of the $i$-th trajectory, $e_i$ is the arrival time of the $i$-th trajectory, $d_i$ is the driver ID, $p_i$ is the link sequence set of the trajectory, and $N$ is the total number of samples, the ground truth travel time is computed by $y_i = e_i - s_i$. Here the link sequence set can be represented as $p_i = \{l_{i1}, l_{i2}, \dots, l_{iT_i}\}$, where $l_{ij}$ represents the $j$-th link in the $i$-th trajectory, and $T_i$ is the length of the link sequence set.
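To make the definition concrete, one trip record and its ground truth label could be represented as in the sketch below. The `Trip` class, its field names and the example values are hypothetical and only mirror the quantities named in the definition.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trip:
    departure_s: float   # departure time in seconds (hypothetical unit)
    arrival_s: float     # arrival time in seconds
    driver_id: int       # anonymized driver identifier
    link_ids: List[int]  # ordered link (road segment) IDs along the route


def travel_time(trip: Trip) -> float:
    """Ground truth travel time: arrival time minus departure time."""
    return trip.arrival_s - trip.departure_s


trip = Trip(departure_s=1000.0, arrival_s=1600.0, driver_id=42, link_ids=[7, 8, 9])
label = travel_time(trip)
```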

Since 2018, the main tool used by most state-of-the-art methods to capture spatial-temporal patterns for ETA has been the RNN (specifically, LSTM). The RNN is a well-known general sequence feature extractor for various sequence learning subfields, such as speech signal processing and natural language processing. In this paper, we break with this convention and present an ETA framework entirely based on FFN and a novel Multi-factor Attention, FMA-ETA. We introduce the overall structure of FMA-ETA and the proposed Multi-factor attention in the next two subsections.

3.1 Overall framework

The first main step is careful feature engineering, in which we follow Wang et al. (2018b). Rich features extracted from massive raw data are the key input for a deep learning model. Features can be divided into the following two categories.

(1) Global features are sparse, and each trajectory corresponds to one global representation, such as driver ID, day of week and departure time slice. Embedding Bengio et al. (2003) is adopted to reduce the dimensionality of the sparse features.

(2) Sequential features are related to each link of the trajectory, for instance, the length of the link (road segment), speed (road condition), link time (related to road condition) and the embedding of the link ID. These four factors influence ETA from different perspectives.

We then describe the overall framework of FMA-ETA, as shown in Fig. 2.

Figure 2: FMA-ETA two main components for sequential features: FFN and Multi-factor Attention.

Two main components of FMA-ETA are Multi-factor Attention and FFN.

(1) Sequential features are processed by our Multi-factor Attention, which is discussed in detail in the next subsection. This component fully explores the relationships between different links in each trajectory.

(2) The parallelizable FFN is the main reason for the simplicity and fast inference that are our greatest advantages over RNN. The front FFN is applied to each sequential factor to mine the spatial-temporal patterns in its own aspect, as well as to the concatenated factors. The last FFN aggregates information from the separate and combined sequential representations as well as the embeddings of global features.

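The regressor is one linear layer with ReLU Krizhevsky et al. (2012) as the activation function. The objective function of the overall deep learning model is the mean absolute percentage error (MAPE), which is a common relative loss function for ETA. The parameters of FMA-ETA are trained through:

$$\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \frac{\lvert y_i - \hat{y}_i \rvert}{y_i}$$

where $\hat{y}_i$ is the estimated travel time for the $i$-th query, $y_i$ is the ground truth travel time, $N$ is the number of samples and $\theta$ denotes all the parameters of FMA-ETA.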
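The MAPE objective can be sketched in a few lines of plain Python. The function name `mape` and the sample travel times are assumptions for illustration. In training, the same quantity would be computed on tensors so that gradients can flow to the model parameters.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is the absolute error relative to the ground truth travel time.
    return sum(abs(t - p) / t for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ground truth and predicted travel times in seconds.
y_true = [600.0, 1200.0]
y_pred = [540.0, 1320.0]
loss = mape(y_true, y_pred)
```

Because each error is divided by the true travel time, a 60-second miss on a 10-minute trip weighs the same as a 120-second miss on a 20-minute trip, which is why MAPE is described as a relative loss.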

3.2 Multi-factor Attention

ETA is a challenging and complex problem because various factors affect the accuracy of prediction, such as the link length and its road condition. Therefore, unlike natural language processing, where one word can be represented by a single embedding vector, different sequential features ought to be treated more specifically for ETA. Our Multi-factor Attention mechanism is proposed to let each sequential factor mine its own patterns and its impact on ETA in a separate subspace, as shown in the upper left corner of Fig. 2. Self-attention follows Vaswani et al. (2017), and we add position encoding, residual connections, layer normalization and dropout after self-attention, also following Vaswani et al. (2017). The combined sequential features also capture the spatial-temporal patterns as a whole through FFN with self-attention.
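The text does not say which form of position encoding is used. As a sketch under the assumption that it is the sinusoidal encoding of Vaswani et al. (2017), the encoding for a link sequence could be generated as follows. The function name and the sizes are illustrative.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal position encoding as a [seq_len][d_model] nested list."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, so the exponent is i / d_model.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# Encoding for a route of 5 links with a 4-dimensional representation.
pe = positional_encoding(seq_len=5, d_model=4)
```

The encoding is added element-wise to the per-link representation so that attention, which is otherwise order-agnostic, can distinguish the first link of a route from the last.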

In Fig. 2, we show the Multi-factor Attention mechanism with three factors. When the number of factors is an arbitrary $n$, the general Multi-factor Attention can be expressed as:

$$\mathrm{MultiFactor}(F_1, \dots, F_n) = \mathrm{Concat}(H_1, \dots, H_n, H_c)$$

$$H_i = \mathrm{SelfAttention}\big(\mathrm{FFN}(F_i; W_i)\big), \qquad H_c = \mathrm{SelfAttention}\big(\mathrm{FFN}(\mathrm{Concat}(F_1, \dots, F_n); W_c)\big)$$

where $F_i$ is the $i$-th factor of the ETA problem, $W_i$ denotes the learned parameters in the FFN layers of the $i$-th factor, and $W_c$ denotes the learned parameters in the FFN layers of the combined features. For our FMA-ETA, the sequential factors are (1) the length of the link, (2) the road condition speed, (3) the corresponding link time and (4) the embedding of the link ID. Hence, we adopt the four-factor version of Multi-factor Attention, i.e., $n = 4$. By concatenating the separate and combined sequence representations, our Multi-factor Attention completes a multi-level and detailed extraction of the spatial-temporal dependencies in sequence data.
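The data flow described above (one FFN plus self-attention branch per factor, one branch for the concatenated factors, then concatenation of all branch outputs) can be sketched in plain Python. This is an illustrative reading of the prose, not the authors' implementation. It uses a single ReLU linear layer as the FFN, random untrained weights, and omits position encoding, residual connections, layer normalization and dropout. All names and sizes are assumptions.

```python
import math
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu_linear(x, w):
    """One-layer stand-in for the FFN: ReLU(x W)."""
    return [[max(0.0, v) for v in row] for row in matmul(x, w)]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    scores = matmul(x, [list(col) for col in zip(*x)])
    weights = []
    for row in scores:
        scaled = [s / math.sqrt(d) for s in row]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]
        z = sum(exps)
        weights.append([e / z for e in exps])
    return matmul(weights, x)


def concat_features(seqs):
    """Concatenate several [T][d_i] sequences along the feature axis."""
    return [sum((list(row) for row in rows), []) for rows in zip(*seqs)]


def multi_factor_attention(factors, factor_weights, combined_weight):
    # One branch per factor, each in its own subspace.
    heads = [self_attention(relu_linear(f, w)) for f, w in zip(factors, factor_weights)]
    # One extra branch over all factors concatenated together.
    combined = concat_features(factors)
    heads.append(self_attention(relu_linear(combined, combined_weight)))
    return concat_features(heads)


def init_weight(rng, d_in, d_out):
    return [[rng.uniform(-0.5, 0.5) for _ in range(d_out)] for _ in range(d_in)]


rng = random.Random(0)
T, hidden = 5, 8
dims = [1, 1, 1, 4]  # link length, speed, link time, link ID embedding
factors = [[[rng.random() for _ in range(d)] for _ in range(T)] for d in dims]
factor_weights = [init_weight(rng, d, hidden) for d in dims]
combined_weight = init_weight(rng, sum(dims), hidden)
out = multi_factor_attention(factors, factor_weights, combined_weight)
```

With four factors and a hidden size of 8, each link ends up with a representation of size (4 + 1) × 8 = 40, which a final FFN could then aggregate together with the global feature embeddings.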

4 Why Multi-factor Attention

In this section, we discuss the motivation for proposing multi-factor attention. RNN-based models achieve good evaluation metrics on the ETA problem. However, they are slow in training and inference, which makes them difficult to apply in practice. FFN is a promising way to accelerate the model, but FFN alone performs poorly in sequence learning and has problems with long-range dependencies. Our proposed multi-factor attention addresses this problem. We analyze and compare the total computational complexity and the number of sequential operations of RNN and FFN with Multi-factor Attention.

As shown in Table 1, multi-factor attention only needs $O(1)$ sequential operations while the RNN requires $O(n)$. As for computational complexity, when the sequence length $n$ is smaller than the feature dimension $d$, our multi-factor attention is faster than the RNN.
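The comparison can be made concrete with rough operation counts. The sketch below assumes the per-layer costs reported by Vaswani et al. (2017), on the order of n²·d for self-attention and n·d² for a recurrent layer, and ignores constant factors. The function names and the chosen sizes are illustrative.

```python
def self_attention_ops(n, d):
    """Order-of-magnitude per-layer cost of self-attention: n^2 * d."""
    return n * n * d


def recurrent_ops(n, d):
    """Order-of-magnitude per-layer cost of a recurrent layer: n * d^2."""
    return n * d * d


# A short route (n < d) and a long route (n > d), with feature dimension 256.
short = (self_attention_ops(50, 256), recurrent_ops(50, 256))
long_ = (self_attention_ops(512, 256), recurrent_ops(512, 256))
```

For the short route, attention needs about 0.64 million operations against about 3.3 million for the recurrent layer. For the long route, the ordering flips on raw operation count, although attention still has no sequential dependency between time steps.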

Self-attention, especially multi-head attention, has achieved good results in sequence learning. Why not multi-head attention? ETA is affected by many different factors, and the traffic state is complex and dynamic. Experiments in Section 5 show that multi-head attention does not perform as well on complex transportation problems such as ETA. Our multi-factor attention focuses on both separate features and combined features. In this way, different subspaces are encouraged to analyze the effect of a particular factor's pattern on ETA. The evaluation metrics show that the model with multi-factor attention performs better than the model with multi-head attention on ETA problems.

Hence, multi-factor attention is more effective than multi-head attention at extracting systematic and comprehensive spatial-temporal patterns. Considering the speed-up brought by FFN, FFN with multi-factor attention has a great advantage for tasks in intelligent transportation systems (ITS). Multi-factor attention is a general method and may also be promising for other time series forecasting tasks.

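| Method | Complexity per Layer | Sequential Operations |
|---|---|---|
| RNN | $O(n \cdot d^2)$ | $O(n)$ |
| Multi-factor Attention | $O(n^2 \cdot d)$ | $O(1)$ |

  • $n$ is the length of the sequence, $d$ is the dimension of features.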

Table 1: The per layer complexity and sequential operations of different methods

5 Results

5.1 Dataset

We evaluate our model on a large-scale real-world floating-car trajectory dataset, Beijing 2018, collected by a ride-hailing platform. It contains hundreds of millions of desensitized trajectories from Beijing taxi drivers, covering more than 4 months in 2018. This dataset covers different types of roads in Beijing urban areas, including local streets and freeways. We filter out abnormal records in Beijing 2018 where the driving time is less than 60 s or the speed exceeds 120 km/h. We divide the dataset into a training set (the first 16 weeks of data), a validation set (the middle 2 weeks of data) and a test set (the last 2 weeks of data).
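The filtering rule and the chronological split can be sketched as below. The example assumes that "speed" means the average speed over the whole trip and that weeks are indexed from 0 over a 20-week span. Both are assumptions, as are the function names.

```python
def is_valid_trip(duration_s, distance_m):
    """Keep trips of at least 60 s whose average speed is at most 120 km/h."""
    if duration_s < 60:
        return False
    speed_kmh = (distance_m / 1000.0) / (duration_s / 3600.0)
    return speed_kmh <= 120.0


def split_for_week(week_index):
    """Chronological split: weeks 0-15 train, 16-17 validation, 18-19 test."""
    if not 0 <= week_index < 20:
        raise ValueError("week_index must be in [0, 20)")
    if week_index < 16:
        return "train"
    if week_index < 18:
        return "valid"
    return "test"


# (duration in seconds, distance in meters): hypothetical trips.
trips = [(30, 500), (600, 10000), (120, 5000)]
kept = [t for t in trips if is_valid_trip(*t)]
```

Splitting by time rather than at random keeps future traffic conditions out of the training set, which matches how an ETA model is used in production.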

5.2 Compared methods

We compared the proposed FMA-ETA with the following competitors:

(1) Route-ETA: a representative traditional non-deep-learning method. It divides the trajectory into several links and intersections. The travel time on the i-th link of the trajectory is calculated by dividing the link's length by the estimated speed on that link. The delay time at the j-th intersection is provided by a real-time traffic monitoring system. The final travel time is the sum of the estimated time spent in each subsection.

(2) WDR(RNN): a deep learning method achieving state-of-the-art performance on the ETA problem. WDR is a joint model consisting of a wide module, a deep module and a recurrent module. It can effectively use the dense features, high-dimensional sparse features and local features of the road sequence in traffic information. Here we use an RNN in the recurrent module.

(3) WDR(LSTM): a variant of WDR(RNN). Here we use an LSTM in the recurrent module of WDR and make no changes to the other parts of WDR.

(4) WD-FFN: a variant of WDR. It uses a deep module to replace the recurrent module. Here we use a multi-layer perceptron network for comparison.

(5) WD-Resnet: a variant of WDR. It uses a deep module to replace the recurrent module. Here we use a residual structure to extract features.

(6) Multi-head attention: a variant of WDR. We use FFN with a multi-head attention mechanism instead of an RNN to extract features from the sequential data.
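The Route-ETA baseline in item (1) amounts to a simple sum, which the following sketch illustrates. The function name, the units (meters, meters per second, seconds) and the sample values are assumptions. In practice the per-link speeds and intersection delays would come from a traffic monitoring system.

```python
def route_eta(links, intersection_delays_s):
    """Sum per-link travel times (length / speed) and intersection delays.

    links: list of (length_m, speed_mps) pairs, one per road segment.
    intersection_delays_s: list of delays in seconds, one per intersection.
    """
    link_time = sum(length / speed for length, speed in links)
    return link_time + sum(intersection_delays_s)


# Hypothetical route: three links and two intersections.
links = [(500.0, 10.0), (1200.0, 15.0), (300.0, 5.0)]
delays = [20.0, 35.0]
eta_s = route_eta(links, delays)
```

Each term carries its own estimation error, so the errors add up along the route. This is the accumulation of local errors that motivates the data-driven methods.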

If this work is accepted, we will open source the codes of proposed deep learning-based model,


5.3 Experimental Settings

In our experiment, all models are written in PyTorch. They are trained and evaluated on a single NVIDIA Tesla P40 GPU. The number of iterations of the deep learning-based method is 3.5 million. We use the method of Back Propagation (BP) to train the deep learning-based methods, and the loss function is the MAPE loss. We choose Adam as the optimizer due to its good performance. The batch size is 256 and the initial learning rate is 0.0002.

5.4 Evaluation Metrics

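To evaluate and compare the performance of FMA-ETA and the baselines, we use three evaluation metrics: Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE):

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}$$

where $\hat{y}_i$ is the predicted travel time, and $y_i$ is the ground truth travel time. The calculation of MAPE is given in Section 3.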
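A small worked example of MAE and RMSE follows. The function names and the sample values are illustrative assumptions.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the inputs (seconds here)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error, which penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical ground truth and predicted travel times in seconds.
y_true = [600.0, 1200.0, 300.0]
y_pred = [540.0, 1320.0, 330.0]
mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

The absolute errors are 60, 120 and 30 seconds, giving an MAE of 70 seconds. The RMSE is the square root of 6300, roughly 79.4 seconds, and is larger because the 120-second miss dominates after squaring.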

5.5 Experimental Results and Analysis

| Method | MAE (sec) | RMSE (sec) | MAPE (%) | Latency (ms) * |
|---|---|---|---|---|
| Route-ETA | 69.008 | 106.966 | 25.010 | 0.179 |
| WD-FFN | 57.797 | 93.588 | 21.106 | 0.344 |
| WD-Resnet | 57.064 | 92.241 | 21.015 | 0.454 |
| WDR(RNN) | 55.284 | 90.836 | 19.677 | 1.107 |
| WDR(LSTM) | 55.227 | 90.480 | 19.598 | 1.109 |
| Multi-head attention | 55.145 | 90.101 | 19.678 | 0.635 |
| FMA-ETA (ours) | 54.642 | 88.794 | 19.618 | 0.866 |

  • * Latency is the average inference time of the models.

Table 2: The results of different methods

Table 2 reports the three common evaluation metrics for ETA problems together with the inference latency. Our FMA-ETA outperforms all competitors in terms of the MAE and RMSE metrics and achieves results similar to the state-of-the-art method WDR(LSTM) in terms of the MAPE metric. The detailed analysis of the experimental results is as follows.

(1) The representative non-deep-learning method, Route-ETA, performs worse than the deep learning based methods. This shows that the data-driven approach is more effective than the route-based approach. Deep models are suitable for modeling a complex transportation system given massive spatio-temporal data.

(2) Models with a recurrent module perform better than models that only use deep modules without an attention mechanism: WDR(RNN) and WDR(LSTM) achieve better results than WD-FFN and WD-Resnet. WDR(LSTM) performs best on the MAPE metric, because gated units can alleviate the problem of long-term dependencies to a certain extent. The deep modules with attention achieve better results than WDR on the MAE and RMSE metrics, which suggests that the attention mechanism helps to extract features and handle long-range dependencies in long sequences.

(3) Our FMA-ETA performs best on the MAE and RMSE metrics, which indicates that our method is well suited to ETA problems. FMA-ETA outperforms WDR(LSTM) by 1.05% in terms of MAE and 1.86% in terms of RMSE. On the MAPE metric, FMA-ETA is comparable to WDR(LSTM) and only about 0.1% worse in relative terms. Considering the three evaluation metrics together, FMA-ETA performs best on ETA problems.

(4) As can be seen in the "Latency" column of Table 2, FMA-ETA speeds up inference compared with WDR(LSTM), reducing the average latency from 1.109 ms to 0.866 ms. Route-ETA has the shortest latency of 0.179 ms, but its performance on the evaluation metrics is poor. FFN-based methods without an attention mechanism are fast, but they incur a large loss on the evaluation metrics. The model with multi-head attention is faster than FMA-ETA, but its accuracy is worse than that of FMA-ETA on all three metrics. If ETA models are not accurate enough, many downstream tasks in intelligent transportation systems, such as route planning, navigation and vehicle dispatching, will be greatly affected. Therefore, we should increase the inference speed of the model while preserving accuracy. Among the compared methods, only FMA-ETA reaches this goal.
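The relative differences between FMA-ETA and WDR(LSTM) can be recomputed directly from the values in Table 2. The helper name `relative_reduction` is an assumption. A positive result means FMA-ETA is lower (better) than WDR(LSTM).

```python
def relative_reduction(baseline, new):
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100.0


# (WDR(LSTM), FMA-ETA) pairs taken from Table 2.
table2 = {
    "MAE": (55.227, 54.642),
    "RMSE": (90.480, 88.794),
    "MAPE": (19.598, 19.618),
    "Latency": (1.109, 0.866),
}
gains = {name: relative_reduction(lstm, fma) for name, (lstm, fma) in table2.items()}
```

The MAE and RMSE reductions agree with the figures quoted in item (3) up to rounding, the MAPE entry is slightly negative (FMA-ETA is about 0.1% worse), and the latency entry shows a reduction of roughly 22%.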

FMA-ETA performs well on the ETA problem and clearly improves inference speed compared with the state-of-the-art method WDR(LSTM). FMA-ETA achieves clear improvements over WDR(LSTM) in terms of the MAE and RMSE metrics, with a comparable MAPE. Taking both the evaluation metrics and speed into account, our method is the most suitable of the compared methods for ETA problems.

5.6 Speed comparison of different methods

Figure 3: The inference time of WDR(LSTM) and FMA-ETA.

As analyzed above, the state-of-the-art method in previous literature, WDR(LSTM), takes a long time for inference. This makes WDR(LSTM) hard to apply in real-time traffic systems. FFN-based models such as WD-FFN and WD-Resnet can greatly accelerate inference for the ETA problem, but they cause a large decrease in accuracy, as can be seen in Table 2. The attention mechanism can help FFN extract sequence features effectively. The existing multi-head attention improves the inference speed, but it still loses accuracy relative to FMA-ETA. Our goal is to increase the inference speed of the ETA model while ensuring that the evaluation metrics do not degrade. From the average inference latency in Table 2, only FMA-ETA achieves this goal. We further explore the inference speed of WDR(LSTM) and FMA-ETA at different sequence lengths. We randomly draw 50 samples at each sequence length for WDR(LSTM) and FMA-ETA, then plot them as scatter points in Figure 3. The curves in Figure 3 are obtained by logarithmic fitting of the sampled points.
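The text does not specify the exact form of the logarithmic fit. One common reading is a least-squares fit of the model t = a + b·ln(n), where n is the sequence length and t the inference time. The sketch below implements that reading on synthetic, noise-free data. The function name, the model form and the data are all assumptions, not the measurements behind Figure 3.

```python
import math


def fit_log_curve(xs, ys):
    """Least-squares fit of y = a + b * ln(x); returns (a, b)."""
    u = [math.log(x) for x in xs]
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(ys) / n
    sxx = sum((ui - mean_u) ** 2 for ui in u)
    sxy = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, ys))
    b = sxy / sxx
    a = mean_y - b * mean_u
    return a, b


# Synthetic data generated from a known curve, so the fit should recover it.
lengths = [10, 50, 100, 200, 400]
times_ms = [0.2 + 0.15 * math.log(n) for n in lengths]
a, b = fit_log_curve(lengths, times_ms)
```

Because the model is linear in ln(n), ordinary linear regression on the transformed lengths is sufficient and no iterative optimizer is needed.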

As illustrated by the figure, FMA-ETA is clearly faster than WDR(LSTM) when the sequence length is larger than 180. The LSTM-based model is fast on short sequences, but its inference time increases rapidly as the sequence becomes longer. In real ride data, long sequences are common, so FMA-ETA is more applicable to practical problems.

6 Conclusion and Future Work

In this paper, to the best of our knowledge, we are the first to estimate travel time entirely based on FFN with attention, by presenting FMA-ETA. This idea is novel and quite different from the RNN based methods that have been the state of the art since 2018. Furthermore, we propose a novel Multi-factor self-attention mechanism for FFN to better mine sequence semantic information for ETA, which is affected by various factors. Through extensive experiments on a massive real-world dataset from a well-known intelligent travel platform, we conclude that FMA-ETA achieves slight improvements over other state-of-the-art methods in prediction precision. Most importantly, our method significantly speeds up the inference process compared with RNN based methods. The Multi-factor self-attention mechanism is also shown by experiments to be superior, for ETA, to the popular Multi-head self-attention that was proposed for natural language processing. Future efforts will be made to adopt our Multi-factor self-attention for other sequence learning tasks that are also affected by many complex factors. Besides, we plan to conduct a series of online tests for FMA-ETA and decide whether this promising deep learning framework can be adopted for large-scale practical application.

Broader Impact

We present the statement of the broader impact of our paper as follows:

a) This research is beneficial for many other tasks in ITS, such as route planning, navigation and vehicle dispatching. FMA-ETA, which is a significantly faster framework for ETA with competitive accuracy, benefits ride-hailing platforms such as DiDi and Uber by providing better user experiences. Furthermore, our method promotes the long-term development of ITS and of spatial-temporal sequential prediction;

b) We are convinced that nobody will be put at disadvantage from our work. On the contrary, our research indirectly makes it more convenient for many people to travel and helps environmental protection;

c) Our framework is a candidate for online application, which reflects the practical value of our model. If our model is selected for practical use, the ride-hailing platform will also put it through tests from multiple angles in order to avoid economic loss in case our method fails;

d) We ensure that the method does not leverage any biases in the data. The experiments are carried out on the large-scale real-world vehicle travel dataset. The data includes more than 500 million trajectories and covers almost all road types. Therefore, the distribution tends to be that of real world.


  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.2.
  • [2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155. Cited by: §3.1.
  • [3] H. Chen, H. A. Rakha, and C. C. McGhee (2013) Dynamic travel time prediction using pattern recognition. In 20th World Congress on Intelligent Transportation Systems, Cited by: §1, §1, §2.1.
  • [4] J. L. Elman (1990) Finding structure in time. Cognitive science 14 (2), pp. 179–211. Cited by: §1.
  • [5] K. Fu, F. Meng, J. Ye, and Z. Wang (2020) CompactETA: a fast inference system for travel time prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Cited by: §2.1.
  • [6] S. Hochreiter and J. Schmidhuber (1997) LSTM can solve hard long time lag problems. In Advances in neural information processing systems, pp. 473–479. Cited by: §1, §2.1.
  • [7] A. Hofleitner, R. Herring, P. Abbeel, and A. Bayen (2012) Learning the dynamics of arterial traffic from probe data using a dynamic bayesian network. IEEE Transactions on Intelligent Transportation Systems 13 (4), pp. 1679–1693. Cited by: §1, §1, §2.1.
  • [8] J. J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences 79 (8), pp. 2554–2558. Cited by: §1.
  • [9] M. I. Jordan (1997) Serial order: a parallel distributed processing approach. In Advances in psychology, Vol. 121, pp. 471–495. Cited by: §1.
  • [10] W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pp. 197–206. Cited by: §2.2.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §3.1.
  • [12] W. Lan, Y. Xu, and B. Zhao (2019) Travel time estimation without road networks: an urban morphological layout representation approach. arXiv preprint arXiv:1907.03381. Cited by: §2.1.
  • [13] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • [14] Y. Li, K. Fu, Z. Wang, C. Shahabi, J. Ye, and Y. Liu (2018) Multi-task representation learning for travel time estimation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1695–1704. Cited by: §1, §1, §2.1.
  • [15] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §2.2.
  • [16] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2019) Fastspeech: fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pp. 3165–3174. Cited by: §2.2.
  • [17] J. Salazar, K. Kirchhoff, and Z. Huang (2019) Self-attention networks for connectionist temporal classification in speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7115–7119. Cited by: §2.2.
  • [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: 2nd item, §2.2, §3.2.
  • [19] D. Wang, J. Zhang, W. Cao, J. Li, and Y. Zheng (2018) When will you arrive? estimating travel time based on deep neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §1, §2.1.
  • [20] H. Wang, X. Tang, Y. Kuo, D. Kifer, and Z. Li (2019) A simple baseline for travel time estimation using large-scale trip data. ACM Transactions on Intelligent Systems and Technology (TIST) 10 (2), pp. 1–22. Cited by: §2.1.
  • [21] Y. Wang, Y. Zheng, and Y. Xue (2014) Travel time estimation of a path using sparse trajectories. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 25–34. Cited by: §1, §1.
  • [22] Z. Wang, K. Fu, and J. Ye (2018) Learning to estimate the travel time. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 858–866. Cited by: §1, §1, §2.1, §3.1, §3.
  • [23] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.2.
  • [24] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) Qanet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541. Cited by: §2.2.
  • [25] J. Yuan, Y. Zheng, X. Xie, and G. Sun (2011) T-drive: enhancing driving directions with taxi drivers’ intelligence. IEEE Transactions on Knowledge and Data Engineering 25 (1), pp. 220–232. Cited by: §2.1.
  • [26] X. Zhan, S. Hasan, S. V. Ukkusuri, and C. Kamga (2013) Urban link travel time estimation using large-scale taxi data with partial information. Transportation Research Part C: Emerging Technologies 33, pp. 37–49. Cited by: §2.1.
  • [27] F. Zhang, X. Zhu, T. Hu, W. Guo, C. Chen, and L. Liu (2016) Urban link travel time prediction based on a gradient boosting method considering spatiotemporal correlations. ISPRS International Journal of Geo-Information 5 (11), pp. 201. Cited by: §1, §1.
  • [28] H. Zhang, H. Wu, W. Sun, and B. Zheng (2018) Deeptravel: a neural network based travel time estimation model with auxiliary supervision. arXiv preprint arXiv:1802.02147. Cited by: §1, §2.1.
  • [29] Z. Zhu, W. Wu, W. Zou, and J. Yan (2018) End-to-end flow correlation tracking with spatial-temporal attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 548–557. Cited by: §2.2.