UST: Unifying Spatio-Temporal Context for Trajectory Prediction in Autonomous Driving

05/06/2020 ∙ by Hao He, et al.

Trajectory prediction has always been a challenging problem for autonomous driving, since it requires inferring latent intentions from the behaviors and interactions of traffic participants. The problem is intrinsically hard because each participant may behave differently under different environments and interactions. The key is to effectively model the interlaced influence of both spatial context and temporal context. Existing work usually encodes these two types of context separately, which leads to inferior modeling of the scenarios. In this paper, we first propose a unified approach that treats the time and space dimensions equally when modeling spatio-temporal context. The proposed module is simple and can be implemented in several lines of code. In contrast to existing methods, which rely heavily on recurrent neural networks for temporal context and hand-crafted structures for spatial context, our method automatically partitions the spatio-temporal space to adapt to the data. Lastly, we test our proposed framework on two recently proposed trajectory prediction datasets, ApolloScape and Argoverse. We show that the proposed method substantially outperforms previous state-of-the-art methods while maintaining its simplicity. These encouraging results further validate the superiority of our approach.




I Introduction

With the rapid development of deep learning techniques in recent years, the perception systems of autonomous vehicles have advanced significantly. However, the difficulty of another equally important task, predicting the future trajectories of traffic participants in real-world scenarios, is still underestimated, since trajectory prediction requires understanding the latent intentions behind the behaviors and interactions of participants.

The key to this challenging task is to model the complicated spatio-temporal social context and to tolerate the imperfect output of perception systems. For the first challenge, the difficulty is that we can only infer intention from indirect observations. In a given circumstance, different drivers or pedestrians may make distinct decisions. Moreover, even the behavior of a single driver or pedestrian may be easily affected by diverse interactions at different places and different times (see Fig. 1(a)). For the second challenge, we cannot expect an oracle perception system, due to occlusion and the limited range of sensors. Common mistakes include trajectory interruption, unstable speed estimation, etc. Existing works typically make the strong assumption that each neighbor has a fixed-length trajectory, and depend heavily on an accurate speed feature. Some works even eliminate agents that have incomplete history [13, 9]. However, these are exactly the scenarios we commonly meet in real autonomous driving systems.

(a) Illustration
(b) Traditional Representation
(c) Unified 3D Representation
Fig. 1: Illustration and representations of the trajectory prediction task. Blue, green, red colors show trajectories for vehicles, bicycles, pedestrians respectively. (b) shows the common representation, which represents the surrounding agents as sequences of positions in 2D spatial space. (c) shows our proposed trajectory representation in a unified spatio-temporal space.

Most learning-based trajectory prediction methods fit the following encoder-decoder framework: they consist of a spatio-temporal context encoder and a future trajectory decoder. In the first module, we want to utilize all the available history information to model the intention of agents. The information we can use includes the history trajectory of each single agent (temporal context) and the interaction between agents in a single time step (spatial context). Other metadata such as maps or traffic rules can also be incorporated. In the second module, given the encoded context, we need to generate the future trajectory of the agent. The future trajectory can be represented by a deterministic path [32, 22], a path with uncertainty estimation [1, 29] or several sampled paths [13, 33].

Most existing spatio-temporal context encoders encode spatial context and temporal context separately. They typically extract each agent's temporal context with a recurrent neural network (RNN). After that, a spatial context extractor (e.g., convolution [9], attention [26], pooling [13], graph network [18]) is applied to further aggregate the spatial context. The spatial information between agents is considered only at the last time step, so the interactions among agents in previous time steps are all discarded. To improve upon these methods, recent works [29, 16] have proposed to leverage spatio-temporal graphs. The basic idea of these methods is to extract spatial context at each time step, and then fuse it into the individual temporal features for the next time step's aggregation. However, they essentially still extract spatio-temporal context in a cascade style (first temporal, then spatial). On one hand, extracting spatial context at each time step can be time-consuming (reported in [13] as 16x slower); on the other hand, it cannot model complex spatial context across different time steps. There is also an intrinsic problem with RNN-based temporal context extractors: when data are missing, it is necessary either to eliminate the whole history or to pad it with interpolations. All these drawbacks call for a unified spatio-temporal modeling method for trajectory prediction.

In this paper, we propose a novel method to address the above challenges by Unifying the Spatial and Temporal context (UST) into one single representation. The core idea of UST is to jointly represent the spatio-temporal context in a higher dimensional space, which does not distinguish these two concepts explicitly. The status of a certain agent at a certain time is then represented as a point in this space (see Fig. 1(c) for an illustration). Other metadata can also be easily incorporated. As a result, the spatio-temporal context becomes the distribution of these unordered points in the new space. Next, inspired by the classical work PointNet [28], we devise a novel encoder structure for joint context modeling. Based on the extracted context, various off-the-shelf decoders can be used to serve the different purposes of subsequent modules. To summarize, our main contributions are as follows:

  • We first propose to unify 2D location and discrete time space into one single 3D space. We treat them equally.

  • Based on this representation, we devise a simple and effective network to encode the spatio-temporal context. The method can be easily implemented within ten lines of code.

  • Experiments on three major trajectory prediction datasets demonstrate the effectiveness of UST.

II Related Work

In this section, we give a brief review of the history of and recent research on the trajectory prediction problem.

The idea of predicting 2D future trajectories using spatio-temporal context dates back to the work of Helbing and Molnar [14], who proposed a model with attractive and repulsive Social Force. It has achieved remarkable success in robotics [24, 12] and activity understanding [30, 5]. Recently, many researchers have extended these interaction-aware trajectory prediction methods to a variety of areas with advanced deep learning techniques [1, 13, 32, 33], including pedestrian trajectory prediction [1], vehicle trajectory prediction [9], intention prediction [10] and heterogeneous prediction [25]. Although these works address various tasks with different approaches, most deep-learning-based methods fit the framework of a spatio-temporal context encoder followed by a future trajectory decoder.

Spatio-temporal Context Encoder

The raw data of the trajectory prediction task consist of both spatial and temporal information and various metadata. They cannot be directly processed by off-the-shelf neural networks. Popular representation forms include sequences of points [13, 29, 16], occupancy grids [17, 9], displacement volumes [31, 23] and rasterized images [7, 11]. Directly representing a trajectory as a sequence of 2D locations is the most natural way; however, its unstructured nature makes context extraction hard. Although all other representations preserve the structural information, their structures are hand-crafted and thus need deliberate tuning, which makes generalization across different scenarios difficult. Moreover, these hand-crafted structures are sensitive to the quality of the perception output. If the output quality of the upstream modules changes, the structure may need to be redesigned. Consequently, how to adaptively build the structure from data remains an open problem.

Based on these representations, researchers have devised various networks to extract the spatio-temporal context. However, all existing works extract the temporal context and spatial context individually, and aggregate them in a cascade way. For the temporal context, the Recurrent Neural Network (RNN) is the most widely used. There are also some algorithms that use a Convolutional Neural Network (CNN) to encode temporal context [27, 22]. The extraction of the spatial context (a.k.a. "social context") has remained a hot research topic in recent years. Various methods including convolution [9], pooling [1, 13], attention networks [29], graph neural networks [16] and relation networks [4] have been used in the past few years. Instead of manually building the spatial partition and then encoding the temporal context and spatial context separately, our proposed UST unifies the spatio-temporal space and learns to partition this joint space end-to-end.

Future Trajectory Generation

The task of the generator or decoder is to generate the future positions of the target agent based on both the ego information and the encoded context information. The most common choice is to use a recurrent neural network (RNN) to regress the future trajectory directly [16, 32]. There are also other RNN-based variants that accommodate other demands from downstream modules. To name a few, Gupta et al. [13] incorporated noise into the RNN decoder and trained it with a discriminator and a variety loss to generate diverse, socially acceptable trajectories. Deo et al. [9] first classified maneuvers and then used the classification result to construct a Gaussian mixture model for multi-modal trajectory prediction. Chai et al. [2] classified trajectories to pre-clustered anchors instead of predefined maneuvers. In this work, the decoder is not our focus, so we directly use off-the-shelf decoders in our proposed method.

III Method

This section gives details of our proposed method. Subsection III-A formalizes the trajectory prediction problem. Subsection III-B details how to encode the spatio-temporal context in a unified framework. Subsection III-C elaborates how we use the encoded spatio-temporal context representation to generate future trajectories.

III-A Problem Definition

Assume a scene that contains multiple agents observed over a sequence of time steps, where the raw 2D spatial location of each agent at each time step is expressed with respect to a predefined reference frame. The trajectory of a single agent is split into its history locations, which cover the observed time steps, and its future locations, which cover the time steps to be predicted. Additional information such as traffic-agent type, map information, taillight, heading, etc. may also be available. The goal of trajectory prediction is to utilize the histories of all agents together with this additional information to accurately predict the future locations.

III-B Spatio-temporal Context Encoder

In this subsection, we elaborate on the details of our spatio-temporal context encoder. We first introduce the input representation of our method, followed by the encoder structure.

III-B1 Input Representation

For the sake of simplicity and flexibility, we design spatio-temporal point sets to represent the raw input (the sequences of locations of multiple agents). Each element of the set is one snapshot of one agent and contains the following fields: the 2D location and 2D velocity of the agent in the current reference frame; some intrinsic properties of the agent, such as its type; the time step of the snapshot; and a binary flag, which is 0 for neighbor agents and 1 for the target agent to be predicted. We do not distinguish between different neighbor agents, and treat them equally.

After applying the above input representation module, we can treat the snapshot of the status of an agent at a given time step as a single point with metadata in a 3D space spanned by 2D location and time. An intuitive illustration is depicted in Fig. 1(c). Note that all the information we need to model the social interaction of the target agent is present in this point set. One desirable property of this representation is that it is invariant to the order of the points in it and robust to missing data. With this uniform representation, we unify space and time into one representation, which eases the subsequent context modeling task.
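To make the representation concrete, here is a minimal Python sketch of how raw tracks could be flattened into such a point set. The names `Snapshot` and `build_point_set`, the integer coding of agent types, and the zero velocity assigned to an agent's first observation are illustrative assumptions rather than details given in the text; velocity is taken as a finite difference of consecutive positions.

```python
from typing import Dict, List, NamedTuple, Tuple


class Snapshot(NamedTuple):
    x: float          # 2D location in the current reference frame
    y: float
    vx: float         # 2D velocity (finite difference, an assumption)
    vy: float
    agent_type: int   # intrinsic property; integer coding is hypothetical
    t: float          # time step of the snapshot
    is_target: int    # 1 for the agent to be predicted, 0 for neighbors


Track = List[Tuple[float, float, float]]  # (t, x, y); gaps are allowed


def build_point_set(tracks: Dict[int, Track], types: Dict[int, int],
                    target_id: int) -> List[Snapshot]:
    """Flatten per-agent tracks into one unordered set of snapshots."""
    points: List[Snapshot] = []
    for agent_id, track in tracks.items():
        ordered = sorted(track)  # sort by time; times must be distinct
        for k, (t, x, y) in enumerate(ordered):
            if k == 0:
                vx = vy = 0.0  # no earlier observation: assume zero velocity
            else:
                t0, x0, y0 = ordered[k - 1]
                vx, vy = (x - x0) / (t - t0), (y - y0) / (t - t0)
            points.append(Snapshot(x, y, vx, vy, types[agent_id], t,
                                   int(agent_id == target_id)))
    return points


tracks = {
    7: [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 2.0, 0.0)],  # target agent
    3: [(0.0, 5.0, 1.0), (2.0, 9.0, 1.0)],  # neighbor, t = 1 is missing
}
points = build_point_set(tracks, types={7: 0, 3: 0}, target_id=7)
```

In this sketch a neighbor with a missing observation simply contributes fewer points; no padding or interpolation is required.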

III-B2 Unified Spatio-Temporal Context Extraction

To deal with such unordered and variable-length data, the structure and operations of the feature extractor should be deliberately designed to fit the nature of the data. Fortunately, the same challenge has been met in the area of point cloud processing. Inspired by PointNet [28], we propose the following key components for context extraction.


The goal of the embedding network is to map each snapshot point into a hidden representation in which the spatial context and temporal context are unified. In our implementation, the embedding network is a multi-layer perceptron (MLP) parameterized by the embedding weights. Batch normalization is applied over all layers, with ReLU activation functions. Although there are several works that also apply an MLP to time series prediction [19], they treat the whole time series as fixed-length chronological data instead of a set of permutation-invariant snapshots.

Permutation Invariant Aggregator

After embedding each snapshot, we need a permutation-invariant aggregator to form the global context feature. PointNet [28] has shown that a simple pooling operator is capable of this task. It is the simplest symmetric function that enjoys this property. By default, we use max pooling as the aggregator.

Each dimension in the global feature corresponds to a partition of the feature space, and each of them represents one configuration of spatio-temporal context. This partition is learnt end-to-end from the data. To give an intuitive understanding of these operations, we visualize several examples in Fig. 2 using the NGSIM [6] dataset.

We randomly choose two dimensions from the global context feature, and find the most influential positions in the 3D space for each. We then pick out some typical cases and present their activation in this dimension in the figures. For the case in Fig. 2(a), most points are spread over the left front and right front areas. Observing the scenarios in Fig. 2(b) - Fig. 2(e), we find that scenarios with more agents appearing in the left front and right front have larger values in the corresponding dimension of the global feature. Fig. 2(f) - Fig. 2(j) demonstrate another case, in which the agent should focus on the past status of the front vehicle. Examining the cases with high activation, we find that this dimension corresponds to the deceleration of the front vehicle. The partition of the spatio-temporal space learned by our proposed method thus coincides with human interpretation.

(a) Partition for left front and right front area
(b) 8.62
(c) 6.21
(d) 1.38
(e) 0.34
(f) Partition for deceleration of front vehicle
(g) 7.53
(h) 6.68
(i) 3.57
(j) 1.02
Fig. 2: (a) and (f) show activation patterns of two typical neurons in the pooled spatio-temporal features. The number in other subfigures indicates the value of activation of this neuron of the case.

Recursive Refinement

Although this aggregator is capable of aggregating the overall environment around the target agent, the pooling operator is still a first-order aggregator. It fails to model second-order information such as interactions between agents. As a simple solution, we concatenate the global context feature to every individual feature, and recursively apply the aforementioned steps. In the second step, the embedding is aware of both the status of each individual agent and the global context, and can therefore capture the interactions.

We summarize the pseudocode of our encoder structure in Alg. 1. It can be easily implemented with several lines of code.


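        def embedding(x, n_layers):
            # x: [N, K, Q] => [N, K, shape of weight]
            for layer in range(n_layers):
                x = Linear(x)
                x = BatchNorm(x)
                x = ReLU(x)
            return x

        def STPooling(x, n_layers, n_steps):
            # x: [N, K, Q], snapshot's dimension Q,
            # context: [N, K]
            for step in range(n_steps):
                x = embedding(x, n_layers)
                context = pooling(x)     # global context over all snapshots
                if step < n_steps - 1:
                    x = cat(x, context)  # attach the context to every snapshot
            return context

 pooling: max pooling; cat: concatenation.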

Algorithm 1

Pseudocode of Spatio-temporal pooling in a PyTorch-like style.
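The embed, pool and refine steps described above can be illustrated with a small dependency-free Python sketch. This is not the authors' implementation: the helper names (`make_embedding`, `max_pool`, `st_pooling`) are hypothetical, the single-layer embedding uses fixed random weights instead of learned ones, batch normalization is omitted, and plain lists stand in for tensors. It only aims to show why the resulting context is independent of the order and number of snapshots.

```python
import random
from typing import Callable, List

Vector = List[float]


def make_embedding(in_dim: int, out_dim: int, seed: int) -> Callable[[Vector], Vector]:
    """One linear layer plus ReLU with fixed random weights (illustration only)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [rng.uniform(-1.0, 1.0) for _ in range(out_dim)]

    def embed(point: Vector) -> Vector:
        return [max(0.0, sum(w * v for w, v in zip(row, point)) + b)
                for row, b in zip(weight, bias)]

    return embed


def max_pool(features: List[Vector]) -> Vector:
    """Symmetric aggregator: element-wise maximum over all snapshots."""
    return [max(column) for column in zip(*features)]


def st_pooling(points: List[Vector], hidden: int = 8, n_steps: int = 2) -> Vector:
    """Embed every snapshot, pool, then refine with the global context."""
    x = points
    context: Vector = []
    for step in range(n_steps):
        embed = make_embedding(len(x[0]), hidden, seed=step)
        h = [embed(p) for p in x]           # the same embedding for every snapshot
        context = max_pool(h)               # global context feature
        x = [feat + context for feat in h]  # concatenate context to each snapshot
    return context


pts = [
    [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0],  # (x, y, vx, vy, type, t, is_target)
    [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0],
    [5.0, 1.0, 2.0, 0.0, 0.0, 0.0, 0.0],
]
context = st_pooling(pts)
```

Because every snapshot passes through the same embedding and the maximum is a symmetric function, shuffling `pts` leaves `context` unchanged, and a scene with fewer snapshots still yields a context vector of the same size.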

III-C Future Trajectory Decoder

The decoder should be designed to satisfy various demands from downstream modules such as planning. We believe that when the spatio-temporal context is well modeled, an off-the-shelf decoder can still achieve superior results. As a default option, we feed the encoded spatio-temporal feature into a standard LSTM. At each time step, we minimize the loss between the predicted 2D positions and the ground truth. If desired, it is easy to extend this to a stochastic decoder: we simply inject zero-mean Gaussian noise into the encoded feature in each decoding process, and use the variety loss as supervision, as in [13, 33, 16].

In Section IV, extensive experiments show that our method achieves state-of-the-art performance in both deterministic and stochastic prediction tasks despite the simplicity of our decoder structure.

III-D Discussion

In this section, we highlight several key advantages of our proposed method, and compare it with other state-of-the-art methods.

RNN free

First, we do not rely on a fixed-length history trajectory, so we can better handle missing data caused by false negatives in detection or by tracker interruption. In contrast, current RNN-based methods [13, 9] have to eliminate the incomplete history of an agent, because an RNN cannot differentiate the exact time steps of the remaining trajectory. This reduces the amount of effective context available when modeling the behavior of the target agent. Second, an RNN-based encoder for varying-length trajectories is hard to parallelize in batches. Padding with dummy values can be a solution, but it introduces noise and affects performance. Besides, an RNN-based encoder does not work when the time gap within the history sequence is not fixed, which is usually the case in real-world autonomous driving scenarios because of the latency of the upstream fusion and perception modules. In contrast, the time dimension in our input representation can be a floating-point value, which is much more flexible.

Comparisons with other social pooling methods

Although there are several works that also use a pooling module to model social interaction [1], their methods and motivation are different from ours. Their method needs to partition the space manually and compute additional occupancy grids for each agent, while ours has already embedded and partitioned both space and time adaptively thanks to the unified spatio-temporal representation. Therefore, a simple global pooling is enough.

The importance of modeling cross time step social interaction

People need reaction time to handle sudden events on the road. For example, there is a delay before a driver can step on the brakes when the car in front slows down. In other words, an agent's behavior may be affected by the actions of other agents in previous time steps. Thus, extracting the cross-time-step interactions among agents becomes necessary.

IV Experiment

We compare our proposed method with other state-of-the-art methods on two recently published trajectory datasets: ApolloScape [15, 25] and Argoverse [3]. Both datasets are collected from the first-person perspective by sensor-equipped acquisition cars in real-world driving scenarios, but they focus on different problems in the trajectory prediction field. We use the official metrics for these datasets and compare our proposed UST with the state of the art. In addition, we construct variants of our method for an ablation study and qualitative analysis to show how UST can model complex spatio-temporal context in the high dimensional space.

IV-A Dataset and Experiment Setting

In this section, we detail the datasets as well as the corresponding experiment settings.

Methods              (a) ADE (3s)                               (b) FDE (3s)
                     vehicles  pedestrians  bicycles  weighted  vehicles  pedestrians  bicycles  weighted
StarNet [34]         2.38      0.78         1.86      1.34      4.28      1.51         3.46      2.49
TrafficPredict [25]  7.94      7.18         12.88     8.58      12.77     11.12        22.79     24.22
LSTM                 2.88      0.94         2.09      1.58      5.25      1.84         3.87      2.97
CV                   2.59      0.81         2.17      1.47      4.64      1.58         4.02      2.73
UST                  2.10      0.75         1.77      1.24      3.65      1.44         3.14      2.25
TABLE I: Performance on the ApolloScape trajectory dataset. All results are recorded from the ApolloScape public leaderboard. Our method currently ranks first among all public methods. (a) shows the average distance error (ADE) of different algorithms on diverse traffic agents. (b) shows the final distance error (FDE) at 3 seconds.
ApolloScape [15]

The ApolloScape dataset is recorded in urban streets by various sensors (e.g. LiDAR, radar, camera). This dataset provides 3 seconds of trajectory history and aims to predict the next 3 seconds at 0.5 s intervals. The scenes consist of heterogeneous traffic agents, including pedestrians, vehicles and bicycles, so the spatio-temporal context is much more complex to model. To give a fair comparison on this heterogeneous traffic-agent dataset, we follow the same experiment setting as the ApolloScape challenge. We submit our result to the challenge leaderboard, and compare it with other submitted methods. To measure the performance of the algorithms, we report the prediction error for each type of traffic agent. The main metrics used for this dataset are Average Distance Error (ADE) and Final Distance Error (FDE).

  • ADE: mean Euclidean distance between predicted coordinates and the ground truth over all time steps.

  • FDE: Euclidean distance between the predicted coordinates and the ground truth at the final prediction timestep.
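The two definitions above translate directly into a few lines of Python. This is a minimal sketch for a single trajectory; the function names `ade` and `fde` and the list-of-tuples layout are illustrative choices, not an official evaluation script.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def ade(pred: List[Point], gt: List[Point]) -> float:
    """Mean Euclidean distance over all predicted time steps."""
    return sum(math.hypot(p[0] - g[0], p[1] - g[1])
               for p, g in zip(pred, gt)) / len(gt)


def fde(pred: List[Point], gt: List[Point]) -> float:
    """Euclidean distance at the final predicted time step."""
    return math.hypot(pred[-1][0] - gt[-1][0], pred[-1][1] - gt[-1][1])


pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
ade_value = ade(pred, gt)  # (0 + 5) / 2
fde_value = fde(pred, gt)  # distance at the last step
```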

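Based on these metrics, we report a weighted sum of ADE (WSADE) and a weighted sum of FDE (WSFDE), assigning the coefficients 0.58, 0.20 and 0.22 to pedestrians, vehicles and bicycles respectively, as in the ApolloScape Challenge:

$$\mathrm{WSADE} = 0.20 \cdot \mathrm{ADE}_{\text{vehicles}} + 0.58 \cdot \mathrm{ADE}_{\text{pedestrians}} + 0.22 \cdot \mathrm{ADE}_{\text{bicycles}}$$

WSFDE is defined analogously, with FDE in place of ADE. WSADE is also the metric by which the ApolloScape challenge is ranked. Because of the difference in intrinsic behavior between pedestrians and vehicles, we train a separate model for pedestrians in which the pedestrians' weight in the loss function is increased.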
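As a quick check of the weighted metrics, the sketch below recomputes the weighted columns of Table I from the per-type errors. The coefficients 0.20 (vehicles), 0.58 (pedestrians) and 0.22 (bicycles) are assumed here; they reproduce the reported weighted values to within rounding. The names `WEIGHTS` and `weighted_sum` are illustrative.

```python
from typing import Dict

# Assumed ApolloScape challenge coefficients per traffic-agent type.
WEIGHTS: Dict[str, float] = {"vehicles": 0.20, "pedestrians": 0.58, "bicycles": 0.22}


def weighted_sum(errors: Dict[str, float]) -> float:
    """Weighted sum of per-type errors (WSADE or WSFDE)."""
    return sum(WEIGHTS[kind] * errors[kind] for kind in WEIGHTS)


# Per-type ADE and FDE values taken from the UST row of Table I.
ust_wsade = weighted_sum({"vehicles": 2.10, "pedestrians": 0.75, "bicycles": 1.77})
ust_wsfde = weighted_sum({"vehicles": 3.65, "pedestrians": 1.44, "bicycles": 3.14})
starnet_wsade = weighted_sum({"vehicles": 2.38, "pedestrians": 0.78, "bicycles": 1.86})
```

Pedestrians carry more than half of the total weight, which is consistent with the decision to give them extra emphasis during training.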

Argoverse [3]

Argoverse is a large-scale autonomous driving dataset containing 320 hours of data as well as rich map information. The provided vector map is a semantic graph with detailed information about the lanes that an agent might follow. We can therefore obtain multi-modal future trajectories explicitly with the help of multiple candidate lane centerlines. The trajectories in this dataset are individual 5-second segments: the first 2 seconds are used as history, and the task is to predict the spatial locations of the vehicle up to the 5-second mark. The traffic-agent type is not provided in Argoverse, so it is not included in our input representation.

We follow the official metric to benchmark multiple predictions on this dataset. The metric we choose is the Minimum over N (MoN) metric, as in previous works [13, 21]. It computes the error between the ground truth and the closest of the N output predictions. Specifically, we evaluate the top-N ADE and FDE with N = 6. To generate multi-modal future trajectories, we use the basic implementation offered by the Argoverse baseline. In training, we choose a 2D curvilinear coordinate system with axes tangential and perpendicular to the most probable centerline of the trajectory. At inference, we generate diverse future trajectories by using a different centerline as the 2D curvilinear coordinate system. With multiple candidate centerlines, we can define various origins and reference frames to predict diverse futures.
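The Minimum over N idea can be sketched in a few lines of Python. In this illustrative version the ADE and FDE are minimized independently over the candidates; whether an official evaluation selects one best candidate for both metrics is not specified here, so treat that choice, along with the name `min_over_n`, as an assumption.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def _dist(p: Point, g: Point) -> float:
    return math.hypot(p[0] - g[0], p[1] - g[1])


def min_over_n(candidates: List[Trajectory], gt: Trajectory) -> Tuple[float, float]:
    """Return (minimum ADE, minimum FDE) over the N candidate predictions."""
    ades = [sum(_dist(p, g) for p, g in zip(c, gt)) / len(gt) for c in candidates]
    fdes = [_dist(c[-1], gt[-1]) for c in candidates]
    return min(ades), min(fdes)


gt = [(0.0, 0.0), (1.0, 0.0)]
candidates = [
    [(0.0, 0.0), (1.0, 1.0)],  # right start, wrong end point
    [(0.0, 1.0), (1.0, 0.0)],  # wrong start, right end point
]
min_ade, min_fde = min_over_n(candidates, gt)
```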


Besides the real-world autonomous driving datasets, we also evaluate our method on a widely used trajectory dataset, the Next Generation Simulation (NGSIM) dataset [6]. It consists of 45 minutes of highway driving trajectories at 10 Hz for each roadway, and contains various traffic conditions and diverse interactions among traffic agents. Note that trajectories in this dataset are recorded by fixed bird's-eye-view cameras, which observe traffic agents without occlusion or missing data. This is inconsistent with real autonomous driving scenarios, so we only list the results for reference. We adopt the same experiment setting as [9], using 3 s of history to predict a 5 s future trajectory. The evaluation metric is root mean square error (RMSE) in meters over all future timesteps.

IV-B Implementation Details

We set the number of units in each fully connected layer to 128; the hidden state of the decoder LSTM has the same dimension. We train the network with a batch size of 128 for 50 epochs using Adam, with an initial learning rate of 0.0003 and a weight decay of 0.0001. The basic input of our model is 2D location, velocity, discrete time, and ID. We also add the agent type on the ApolloScape dataset, since urban scenarios usually contain mixed types of agents.

IV-C Results

We compare our proposed method with several state-of-the-art algorithms and baseline models in this section.

Fig. 3: Visualization of diverse trajectory predictions on the Argoverse dataset. Even with the basic implementation, we can generate multiple diverse trajectories using the well-modeled spatio-temporal context.

The state-of-the-art algorithms and baseline models on ApolloScape are listed below. The results are taken from the ApolloScape leaderboard to ensure a fair comparison.

  • StarNet[34]: The champion of trajectory prediction challenge in “CVPR 2019 Workshop on Autonomous Driving — Beyond Single Frame Perception”, which utilizes a centralized hub network to model spatio-temporal contexts with low time complexity.

  • TrafficPredict[25]: An LSTM-based algorithm which focuses on predicting trajectories for heterogeneous traffic-agents in urban environment.

  • LSTM: A simple LSTM encoder-decoder which only models each agent’s temporal context independently.

  • CV(constant velocity): A baseline which uses a constant velocity Kalman filter to predict trajectories in the future.
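For intuition about the simplest baseline in the list above, the following Python sketch extrapolates the last observed displacement at constant velocity. It is a plain extrapolation, not the Kalman-filter variant mentioned in the list, and the function name `constant_velocity` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def constant_velocity(history: List[Point], n_future: int) -> List[Point]:
    """Repeat the last observed per-step displacement for n_future steps."""
    (x0, y0), (x1, y1) = history[-2], history[-1]
    dx, dy = x1 - x0, y1 - y0  # displacement per time step
    return [(x1 + dx * k, y1 + dy * k) for k in range(1, n_future + 1)]


# With observations every 0.5 s, six future steps cover a 3 s horizon.
history = [(0.0, 0.0), (1.0, 0.5)]
future = constant_velocity(history, n_future=6)
```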

Table I shows that our proposed method achieves the best performance on every metric and every agent type. In particular, compared with the current state-of-the-art method StarNet, an LSTM-based attention network, we observe a 0.63 m improvement in FDE for the vehicle prediction task. Although StarNet leverages a hub network to model agents' interactions, its encoded context is still worse than ours. Our method improves considerably over all other methods for all three kinds of traffic agents.

Methods              ADE   FDE
NN + map             2.28  4.80
LSTM + map           2.25  4.67
LSTM + social + map  2.46  4.67
UST                  1.47  2.94
TABLE II: Evaluations of UST and multiple baselines on the Argoverse dataset. Error is minimum over six samples.

We compare our method with several baselines from Argoverse.

  • Nearest Neighbor with map (NN + map): Weighted Nearest Neighbor regression trajectories where trajectories are queried by coordinate in the curvilinear coordinate system.

  • LSTM + map: standard LSTM encoder-decoder structure, which uses map information to define multiple curvilinear coordinate systems. Therefore, it can generate multiple possible future trajectories.

  • LSTM + social + map : Similar to LSTM + map, but with additional hand-crafted social features. Social features include minimum distance to the vehicle in front and in back, and number of neighbors.

Table II shows that UST outperforms all baselines on both the ADE and FDE metrics. We also provide qualitative results to visualize how UST can predict multiple possible future trajectories according to maps (as shown in Fig. 3). Once the spatio-temporal context is well modeled, we can easily generate multi-modal trajectories with additional map information under different complex conditions.


We report the performance of UST and several state-of-the-art algorithms on the NGSIM dataset (see Table III). Only the result of the deterministic version is reported. In addition, we construct a variant of our method, denoted UST-180, which doubles the context range in the longitudinal direction.

UST outperforms all other algorithms in the same setting. In particular, the results of UST-180 improve dramatically: it outperforms the standard version, especially at longer prediction horizons. We attribute this to the richer spatio-temporal context provided, which our unified modeling method can digest seamlessly.

Time (s)  CV    CV-GMM [8]  GAIL-GRU [20]  LSTM  MATF [33]  CS-LSTM [9]  S-LSTM [1]  UST   UST-180
1         0.73  0.66        0.69           0.68  0.67       0.61         0.65        0.58  0.56
2         1.78  1.56        1.56           1.65  1.51       1.27         1.31        1.20  1.15
3         3.13  2.75        2.75           2.91  2.51       2.09         2.16        1.96  1.82
4         4.78  4.24        4.24           4.46  3.71       3.10         3.25        2.92  2.58
5         6.68  5.99        5.99           6.27  5.12       4.37         4.55        4.12  3.45
TABLE III: Comparison with different methods. Root mean square errors (RMSE) in meters for prediction horizons from 1 s to 5 s are reported.

IV-D Ablation Study

In this section, we design several variants of UST to investigate how different settings influence the final performance. All methods are evaluated on the Argoverse validation set. The metric in the ablation study is top-N ADE and FDE with N=1, to control all possible factors except the encoder.

Input Representation

We study the features used in the input representation of UST. Table IV summarizes the results. Note that the velocity is computed by differentiating the provided positions. As can be seen, velocity brings only a marginal improvement. We attribute this observation to the fact that UST can model the relationship between different agents across different time steps, so the speed feature has already been learned implicitly. This property is especially useful when the observations from upstream modules are noisy, in which case the results of explicit differentiation are not reliable.

Position  Time  Velocity  3s ADE  3s FDE
✓         -     -         3.46    7.60
✓         ✓     -         2.71    6.01
✓         ✓     ✓         2.68    5.97
TABLE IV: Results of ablation studies on the Argoverse validation set.
Number of Times of Iterative Refinement

We also conduct experiments to investigate the number of recursive refinement steps. The results in Fig. 4 show that the performance saturates once the number of refinement steps exceeds two. This suggests that the gain from modeling interactions of even higher order is marginal.

Fig. 4: Results on the Argoverse validation set for different numbers of iterative refinement steps.

V Conclusion

In this paper, we have presented UST for the trajectory prediction problem. It integrates 2D locations and discrete time into one unified 3D space, and then learns the spatio-temporal context end-to-end. Despite its simplicity, it shows state-of-the-art performance on various trajectory prediction datasets. We hope our method can serve as a strong baseline for the trajectory prediction field, and that these encouraging results will inspire more advanced methods for spatio-temporal context representation.


  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016) Social LSTM: Human trajectory prediction in crowded spaces. In CVPR, pp. 961–971. Cited by: §I, §II, §II, §III-D, TABLE III.
  • [2] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov (2019) MultiPath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449. Cited by: §II.
  • [3] M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al. (2019) Argoverse: 3D tracking and forecasting with rich maps. In CVPR, pp. 8748–8757. Cited by: §IV-A, §IV.
  • [4] C. Choi and B. Dariush (2019) Looking to relations for future trajectory forecast. arXiv preprint arXiv:1905.08855. Cited by: §II.
  • [5] W. Choi and S. Savarese (2013) Understanding collective activities of people from videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (6), pp. 1242–1257. Cited by: §II.
  • [6] J. Colyar and J. Halkias (2007) US highway 101 dataset. Federal Highway Administration (FHWA), Tech. Rep. FHWA-HRT-07-030. Cited by: §III-B2, §IV-A.
  • [7] H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric (2019) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. In ICRA, pp. 2090–2096. Cited by: §II.
  • [8] N. Deo, A. Rangesh, and M. M. Trivedi (2018) How would surround vehicles move? A unified framework for maneuver classification and motion prediction. IEEE Transactions on Intelligent Vehicles 3 (2), pp. 129–140. Cited by: TABLE III.
  • [9] N. Deo and M. M. Trivedi (2018) Convolutional social pooling for vehicle trajectory prediction. In CVPRW, pp. 1468–1476. Cited by: §I, §I, §II, §II, §II, §II, §III-D, §IV-A, TABLE III.
  • [10] W. Ding, J. Chen, and S. Shen (2019) Predicting vehicle behaviors over an extended horizon using behavior interaction network. arXiv preprint arXiv:1903.00848. Cited by: §II.
  • [11] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F. Chou, T. Lin, and J. Schneider (2018) Motion prediction of traffic actors for autonomous driving using deep convolutional networks. arXiv preprint arXiv:1808.05819. Cited by: §II.
  • [12] G. Ferrer, A. Garrell, and A. Sanfeliu (2013) Robot companion: A social-force based approach with human awareness-navigation in crowded environments. In IROS, pp. 1688–1694. Cited by: §II.
  • [13] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018) Social GAN: Socially acceptable trajectories with generative adversarial networks. In CVPR, pp. 2255–2264. Cited by: §I, §I, §I, §II, §II, §II, §II, §III-C, §III-D, §IV-A.
  • [14] D. Helbing and P. Molnar (1995) Social force model for pedestrian dynamics. Physical Review E 51 (5), pp. 4282. Cited by: §II.
  • [15] X. Huang, X. Cheng, Q. Geng, B. Cao, D. Zhou, P. Wang, Y. Lin, and R. Yang (2018) The ApolloScape dataset for autonomous driving. In CVPRW, pp. 954–960. Cited by: §IV-A, §IV.
  • [16] Y. Huang, H. Bi, Z. Li, T. Mao, and Z. Wang (2019) STGAT: Modeling spatial-temporal interactions for human trajectory prediction. In ICCV, pp. 6272–6281. Cited by: §I, §II, §II, §II, §III-C.
  • [17] B. Kim, C. M. Kang, J. Kim, S. H. Lee, C. C. Chung, and J. W. Choi (2017) Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In ITSC, pp. 399–404. Cited by: §II.
  • [18] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, S. H. Rezatofighi, and S. Savarese (2019) Social-BiGAT: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. arXiv preprint arXiv:1907.03395. Cited by: §I.
  • [19] T. Koskela, M. Lehtokangas, J. Saarinen, and K. Kaski (1996) Time series prediction with multilayer perceptron, FIR and Elman neural networks. Cited by: §III-B2.
  • [20] A. Kuefler, J. Morton, T. Wheeler, and M. Kochenderfer (2017) Imitating driver behavior with generative adversarial networks. In IV, pp. 204–211. Cited by: TABLE III.
  • [21] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker (2017) Desire: Distant future prediction in dynamic scenes with interacting agents. In CVPR, pp. 336–345. Cited by: §IV-A.
  • [22] X. Li, X. Ying, and M. C. Chuah (2019) GRIP: Graph-based interaction-aware trajectory prediction. arXiv preprint arXiv:1907.07792. Cited by: §I, §II.
  • [23] Y. Li (2019) Which way are you going? Imitative decision learning for path forecasting in dynamic scenes. In CVPR, pp. 294–303. Cited by: §II.
  • [24] M. Luber, J. A. Stork, G. D. Tipaldi, and K. O. Arras (2010) People tracking with human motion predictions from social forces. In ICRA, pp. 464–469. Cited by: §II.
  • [25] Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha (2019) Trafficpredict: Trajectory prediction for heterogeneous traffic-agents. In AAAI, pp. 6120–6127. Cited by: §II, 2nd item, TABLE I, §IV.
  • [26] K. Messaoud, I. Yahiaoui, A. Verroust-Blondet, and F. Nashashibi (2019) Non-local social pooling for vehicle trajectory prediction. Cited by: §I.
  • [27] N. Nikhil and B. Tran Morris (2018) Convolutional neural network for trajectory prediction. In ECCV. Cited by: §II.
  • [28] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: Deep learning on point sets for 3D classification and segmentation. In CVPR, pp. 652–660. Cited by: §I, §III-B2, §III-B2.
  • [29] A. Vemula, K. Muelling, and J. Oh (2018) Social attention: Modeling attention in human crowds. In ICRA, pp. 1–7. Cited by: §I, §I, §II, §II.
  • [30] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg (2011) Who are you with and where are you going?. In CVPR, pp. 1345–1352. Cited by: §II.
  • [31] S. Yi, H. Li, and X. Wang (2016) Pedestrian behavior understanding and prediction with deep neural networks. In ECCV, pp. 263–279. Cited by: §II.
  • [32] P. Zhang, W. Ouyang, P. Zhang, J. Xue, and N. Zheng (2019) SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction. In CVPR, pp. 12085–12094. Cited by: §I, §II, §II.
  • [33] T. Zhao, Y. Xu, M. Monfort, W. Choi, C. Baker, Y. Zhao, Y. Wang, and Y. N. Wu (2019) Multi-agent tensor fusion for contextual trajectory prediction. In CVPR, pp. 12126–12134. Cited by: §I, §II, §III-C, TABLE III.
  • [34] Y. Zhu, D. Qian, D. Ren, and H. Xia (2019) StarNet: Pedestrian trajectory prediction using deep neural network in star topology. arXiv preprint arXiv:1906.01797. Cited by: 1st item, TABLE I.