The Simpler the Better: Constant Velocity for Pedestrian Motion Prediction

03/19/2019 ∙ by Christoph Schöller, et al.

Pedestrian motion prediction is a fundamental task for autonomous robots and vehicles to operate safely. In recent years many complex models have been proposed to address this problem. While complex models can be justified, simple models should be preferred given the same or better performance. In this work we show that a simple Constant Velocity Model can achieve competitive performance on this task. We evaluate the Constant Velocity Model using two popular benchmark datasets for pedestrian motion prediction and show that it outperforms state-of-the-art models and several common baselines. The success of this model indicates that either neural networks are not able to make use of the additional information they are provided with, or that this information is not as relevant as commonly believed. Therefore, we analyze how neural networks process this information and how it impacts their predictions. Our analysis shows that neural networks implicitly learn environmental priors that have a negative impact on their generalization capability, that most of the pedestrian’s motion history is ignored, and that interactions, while happening, are too complex to predict. These findings explain the success of the Constant Velocity Model and lead to a better understanding of the problem at hand.




1 Introduction

The accurate prediction of pedestrians’ future motion is an essential capability in a diverse set of scenarios. Autonomous robots need to navigate in highly populated and changing environments [7, 4] and execute their assigned tasks without causing collisions or endangering humans. Furthermore, the emergence of autonomous vehicles requires understanding and robustly predicting the motion of pedestrians in urban traffic scenarios [27, 16, 13].

Recently, increasingly complex models have been proposed to address this problem [20, 29, 12, 1, 31]. Many of these contributions utilize neural networks due to their success in other problem domains like image recognition or natural language processing. While complex models can be justified, a simpler model should be preferred if it achieves competitive performance.

Figure 1: Predictions of the Constant Velocity Model on a scene from the UCY Uni data.

The contribution of this work is the insight that, despite its simplicity, a Constant Velocity Model (CVM) can achieve state-of-the-art performance for pedestrian motion prediction. We present an extensive evaluation on two well-known benchmark datasets and compare the CVM with popular baselines and three state-of-the-art models based on neural networks. The success of this simple approach hints that much of the information provided to neural networks is either not used or is less relevant than commonly believed. For this reason we analyze how neural networks process this information and how it impacts their predictions. In particular, we analyze:

  • Environmental Priors. Physical constraints and environmental semantics bias pedestrian motion towards certain patterns. We show that neural networks implicitly learn such an environmental prior that has a strong negative impact on their generalization to new scenes.

  • Motion History. It is common belief that taking longer motion histories into account leads to more accurate predictions of the future. However, our analysis demonstrates that most of this information is ignored and depriving a neural network of the longer past does not lead to prediction degradation.

  • Pedestrian Interactions. Interactions between pedestrians happen and our experiments show that neighborhood information can potentially improve predictions. However, knowledge of the neighbors’ motion histories alone is not sufficient for accurate interaction-aware predictions.

2 Related Work

The prediction of pedestrian motion has been addressed from various perspectives. In tracking algorithms, motion prediction is important for robust statistical filtering. To track people in images, Baxter et al. [5] extend a Kalman Filter with an instantaneous prior belief about where people will move, based on where they are currently looking. Instead of images, Cui et al. [9] and Leigh et al. [21] presented algorithms to track pedestrians’ feet in laser scans. However, tracking only requires short-term motion predictions, whereas our contribution concentrates on long-term predictions.

To make long-term predictions, Becker et al. [6] use a recurrent encoder with a multilayer perceptron and achieve good results. In contrast, our approach is much simpler and not based on neural networks. Other contributions also take pedestrian interactions into account. To model these interactions, Helbing et al. [14] propose the use of attractive and repulsive social forces. This approach was later extended and transferred to the prediction of pedestrians for autonomous robots [24]. Later, interaction-awareness was integrated into neural networks. Alahi et al. [1] train LSTMs for pedestrian motion prediction and share information about the pedestrians through a social pooling mechanism. This mechanism gathers the hidden states of nearby pedestrians with a pooling grid. Xu et al. [33] propose the Crowd Interaction Deep Neural Network that computes the spatial affinity between pedestrians’ last locations to weight the motion features of all pedestrians for location displacement prediction. Vemula et al. [31] address interaction-aware motion prediction with their Social Attention model by capturing the relative importance between pedestrians with spatio-temporal graphs.

To model distributions of future trajectories, generative neural networks have been proposed for interaction-aware pedestrian motion prediction. Sadeghian et al. [29] use a Generative Adversarial Network (GAN) [11] that leverages the pedestrian’s path history and scene images as context, with a physical and social attention mechanism. Gupta et al. [12] propose the Social GAN with an extended social pooling mechanism that, unlike the one in [1], is not restricted to a limited grid around the target pedestrian. Our CVM only makes use of the last two timesteps of a pedestrian’s motion history and does not consider interactions.

Besides interaction-awareness, the environment has also been exploited for pedestrian motion prediction. Ballan et al. [2] extract navigation maps from bird’s-eye images and transfer them to new scenes with a retrieval and patch matching procedure to make predictions. Jaipuria et al. [16] propose a transferable framework for predicting the motion of pedestrians on street intersections based on Augmented Semi-Nonnegative Sparse Coding. Lee et al. [20] predict the motion of vehicles and pedestrians by sampling future trajectory hypotheses from a Conditional Variational Autoencoder. They select the most reasonable trajectories by scoring them based on future interactions and consider the environment by encoding it with a Convolutional Neural Network in an occupancy grid map. Bartoli et al. [3] consider environmental context by providing a Long Short-term Memory (LSTM) with distances of the target pedestrian to static objects in space, as well as a human-to-human context in the form of a grid map, or alternatively the neighbors’ hidden encodings. Aside from neural networks, set-based methods [19], Gaussian Processes [13] and Reinforcement Learning algorithms [34] have been proposed for predictions that take into account the pedestrians’ environment. In this work we do not consider the pedestrians’ environment for making predictions but analyze how it can implicitly impact the generalization behavior of more complex and trainable models like neural networks.

3 Problem Formulation

We denote the position of pedestrian $i$ at time-step $t$ as $p_i^t = (x_i^t, y_i^t)$. The goal of pedestrian motion prediction is to predict the future trajectory $T_i = (p_i^{t+1}, \dots, p_i^{t+n})$ for pedestrian $i$, taking into account his or her own motion history $H_i = (p_i^0, \dots, p_i^t)$. Interaction-aware motion prediction algorithms additionally use information about the motion histories $H_j$, $j \neq i$, of neighboring pedestrians that are present in the scene. The problem of finding a parametric model $f_W$ that estimates the future trajectory $\hat{T}_i$ can be formulated as

$$\hat{T}_i = f_W\left(H_i, \{H_j : j \in \{1, \dots, N\} \setminus \{i\}\}\right)$$

where $W$ are the model’s parameters and $N$ the number of pedestrians in the scene. This problem is often converted into a sequence-to-sequence prediction problem, where the model can only observe information from the past.

In practice we do not directly predict the next positions for pedestrian $i$, but relative displacements, defined as

$$\Delta_i^t = p_i^t - p_i^{t-1}$$

Predicting such residuals reduces the error margins compared with directly predicting future positions. Knowing $p_i^t$ allows us to convert back to $T_i$.
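The conversion between absolute positions and relative displacements can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper names `to_displacements` and `to_positions` and the example track are hypothetical and not taken from the paper.

```python
from itertools import accumulate

def to_displacements(positions):
    """Relative displacement between each pair of consecutive positions."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(positions, positions[1:])]

def to_positions(start, displacements):
    """Accumulate displacements onto a known start position."""
    points = accumulate(displacements,
                        lambda p, d: (p[0] + d[0], p[1] + d[1]),
                        initial=start)
    return list(points)[1:]  # drop the start position itself

# Hypothetical track of four (x, y) positions in meters.
track = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.25), (1.5, 0.75)]
deltas = to_displacements(track)
restored = to_positions(track[0], deltas)
```

Because the last observed position is known, summing the predicted displacements onto it recovers absolute positions without loss.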

4 Constant Velocity Model

Based on the assumption that the most recent relative motion of a pedestrian is the most relevant predictor for his or her future trajectory, we propose to use the CVM as a simple but effective prediction method. The CVM predicts that the pedestrian will continue to walk with the same velocity and direction as observed from the last two timesteps. This means we predict the future trajectory for pedestrian $i$, expressed as a sequence of relative displacements $\hat{O}_i$, as

$$\hat{O}_i = (\Delta_i^t, \dots, \Delta_i^t)$$

where $\Delta_i^t = p_i^t - p_i^{t-1}$ is the displacement between the last two observed positions of pedestrian $i$, and the number of $\Delta_i^t$ is equal to the number of prediction steps. Besides the importance of the pedestrian’s last observed relative motion, this model is based on two more expectations:

  • As it is parameter-free it will not specialize on a specific environment and thus generalize to new scenes.

  • While interactions and collision avoidance behavior happen, their influence on pedestrian trajectories is low.
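As a minimal sketch of the model described above, the following Python function repeats the last observed displacement for a given number of prediction steps. The function name and the example values are our own illustration, not code from the paper.

```python
def predict_constant_velocity(history, n_steps=12):
    """Extrapolate the last observed displacement for n_steps timesteps."""
    (x_prev, y_prev), (x_last, y_last) = history[-2], history[-1]
    dx, dy = x_last - x_prev, y_last - y_prev
    return [(x_last + k * dx, y_last + k * dy) for k in range(1, n_steps + 1)]

# Hypothetical observed positions in meters; only the last two are used.
observed = [(0.0, 0.0), (0.5, 0.25), (1.0, 0.5)]
prediction = predict_constant_velocity(observed, n_steps=3)
# prediction == [(1.5, 0.75), (2.0, 1.0), (2.5, 1.25)]
```

The function has no trainable parameters, which matches the first expectation listed above.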

In the following, we explain our experimental setup to evaluate the performance of the CVM and report our results.

5 Experiments

Metric  Dataset   ConstAcc  Lin    FF     LSTM   RED    OUR    SoPhie¹  S-GAN²  OUR-S
ADE     ETH-Uni   2.22      0.80   0.97   0.82   0.83   0.82   0.70     0.75    0.66
ADE     Hotel     1.13      0.41   2.92   3.49   0.48   0.29   0.76     0.37    0.21
ADE     Zara1     0.61      0.46   0.43   0.36   0.38   0.35   0.30     0.22    0.25
ADE     Zara2     0.52      0.40   0.39   0.48   0.33   0.32   0.38     0.23    0.22
ADE     UCY-Uni   0.81      0.59   0.64   0.72   0.49   0.47   0.54     0.39    0.35
ADE     AVG       1.06      0.53   1.07   1.17   0.50   0.45   0.54     0.39    0.34
FDE     ETH-Uni   5.67      1.62   1.98   1.66   1.69   1.72   1.43     1.37    1.31
FDE     Hotel     2.92      0.87   4.84   6.91   1.00   0.55   1.67     0.72    0.39
FDE     Zara1     1.57      0.97   0.93   0.81   0.84   0.79   0.63     0.41    0.50
FDE     Zara2     1.34      0.83   0.82   1.13   0.74   0.71   0.78     0.43    0.46
FDE     UCY-Uni   2.10      1.19   1.32   1.60   1.05   1.05   1.24     0.70    0.73
FDE     AVG       2.72      1.09   1.98   2.42   1.06   0.96   1.15     0.73    0.68

Table 1: ADE and FDE of all evaluated models per scene and on average. All numbers are reported in meters. OUR-S outperforms compared state-of-the-art models.
¹ Results from [31]. ² Improved version from

Figure 2: Scenes predicted with OUR-S and 20 drawn samples as explained in Sec. 5.2. Even though the pedestrians’ trajectories are non-linear, our model generates close approximations.

To evaluate the performance of the CVM we use two popular datasets for pedestrian motion prediction. These are the ETH [26] and UCY [22] datasets. They contain a total of five scenes: ETH-Uni and Hotel (from ETH) and Zara1, Zara2 and UCY-Uni (from UCY). They are based on real-world video recordings from different scenarios like pedestrian walking zones and university campuses. Both datasets were heavily used to evaluate the performance of motion prediction models in recent contributions [31, 1, 12, 33, 29].

In our experiment we observe the last 8 positions of a pedestrian’s trajectory and predict the next 12 timesteps. This corresponds to an observation window of 3.2 seconds and a prediction for the next 4.8 seconds, which is an established setting and used in other motion prediction papers as well [31, 1, 12, 33, 29]. This means we slice trajectories with a sliding window and step-size of one into sequences of length 20. We reject sliced trajectories with a length shorter than 10. By this we guarantee an observation of 8 timesteps and that the evaluated models must predict at minimum the next two timesteps.
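The slicing procedure can be sketched as follows. How shorter windows at the end of a track are handled is our interpretation of the description above, and the function name and the example track are hypothetical.

```python
OBS_LEN, PRED_LEN, MIN_LEN = 8, 12, 10

def slice_trajectory(track, seq_len=OBS_LEN + PRED_LEN, min_len=MIN_LEN):
    """Slide a window of up to seq_len positions over a track with step one.

    Windows near the end of the track are shorter; they are kept only if
    they still contain at least min_len positions (assumed interpretation).
    """
    samples = []
    for start in range(len(track)):
        window = track[start:start + seq_len]
        if len(window) < min_len:
            break  # all later windows are shorter still
        samples.append((window[:OBS_LEN], window[OBS_LEN:]))
    return samples

# Hypothetical track with 25 positions walking along the x-axis.
track = [(0.4 * k, 0.0) for k in range(25)]
samples = slice_trajectory(track)
```

Every sample then has exactly 8 observed positions and between 2 and 12 future positions to predict.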

As proposed in [1], we train on four scenes and evaluate on the remaining one in a leave-one-out cross-validation fashion. This ensures the evaluation of the model’s generalization capability to new scenarios.
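A minimal sketch of this leave-one-out protocol over the five scene names used in this work is shown below. The generator name is our own, and loading the data is left out.

```python
SCENES = ["ETH-Uni", "Hotel", "Zara1", "Zara2", "UCY-Uni"]

def leave_one_out(scenes):
    """Yield (training scenes, held-out test scene) for every scene."""
    for held_out in scenes:
        yield [s for s in scenes if s != held_out], held_out

splits = list(leave_one_out(SCENES))
for train_scenes, test_scene in splits:
    print(f"train on {train_scenes}, evaluate on {test_scene}")
```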

Like in related contributions we report errors in meters and evaluate all models with the following metrics:

  • Average Displacement Error (ADE): Average L2 distance between all corresponding positions in the ground truth and the predicted trajectory.

  • Final Displacement Error (FDE): L2 distance between the last position in the ground truth and the last position in the predicted trajectory.
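Both metrics can be computed for a single trajectory with a short Python function. This is a sketch with hypothetical names and example values; it assumes that the predicted and true trajectories have the same length.

```python
import math

def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) in the units of the input coordinates."""
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances), distances[-1]

# Hypothetical three-step example in meters.
truth = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
guess = [(1.0, 0.0), (2.0, 0.3), (3.0, 0.6)]
ade, fde = displacement_errors(guess, truth)
```

In this example the per-step errors are roughly 0, 0.3 and 0.6 meters, so the ADE is about 0.3 m and the FDE about 0.6 m.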

5.1 Training

We trained each model, except the CVM that does not require training, with the Adam Optimizer [17], learning rate 0.0004, batch size 64 and for 35 epochs. As loss function we used the Mean Squared Error. We randomly split the scenes into a training set and a 10% validation set to detect overfitting. All models converged without overfitting and did not require further hyperparameter tuning. State-of-the-art models from other contributions were trained as described in the respective papers, including specified data augmentations and loss functions.

5.2 Models

To evaluate the performance of our CVM we compare it with multiple baselines that are commonly used in contributions to the pedestrian motion prediction domain:

  • Linear Regression (Lin): Multivariate multi-target linear regression model that estimates each component in the predicted trajectory as an independent linear regression. Each predicted variable depends on the full motion history.

  • Constant Acceleration (ConstAcc): Observes the last three positions of a pedestrian and assumes he or she continues to walk with the same acceleration.

  • Feed Forward Neural Network (FF): Fully connected neural network that receives all eight motion history timesteps as a flattened vector. It then applies two hidden layers with 60 and 30 neurons, respectively. Both hidden layers are followed by ReLU activations. The final linear output layer has 24 outputs, which corresponds to 12 prediction timesteps with two coordinates each.

  • LSTM Network (LSTM): Stacked LSTM [15] that receives single positions and linearly embeds them in a 32 dimensional vector. Three LSTM layers with 128 hidden dimensions and a linear output layer follow. We trained with Teacher Forcing [32] on full 20 timestep sequences.
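To illustrate the ConstAcc baseline from the list above, the following sketch extrapolates the last three positions with a fixed change of the per-step displacement. The discretization is our own assumption, since the exact formulation is not spelled out here, and the function name is hypothetical.

```python
def predict_constant_acceleration(history, n_steps=12):
    """Extrapolate from the last three positions with a fixed acceleration."""
    (x0, y0), (x1, y1), (x2, y2) = history[-3:]
    vx, vy = x2 - x1, y2 - y1                # latest displacement per step
    ax, ay = vx - (x1 - x0), vy - (y1 - y0)  # change of that displacement
    x, y, path = x2, y2, []
    for _ in range(n_steps):
        vx, vy = vx + ax, vy + ay  # the velocity keeps changing every step
        x, y = x + vx, y + vy
        path.append((x, y))
    return path

# Hypothetical pedestrian who speeds up along the x-axis.
observed = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
prediction = predict_constant_acceleration(observed, n_steps=3)
# prediction == [(6.0, 0.0), (10.0, 0.0), (15.0, 0.0)]
```

The example shows how quickly the predicted step length grows: any acceleration observed in the last three positions is extrapolated over the whole prediction horizon.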

We further include three state-of-the-art models in our evaluation. These are the RNN-Encoder-MLP (RED) [6] that won the TrajNet benchmark [28], and the generative models Social GAN (S-GAN) [12] and SoPhie GAN (SoPhie) [29]. For S-GAN we selected the best performing version of their model, which is not interaction-aware as reported in [12]. We denote our CVM as OUR. Because S-GAN and SoPhie were evaluated by drawing 20 samples and taking the predicted trajectory with the minimum error for evaluation, we added an extended version OUR-S of our CVM for comparability. For OUR-S we add randomly drawn angular noise to its predicted direction and evaluate its error in the same fashion.

Unlike S-GAN and SoPhie that are based on GANs, OUR-S allows the association of a likelihood with each sampled future trajectory. In a realistic scenario this is crucial, as otherwise an autonomous agent would have to plan its actions based on all predicted trajectories and assume they are equally likely. Associated likelihoods enable hierarchical planning and allow the agent to disregard trajectories that are very unlikely to happen.
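The sampling variant and the best-of-20 evaluation can be sketched as follows. The noise distribution and its scale are not available in this text, so the zero-mean Gaussian and the value of `sigma_deg` below are purely illustrative assumptions, as are all function names.

```python
import math
import random

def sample_cvm(history, n_steps=12, n_samples=20, sigma_deg=10.0, seed=0):
    """Sample CVM rollouts whose heading is perturbed by angular noise."""
    rng = random.Random(seed)
    (x0, y0), (x1, y1) = history[-2], history[-1]
    dx, dy = x1 - x0, y1 - y0
    samples = []
    for _ in range(n_samples):
        # Assumed noise model: zero-mean Gaussian angle in degrees.
        angle = math.radians(rng.gauss(0.0, sigma_deg))
        rx = dx * math.cos(angle) - dy * math.sin(angle)
        ry = dx * math.sin(angle) + dy * math.cos(angle)
        samples.append([(x1 + k * rx, y1 + k * ry)
                        for k in range(1, n_steps + 1)])
    return samples

def best_of_n_ade(samples, ground_truth):
    """ADE of the sample that is closest to the ground truth."""
    def ade(path):
        errors = [math.dist(p, g) for p, g in zip(path, ground_truth)]
        return sum(errors) / len(errors)
    return min(ade(path) for path in samples)

# Hypothetical pedestrian who drifts slightly away from a straight line.
observed = [(0.0, 0.0), (0.5, 0.0)]
truth = [(0.5 + 0.5 * k, 0.05 * k) for k in range(1, 13)]
samples = sample_cvm(observed)
error = best_of_n_ade(samples, truth)
```

Since each sample is fully determined by its drawn angle, the density of that angle under the noise distribution can serve as a likelihood for the sample.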

5.3 Results

Quantitative. In Tab. 1 we display the prediction errors for all evaluated models on each scene. On average the best performing model for both ADE and FDE was OUR-S, which is our CVM with sampling. It also outperformed the state-of-the-art generative models S-GAN and SoPhie. As explained in the previous section, OUR-S, S-GAN and SoPhie were evaluated by considering only the error of the best predicted sample. For this reason, the other models are discussed separately. Of the models without sampling, OUR outperformed all other models as well, including RED. Its advantage was especially strong for the Hotel scene. Among the basic neural networks FF and LSTM, FF outperformed LSTM. We hypothesize that an error accumulation effect for the LSTM could be responsible for this, as it predicts the next step based on its output for the last step. ConstAcc had the highest average FDE, which shows that especially over long prediction horizons the assumption of continual acceleration or deceleration is detrimental. To our surprise, Lin performed better than LSTM, in contrast to what [31] reported. We believe this discrepancy can be attributed to the data augmentation they used for all their models. The good performance of Lin can be explained by the high bias and thus strong generalization of linear regression models. We explain in Sec. 6.1 why this effect is especially strong for Hotel.

Qualitative. Fig. 1 shows predictions of OUR on UCY Uni. The worst prediction OUR made is for the pedestrian close to the top left, who suddenly makes a sharp turn. This behavior is not predictable based on a pedestrian’s motion history. The rightmost pedestrian walks in a gradual curve. As our model makes linear predictions, it cannot extrapolate this behavior. However, these trends are often not reliable and the trajectory curvature suddenly changes. For example, the bottommost pedestrian is first taking a turn to the left, but then changes his or her trajectory back to the right, which is difficult to foresee. Overall, the linear predictions of our model are accurate and good approximations for the pedestrians’ future trajectories. In Fig. 2 we show two scenes from Zara1 and Zara2 with predictions from OUR-S. As described in Sec. 5.2 we sampled 20 trajectories from our model. The width of the prediction cone can be controlled via the magnitude of the angular noise during sampling. Our model is able to predict accurate linear approximations of the pedestrians’ future trajectories, even for those pedestrians that walk in a curved fashion. The four standing pedestrians are also correctly predicted not to move in the future. We show more qualitative examples in Appendix A and B.

6 Analysis of Influencing Factors

Even though the neural networks we included in our experiments in Sec. 5 are computationally more powerful and have access to additional information to make predictions, the CVM outperformed them. To understand these results, we analyze how this additional information is utilized and how it impacts the predictions of neural networks. In particular, we analyze the influence of environmental priors, motion history and pedestrian interactions on their performance. For our analysis in this section we use the feed forward neural network (FF) described in Sec. 5.2, as this type of network is better understood and shows simpler training dynamics than recurrent neural networks.

6.1 Environmental Priors

The environment of each scene puts constraints and bias on how pedestrians move within it. Constraints can be physical, for example certain areas like buildings cannot be traversed. Bias can be caused by the semantics of a scene, e.g. on a parking lot of a shopping center people are likely to either walk towards the shopping center, or away from it.

We argue that even though this information was not explicitly provided to the networks in Sec. 5, they implicitly learn such a prior. Each area of a scene corresponds to a specific numeric range of input coordinates. This allows a neural network to associate these areas with certain motion patterns. But even if we eliminate this positional prior, the network still learns motion patterns that are typical for the whole scene.

Figure 3: Trajectories from scene Hotel and other scenes combined. We sub-sampled the datasets for better visibility. The majority of trajectories in Hotel are oriented vertically, whereas in the other scenes most trajectories run horizontally. This causes the learning of an environmental prior which contradicts the Hotel scene at test time.

To verify this intuition, we compare the training methodology from Sec. 5.1 (Basic, corresponds to model FF in Sec. 5.2) with two modified methods that dampen the effects of environmental priors. For the first modification (Relative) we do not feed the network with absolute positions as inputs, but instead with the pedestrian’s past relative motion, which we compute analogously to the relative displacements in Sec. 3. This ensures that the model cannot learn to associate certain areas in the scene with a specific motion pattern. For the second modification (Relative + Rotations) we additionally add random rotations to the training trajectories to reduce directional bias. For this purpose we sample a random rotation angle for each trajectory. Note that we only apply one rotation to each sample in the training dataset, such that the resulting dataset has the same size for all three training variations. This ensures a fair comparison.
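The two modifications can be sketched as one small preprocessing step. The distribution of the rotation angle is not available in this text, so the uniform draw over the full circle below is only an assumption, and all names are hypothetical.

```python
import math
import random

def rotate(points, angle):
    """Rotate 2D points around the origin by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def augment(trajectory, rng):
    """Convert to relative motion and apply one random rotation."""
    deltas = [(x1 - x0, y1 - y0)
              for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])]
    angle = rng.uniform(-math.pi, math.pi)  # assumed angle distribution
    return rotate(deltas, angle)

rng = random.Random(0)
# Hypothetical trajectory in absolute scene coordinates (meters).
trajectory = [(3.0, 1.0), (3.5, 1.0), (4.0, 1.2), (4.5, 1.5)]
augmented = augment(trajectory, rng)
```

The relative representation removes the absolute location in the scene, and the rotation changes only the walking direction while step lengths and turning angles stay the same. Each trajectory is rotated once, so the dataset size is unchanged.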

The ADE and FDE for each variant are displayed in Tab. 2. For the rows with Avg we report the average errors of the leave-one-out experiment as in Sec. 5. Our results show that Relative, as well as the additional Rotations strongly improved the model’s performance. This effect is even intensified for scene Hotel, which we therefore report additionally. To understand why Hotel was more influenced by learning environmental priors, we plotted a sub-sample of the trajectories of the Hotel scene, as well as of all other scenes in Fig. 3. It shows that most trajectories in the other scenes run horizontally, whereas in Hotel most trajectories run vertically. We conclude that the network learned this motion pattern as a prior that negatively influenced its generalization capability. While the benefits of data augmentation have been reported earlier [29, 6], our analysis explains why it is beneficial and raises awareness for this problem. This awareness is necessary to understand certain phenomena, for example we believe that learning environmental priors is the primary reason for the bad performance of LSTMs reported in [31], instead of the missing interaction-awareness as the authors suggested.

            Basic   Relative   Relative + Rotations
ADE Avg     1.07    0.51       0.48
FDE Avg     1.98    1.09       1.01
ADE Hotel   2.92    0.50       0.34
FDE Hotel   4.84    1.08       0.64

Table 2: Effects of learning environmental priors on the network’s generalization capability. Our data augmentations strongly improved the performance.

6.2 Motion History

It is believed that neural networks can use longer motion histories to make better predictions. In Sec. 5 we provided our models with a history of eight timesteps, which is common practice in the domain of pedestrian motion prediction [12, 29, 1, 31]. However, the success of the CVM that only uses the last two timesteps to make predictions suggests that for pedestrian motion prediction long histories are not as relevant as believed. To isolate the effect of the motion history from the effect of learning environmental priors, we use the model trained with relative motion history and rotation augmentation from Sec. 6.1.

To evaluate if our feed forward model utilizes the full motion history to make predictions, we compute gradient norms for each timestep with respect to the predicted trajectory. In particular, after training we keep our network static and summarize predicted trajectories in a scalar value by summing up the absolute values of the network’s predicted displacements. As our network is a function of the motion history, we can compute the gradient of this scalar with respect to the input at each timestep in the history. Then we compute the norm of each gradient and evaluate it for all test set trajectories with the respectively trained models. We sum the gradient norms for all trajectories per timestep and normalize the resulting values to a distribution, such that they sum up to one. This distribution expresses how much influence each timestep in the motion history has on average on the output trajectory. We found that the latest relative motion, at timestep $t$, has an influence of 68% on the predicted trajectory, while timestep $t-1$ contributes only 8.7%. The other 5 timesteps in the relative motion history influenced the prediction with 4% to 6% each and did not decrease monotonically, but fluctuated. This fluctuation would be counter-intuitive if all timesteps contained predictive information. Combined with the strong drop of influence from timestep $t$ to $t-1$, this indicates that earlier timesteps in the history are not predictive.
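The gradient-based influence measure can be approximated as sketched below. The sketch makes several assumptions of its own: it uses central finite differences instead of backpropagation, it handles a single trajectory rather than summing over a test set, and `toy_model` is a hypothetical stand-in for the trained network.

```python
import math

def summarize(model, history):
    """Scalar summary: sum of absolute predicted displacement components."""
    return sum(abs(c) for step in model(history) for c in step)

def timestep_influence(model, history, eps=1e-5):
    """Normalized gradient norm of the summary w.r.t. each input timestep."""
    norms = []
    for k in range(len(history)):
        grad = []
        for axis in range(2):
            plus = [list(step) for step in history]
            minus = [list(step) for step in history]
            plus[k][axis] += eps
            minus[k][axis] -= eps
            diff = summarize(model, plus) - summarize(model, minus)
            grad.append(diff / (2 * eps))
        norms.append(math.hypot(*grad))
    total = sum(norms)
    return [n / total for n in norms]

def toy_model(history, n_steps=12):
    """Hypothetical stand-in: repeats a weighted mix of the last two steps."""
    (ax, ay), (bx, by) = history[-2], history[-1]
    step = (0.2 * ax + 0.8 * bx, 0.2 * ay + 0.8 * by)
    return [step] * n_steps

history = [(0.4, 0.1)] * 7  # seven relative displacements
influence = timestep_influence(toy_model, history)
```

For this toy model the influence concentrates on the last two timesteps, with roughly 0.8 and 0.2, and is zero elsewhere. For a trained network the same procedure would be summed over all test trajectories before normalizing.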

Figure 4: Pearson correlation coefficients between all timesteps of the observed relative motion histories for X and Y. All timesteps are highly correlated.

We further computed the linear correlation coefficients between all timesteps in the history for the $x$ and $y$ components. The resulting correlations are displayed in Fig. 4. All timesteps are highly correlated, with correlation coefficients ranging from 0.89 to 1.0. The closer the timesteps are, the more they are correlated. This high correlation means that the timesteps in the motion history contain much shared and thus redundant information. High correlation of features is an undesirable property in regression problems such as motion prediction and explains why our CVM performs well without considering the longer past. For the neural network we evaluated, this redundant information likely acts as noise rather than signal. This explains why it strongly relies on the latest timestep for prediction.
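A correlation matrix of this kind can be computed for one coordinate as sketched below. The data values are hypothetical and only illustrate the expected input layout.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def timestep_correlations(histories):
    """Correlation matrix between history timesteps for one coordinate.

    histories[n][k] is the displacement of sample n at timestep k.
    """
    columns = list(zip(*histories))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

# Hypothetical x-displacements of four samples over three timesteps.
x_histories = [[0.40, 0.42, 0.41], [0.10, 0.12, 0.15],
               [-0.30, -0.28, -0.25], [0.55, 0.50, 0.52]]
matrix = timestep_correlations(x_histories)
```

Pedestrians rarely change speed or direction abruptly, so each column is almost a copy of its neighbor and the off-diagonal entries come out close to one.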

To empirically confirm our findings we analyzed whether the performance of the neural network deteriorates when we successively deprive it of more of the pedestrian’s motion history. In particular, we train the network with relative histories of sizes seven down to one, but let it always predict the next 12 timesteps as usual. For all deprivation steps the network’s ADE fluctuated in the range 0.47 to 0.48 and the FDE in the range 0.99 to 1.01. The prediction errors did not change in a monotonic fashion and the differences are so small that they can be attributed to randomness in the network’s training process, i.e. weight initialization and stochastic gradient descent. This means the neural network achieved the same performance with a history of size one and of size seven. Our findings confirm that, contrary to popular belief, a long motion history is not predictive of the future in pedestrian motion prediction. This also explains how the CVM can outperform the state of the art without this information.

6.3 Pedestrian Interactions

To behave in an interaction-aware manner, a person must anticipate the future motion of his or her neighbors. Only then is it possible to plan a trajectory such that potential collisions are avoided. This implies that a neural network that predicts an interaction-aware trajectory for a pedestrian must implicitly and simultaneously predict the future trajectories of the pedestrian’s neighbors. The CVM does not receive information about surrounding pedestrians and does not make interaction-aware predictions. Its strong performance hints that either interactions do not have a strong influence on the pedestrians’ trajectories, or the state-space of interaction-aware motion prediction is too large. The latter could be the reason that a model cannot predict robust solutions that reliably decrease the expected error.

To analyze this, we compare three variations of our feed forward network. The first model does not receive any neighborhood information (Basic) and is equivalent to the best model from Sec. 6.1. The second model (History) receives the last eight history steps, including timestep $t$, of the pedestrian’s 12 closest neighbors. Note that state-of-the-art models like [12], [31] or [1] also receive information about the neighbors’ past, but indirectly through specialized pooling modules. The third model (Future) is provided with 12 true future positions of the pedestrian’s 12 closest neighbors. It has perfect information about the neighbors’ future and should be able to utilize it if this information is relevant for making predictions. We chose 12 as the number of observed neighbors because the average number of neighbors across all scenes is 11.34, and we provide all neighbor positions relative to the position of the target pedestrian. We order the observed neighbors by their distance to the target pedestrian at timestep $t$ in ascending order and pad missing neighbors with zeros. We did not include neighbors with partial trajectories as this had a negative impact on the prediction performance of History and Future. We re-trained each model variant separately, as before.
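The construction of the neighborhood input can be sketched as follows. That positions are taken relative to the target’s latest position is our assumption, and the function name and example values are hypothetical.

```python
import math

def neighbor_features(target_pos, neighbors, steps, max_neighbors=12):
    """Relative neighbor tracks, closest first, zero-padded to a fixed size.

    target_pos is the target's latest position; each neighbor is a list of
    `steps` (x, y) positions whose last entry is the latest timestep.
    """
    tx, ty = target_pos
    ordered = sorted(neighbors,
                     key=lambda track: math.dist(track[-1], target_pos))
    features = [[(x - tx, y - ty) for x, y in track]
                for track in ordered[:max_neighbors]]
    padding = [(0.0, 0.0)] * steps
    return features + [padding] * (max_neighbors - len(features))

target_pos = (0.0, 0.0)
neighbors = [
    [(4.0, 4.0), (3.0, 3.0)],  # farther neighbor, two history steps
    [(1.0, 0.0), (1.0, 0.5)],  # closer neighbor
]
features = neighbor_features(target_pos, neighbors, steps=2)
```

The result always has 12 neighbor slots, so it can be flattened and appended to the input vector of the feed forward network.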


Figure 5: Prediction of the model that received the neighbors’ future trajectories as input. Despite this additional information it predicted a colliding trajectory.

Tab. 3 shows the results of our experiments for all three variants. The model that received future motion information showed a moderate performance gain for the FDE, but no improvement of the ADE. We also observed fluctuations of the FDE across several training runs. This indicates that taking interactions into account can lead to better long-term predictions, but their average impact is low. The model that received the neighbors’ motion histories performed slightly worse than the model that received no additional information. This is likely because pedestrian interactions are too complex and the model is not able to find robust solutions while internally predicting for all pedestrians simultaneously. Instead, the information about the neighbors’ histories acts as noise.

          Basic   History   Future
ADE Avg   0.48    0.53      0.49
FDE Avg   1.01    1.07      0.96

Table 3: Influence of neighborhood information on the prediction performance. The model that received future information improved moderately, but providing it with the neighbors’ histories had no positive impact.

Fig. 5 shows a failed prediction of the Future model, where the model predicted a path that would cause a collision with the standing neighbors. Besides such failures, we also observed predictions that could be interpreted as interaction-aware, but these were so infrequent that chance rather than interaction-awareness likely caused them. Furthermore, most true trajectories do not involve obvious interaction-aware behavior. These mixed observations agree with the moderate FDE improvements of our quantitative evaluation.

Our analysis indicates that interactions are less relevant than commonly believed. Interaction-aware predictions are too complex to solve only based on neighbors’ motion histories. Our results are consistent with the observations made by [12] that including interaction-awareness by providing the model with the neighbors’ motion histories does not lead to performance gains and can even be detrimental.

7 Discussion

In this work we showed that the CVM is a simple but strong approach for pedestrian motion prediction. We extensively evaluated the CVM and compared it with other common baselines and three state-of-the-art models. In this comparison the CVM achieved the best results and even outperformed the state of the art on two popular pedestrian motion datasets with common metrics. To understand its success, we analyzed why neural networks, despite their computational power, do not make use of the additional information they are provided with. We found that neural networks learn environmental priors that have a negative impact on their generalization capability and explained how data augmentation can help to alleviate this problem. Furthermore, we analyzed why the observation of a long motion history does not contribute to better predictions for neural networks, compared to our CVM. Neural networks largely ignore the past and focus on the most recent motion of a pedestrian to make predictions. As the CVM does not take into account interactions, we further analyzed how neural networks use information about neighboring pedestrians. Our experiments indicate that interaction-awareness can potentially help to moderately improve predictions. However, it is a difficult problem and infeasible to solve based only on the neighbors’ motion histories. While we conducted our analysis with a feed forward neural network, our results are likely transferable to other architectures and datasets.

Compared to the CVM that we propose for pedestrian motion prediction, current state-of-the-art models are much more complex. The trend towards complex models is not limited to pedestrian motion prediction but is also ongoing in other domains. Complex models can be justified if they significantly improve performance on a problem and the sources of these performance gains have been identified with ablation studies [23]. Our experiments and the success of the CVM show that it is valuable not to forget simple models and to appreciate them if they perform equally well. Simple models generalize better, are more robust to shifts in the data distribution (see cross-data generalization in [30]), and are faster and easier to interpret. But most importantly, understanding why these models perform well leads to a deeper understanding of the problem at hand. This has also been demonstrated in other domains, such as image classification [8] and captioning [10], or reinforcement learning [25].

In the future, it would be interesting to see which of our results transfer to other motion prediction scenarios. For example, for vehicles, interaction-aware motion prediction could be possible, as they move in highly structured environments that limit the possibilities for interactions. Furthermore, our analysis could be used to develop better datasets for pedestrian motion prediction. As the environment has a strong influence on predictions, datasets with more diverse environments could help to explicitly exploit this information while maintaining generalization capability.


  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese. Social lstm: Human trajectory prediction in crowded spaces. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [2] L. Ballan, F. Castaldo, A. Alahi, F. Palmieri, and S. Savarese. Knowledge transfer for scene-specific motion prediction. In European Conference on Computer Vision (ECCV), 2016.
  • [3] F. Bartoli, G. Lisanti, L. Ballan, and A. Del Bimbo. Context-aware trajectory prediction. In International Conference on Pattern Recognition (ICPR), 2018.
  • [4] A. Bauer, K. Klasing, G. Lidoris, Q. Mühlbauer, F. Rohrmüller, S. Sosnowski, T. Xu, K. Kühnlenz, D. Wollherr, and M. Buss. The autonomous city explorer: Towards natural human-robot interaction in urban environments. International Journal of Social Robotics (IJSR), 2009.
  • [5] R. Baxter, M. Leach, S. Mukherjee, and N. Robertson. An adaptive motion model for person tracking with instantaneous head-pose features. Signal Processing Letters, 2015.
  • [6] S. Becker, R. Hug, W. Hubner, and M. Arens. Red: A simple but effective baseline predictor for the trajnet benchmark. In Workshop on Anticipating Human Behavior (ECCV), 2018.
  • [7] M. Bennewitz, W. Burgard, G. Cielniak, and S. Thrun. Learning motion patterns of people for compliant robot motion. International Journal of Robotics Research (IJRR), 2005.
  • [8] W. Brendel and M. Bethge. Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. In International Conference on Learning Representations (ICLR), 2019.
  • [9] J. Cui, H. Zha, H. Zhao, and R. Shibasaki. Tracking multiple people using laser and vision. In International Conference on Intelligent Robots and Systems (IROS), 2005.
  • [10] J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv, 2015.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Conference on Neural Information Processing Systems (NeurIPS), 2014.
  • [12] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi. Social gan: Socially acceptable trajectories with generative adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [13] G. Habibi, N. Jaipuria, and J. P. How. Context-aware pedestrian motion prediction in urban intersections. arXiv, 2018.
  • [14] D. Helbing and P. Molnar. Social force model for pedestrian dynamics. Physical Review E, 1995.
  • [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • [16] N. Jaipuria, G. Habibi, and J. P. How. A transferable pedestrian motion prediction model for intersections with different geometries. arXiv, 2018.
  • [17] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [18] D. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
  • [19] M. Koschi, C. Pek, M. Beikirch, and M. Althoff. Set-based prediction of pedestrians in urban environments considering formalized traffic rules. In International Conference on Intelligent Transportation Systems (ITSC), 2018.
  • [20] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker. Desire: Distant future prediction in dynamic scenes with interacting agents. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] A. Leigh, J. Pineau, N. Olmedo, and H. Zhang. Person tracking and following with 2d laser scanners. In International Conference on Robotics and Automation (ICRA), 2015.
  • [22] A. Lerner, Y. Chrysanthou, and D. Lischinski. Crowds by example. In Computer Graphics Forum, 2007.
  • [23] Z. Lipton and J. Steinhardt. Troubling trends in machine learning scholarship. In International Conference on Machine Learning Debates (ICML), 2018.
  • [24] M. Luber, J. Stork, G. Tipaldi, and K. Arras. People tracking with human motion predictions from social forces. In International Conference on Robotics and Automation (ICRA), 2010.
  • [25] H. Mania, A. Guy, and B. Recht. Simple random search of static linear policies is competitive for reinforcement learning. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [26] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In International Conference on Computer Vision (ICCV), 2009.
  • [27] A. Rasouli and J. Tsotsos. Autonomous vehicles that interact with pedestrians: A survey of theory and practice. arXiv, 2018.
  • [28] A. Sadeghian, V. Kosaraju, A. Gupta, S. Savarese, and A. Alahi. Trajnet: Towards a benchmark for human trajectory prediction. arXiv, 2018. (visited: 2019-02-21).
  • [29] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, and S. Savarese. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. arXiv, 2018.
  • [30] A. Torralba and A. Efros. Unbiased look at dataset bias. In Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
  • [31] A. Vemula, K. Muelling, and J. Oh. Social attention: Modeling attention in human crowds. In International Conference on Robotics and Automation (ICRA), 2018.
  • [32] R. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1989.
  • [33] Y. Xu, Z. Piao, and S. Gao. Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [34] B. Ziebart, N. Ratliff, G. Gallagher, C. Mertz, K. Peterson, A. Bagnell, M. Hebert, A. Dey, and S. Srinivasa. Planning-based prediction for pedestrians. In International Conference on Intelligent Robots and Systems (IROS), 2009.


Appendix A Prediction Examples of OUR

Figure 6: Predictions of the Constant Velocity Model.

Appendix B Prediction Examples of OUR-S

Figure 7: Predictions of the Constant Velocity Model with 20 sampled trajectories and .