Pedestrian trajectory prediction is a challenging task that has gained increasing attention in recent years because its applications are becoming more and more relevant. These applications include human surveillance, socio-robot navigation and autonomous driving. Because these areas have become more important and demanding over time, methods to approach the problem of pedestrian trajectory prediction have evolved, transitioning from physics-based models to data-driven models that use deep learning. One of the main sources of information that these models use is the past trajectory, and thus its representation has a great impact. Moreover, the deep learning architectures used are sequence-to-sequence, and these have evolved beyond recurrent models in recent years.
One of the first approaches to pedestrian behaviour modelling was introduced by Helbing and Molnár and is called the Social Forces Model Helbing and Molnár (1995). Physics-based models like this have been extensively developed in the past, with the introduction of other techniques such as BRVO Kim et al. (2015). However, in recent years the data-driven approach to pedestrian behaviour modelling has become increasingly popular, thanks to its promising results. One of the most influential neural network architectures in pedestrian trajectory prediction was introduced by Alahi et al. under the name of Social LSTM Alahi et al. (2016). Since then, several different deep learning architectures have been proposed. Common elements in these recent works are the use of Generative Adversarial Networks Gupta et al. (2018), the use of Graph Neural Networks Vemula et al. (2017), the integration of attention Fernando et al. (2017) and the inclusion of spatial Pfeiffer et al. (2018) and image information Sadeghian et al. (2019).
Despite the vast number of different neural network-based approaches, there are still some unexplored aspects. The first one is data pre-processing. Pedestrian trajectory prediction models take past positions as input; however, there is no detailed study investigating whether these coordinates should be normalized and which normalization technique is best. Moreover, the total amount of publicly available data is limited, while it is widely understood that neural networks perform better with a vast amount of data. To address the issue of limited data, a solution could be to use data augmentation. However, this approach is often not explored in detail in publications. Consequently, it is currently unknown which normalization and data augmentation techniques are most effective in pedestrian trajectory prediction.
Another topic hardly explored in the literature, Nikhil and Morris (2018) being the exception, is the use of Convolutional Neural Networks (CNN) in pedestrian trajectory prediction. In the machine translation and image captioning fields it has been shown, in works such as Gehring et al. (2017) and Aneja et al. (2018), that CNNs are a valid alternative to Recurrent Neural Networks (RNN). However, in pedestrian trajectory prediction, a detailed comparison is still missing.
Consequently, the objective of this work is to find effective pre-processing techniques and to develop a convolutional model capable of outperforming models based on RNN. Models presented in this work are designed to be employed in scenarios in which only the past positions (in meters) of each pedestrian in a certain area are known. It is assumed that no information is available about the environment in which pedestrians move.
In fulfilling the outlined objectives, the main contributions of this work are the following:
The identification of effective position normalization techniques and data augmentation techniques, such as random rotations and the addition of Gaussian noise;
The introduction of a novel model based on 2D convolutions capable of achieving state-of-the-art results on the ETH and TrajNet datasets.
In addition, we also present experimental results obtained by including social information in the convolutional model. These experiments empirically show that occupancy methods are ineffective at representing social information.
2 Related Work
Early work from Helbing and Molnár (1995) pioneered the use of physics-based models for predicting human trajectories. Their approach, the Social Forces model, considers every pedestrian as a particle subject to forces from nearby people and obstacles, and the sum of these forces gives the next pedestrian position. Physics-based pedestrian behaviour modelling has evolved over time, with the introduction of advanced techniques such as BRVO Kim et al. (2015), which builds upon Reciprocal Velocity Obstacle (RVO) van den Berg et al. (2011) and the Optimal Reciprocal Collision Avoidance (ORCA) Alonso-Mora et al. (2013).
These physics-based models, however, are limited by the fact that they use hand-crafted functions, and thus they can represent only a subset of all possible behaviours. Deep learning models are data-driven and thus do not have this limitation. In the literature, deep learning models for pedestrian trajectory prediction rely mainly on the use of Recurrent Neural Networks (RNN), in particular on Long Short-Term Memory (LSTM) cells Hochreiter and Schmidhuber (1997). One of the first works to use this approach, and one that pioneered the use of deep learning in pedestrian trajectory prediction, is the Social LSTM model Alahi et al. (2016). In this model, the pedestrian trajectory is fed to an LSTM together with social information. Social information is used to model social interaction and it is represented as a grid containing nearby pedestrians.
Later works have employed more advanced techniques together with the RNN approach, such as attention. Attention was first applied in the machine translation field Bahdanau et al. (2014), and one of the first works to use it for pedestrian trajectory prediction was introduced by Fernando et al. (2017). Since then, multiple works have used attention in different parts of the architecture Xue et al. (2019) Sadeghian et al. (2019) Xu et al. (2018) Warnakulasuriya et al. (2019) Amirian et al. (2019).
Generative Adversarial Networks (GAN) Goodfellow et al. (2014) are a way to generate new synthetic data similar to training data. GANs have been seen as a way to address the multi-modal aspect of pedestrian trajectory prediction. One of the first works to use a GAN for creating multiple pedestrian trajectories was the Social GAN model Gupta et al. (2018). In recent years the generative approach for pedestrian trajectory prediction has been extensively explored by other works using not only GANs Sadeghian et al. (2019) Warnakulasuriya et al. (2019) Kosaraju et al. (2019), but also Conditional Variational Auto-Encoders (CVAE) Amirian et al. (2019) Cheng et al. (2020) Li et al. (2019).
Another approach to solve the pedestrian trajectory prediction problem is using Graph Neural Networks (GNN), first applied by Vemula et al. (2017) and then also used in other works Haddad et al. (2019) Huang et al. (2019) Ivanovic and Pavone (2019) Sun et al. (2019) Zhang et al. (2019a) Mohamed et al. (2020).
Some authors have also tried to use other available sources of information to predict the future trajectory. Some works use spatial information represented as points of interest Bartoli et al. (2018), as an occupancy map Pfeiffer et al. (2018), or as a semantic segmentation of the scene Minoura et al. (2019) Lisotto et al. (2019). Meanwhile, other works use image information extracted directly from the dataset videos Sadeghian et al. (2019) Xue et al. (2018) Cheng et al. (2020) Kosaraju et al. (2019) Li et al. (2019).
In this section, the problem is first formally presented. Then we describe different approaches to data pre-processing and the proposed convolutional architecture. Finally, we present the recurrent baselines and the chosen approaches to include social information.
3.1 Problem formulation
The goal of pedestrian trajectory prediction is to predict pedestrians' future positions given their previous positions. Concretely, given a scene where pedestrians are present, their coordinates are observed for a certain amount of time, called $T_{obs}$, and the task is to predict the future coordinates of each pedestrian from $T_{obs}$ to $T_{pred}-1$ (assuming that time starts at 0). A discretization of time is assumed, in which the time difference between time $t$ and time $t+1$ is the same as the time difference between time $t+1$ and time $t+2$. The position of each pedestrian is characterized by its $(x, y)$ coordinates (in meters) with respect to a fixed point. Therefore, for pedestrian $i$ the positions $(x_t^i, y_t^i)$ for $t = 0, \dots, T_{obs}-1$ are observed and the positions $(\hat{x}_t^i, \hat{y}_t^i)$ for $t = T_{obs}, \dots, T_{pred}-1$ are predicted. We denote all the past positions of a pedestrian $i$ with $X^i$, the predicted future positions with $\hat{Y}^i$ and the real future positions of pedestrian $i$ with $Y^i$. In essence, the problem of pedestrian trajectory prediction can be stated as:
How to predict the future positions of pedestrians from their past trajectory with the lowest possible error?
3.2 Data pre-processing
To effectively train a model and achieve a low error, it is important to pre-process the data. In this work, this is done by normalizing the input coordinates and applying data augmentation techniques.
3.2.1 Data normalization
The input and target data of models in pedestrian trajectory prediction are coordinates; however, the origin point of these coordinates is not specified. Therefore, one might ask which coordinate system to use as a form of data normalization. To answer this question, we have identified four data pre-processing techniques:
Absolute coordinates. With absolute coordinates, we refer to the naive approach: taking directly the coordinates from the datasets as they are. This is not a sensible approach since each scene has the origin point in a different position, and thus coordinates can lie in very distant intervals.
Coordinates with the origin in the first observation point ($t = 0$). From each point in the sequence the first position is subtracted. In this way, the coordinates become scene-independent and do not have the same drawbacks as absolute coordinates.
Coordinates with the origin in the last observation point ($t = T_{obs}-1$, where $T_{obs}$ is the number of observed time steps). Similar to the previous coordinate type, but with the difference that the subtracted position is the one at $t = T_{obs}-1$, which is the last position the network will observe.
Relative coordinates (velocities). In this case, instead of coordinates with a fixed reference system, the network is fed with relative displacements. Note that if relative displacements are scaled according to the number of annotations per second, they represent the instantaneous velocities.
An example of the same trajectory represented in different coordinate systems can be found in Figure 1.
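As a rough illustration of the coordinate systems listed above, the following sketch applies them to a made-up trajectory. The function names and the sample points are hypothetical and are not taken from the authors' code; the 0.4 s sampling interval is the annotation rate of the datasets used later in this work.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def origin_at(points: List[Point], index: int) -> List[Point]:
    """Shift a trajectory so that points[index] becomes the origin."""
    ox, oy = points[index]
    return [(x - ox, y - oy) for x, y in points]


def relative_displacements(points: List[Point]) -> List[Point]:
    """Differences between consecutive positions (one element fewer than the input)."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(points, points[1:])]


# Hypothetical observed trajectory in absolute coordinates (meters).
observed = [(10.0, 5.0), (10.5, 5.0), (11.0, 5.5), (11.5, 6.0)]

first_origin = origin_at(observed, 0)    # origin in the first observation point
last_origin = origin_at(observed, -1)    # origin in the last observation point
displacements = relative_displacements(observed)

# Dividing each displacement by the sampling interval gives velocities in m/s.
dt = 0.4
velocities = [(dx / dt, dy / dt) for dx, dy in displacements]
```

In a training pipeline the same shift would presumably be applied to the target positions as well, so that inputs and outputs share one reference system.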
3.2.2 Data Augmentation
The following data augmentation techniques have been analyzed:
Apply a random rotation to each trajectory. This technique makes the network learn patterns in a rotation-invariant manner.
Apply Gaussian noise with mean 0 to every point. This should make the network more robust to small perturbations and imprecisions.
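The two augmentations can be sketched as follows. This is only an illustration under stated assumptions: the rotation is taken around the coordinate origin with an angle drawn uniformly from the full circle, and the noise standard deviation of 0.05 is the value reported in the implementation details. The function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def rotate(points: List[Point], angle: float) -> List[Point]:
    """Rotate every point around the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def add_gaussian_noise(points: List[Point], std: float, rng: random.Random) -> List[Point]:
    """Add independent zero-mean Gaussian noise to each coordinate."""
    return [(x + rng.gauss(0.0, std), y + rng.gauss(0.0, std)) for x, y in points]


def augment(points: List[Point], rng: random.Random, noise_std: float = 0.05) -> List[Point]:
    """Apply one random rotation to the whole trajectory, then perturb each point."""
    angle = rng.uniform(0.0, 2.0 * math.pi)  # assumed range: full circle
    return add_gaussian_noise(rotate(points, angle), noise_std, rng)


rng = random.Random(0)
trajectory = [(-1.0, -0.5), (-0.5, -0.2), (0.0, 0.0)]
augmented = augment(trajectory, rng)
```

Rotating around the origin keeps the last observed point fixed when the trajectory is already expressed with the origin in that point.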
3.3 Convolutional Model
CNNs can be applied to problems involving sequences, such as machine translation Gehring et al. (2017) or image captioning Aneja et al. (2018), achieving competitive results in comparison with RNNs. It has also been shown by Nikhil and Morris (2018) that a convolutional model can indeed be employed in pedestrian trajectory prediction. However, for their architecture it is not explained in detail how to go from 8 input positions to 12 output positions, and how to transform output features into future positions. Moreover, their model does not outperform recurrent models such as SoPhie Sadeghian et al. (2019).
For the reasons just stated, we introduce a new convolutional model specifically designed for pedestrian trajectory prediction. This model takes 8 input positions and outputs the future 12 positions, as is commonly done in the literature. The general architecture of the model is shown in Figure 3.
Initially, each input position is embedded into a 64-length feature vector by a fully connected layer. After this first step, the input trajectory is represented by feature vectors that are arranged in a 64x8 matrix, in which 64 is the embedding dimension and 8 is the number of input positions. This matrix can also be seen as 64 one-dimensional channels with 8 features each, one for each time step. Thus, it is possible to apply 1D convolutions to this matrix. After the embedding, a first group of 1D convolutions with padding is applied. The padding depends on the kernel size of the convolutions and is employed to keep the number of output features the same as the number of input features. This means that as many convolutional layers as desired can be stacked at this step. The mismatch between the number of input positions, which is 8, and the number of output positions, which is 12, requires the introduction of specific layers. Therefore, an upsampling layer is first applied to double the number of features from 8 to 16, and afterwards convolutional layers without padding are applied to reduce the number of features from 16 to 12. Lastly, a second group of convolutions with padding is applied, and then a final fully connected layer transforms each feature vector into an output position.
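The length bookkeeping described above can be checked with the standard output-length formula for a 1D convolution (dilation 1). The sketch below is only shape arithmetic, not the model itself, and the kernel sizes and layer counts are assumptions chosen for illustration.

```python
def conv1d_output_length(length: int, kernel_size: int, padding: int = 0, stride: int = 1) -> int:
    """Number of output features of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


def same_padding(kernel_size: int) -> int:
    """Padding that preserves the length for an odd kernel size and stride 1."""
    return (kernel_size - 1) // 2


length = 8  # one feature per observed position

# First group: padded convolutions keep the length at 8 (three layers assumed).
for _ in range(3):
    length = conv1d_output_length(length, kernel_size=3, padding=same_padding(3))
length_after_first_group = length

# Upsampling doubles the number of features: 8 -> 16.
length *= 2

# Unpadded convolutions shrink the length: with kernel size 3, 16 -> 14 -> 12.
for _ in range(2):
    length = conv1d_output_length(length, kernel_size=3)
final_length = length

# A single unpadded convolution with kernel size 5 would give the same reduction.
alternative_length = conv1d_output_length(16, kernel_size=5)
```

Each unpadded layer with kernel size k removes k - 1 features, so any combination of layers whose reductions sum to 4 closes the gap between 16 and 12.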
The presented convolutional model is scalable, in the sense that there is no limit on the number of layers in the initial and final convolution groups. It is also one-shot: all the output coordinates are generated in a single pass, unlike recurrent models, where one pass usually gives only the next position. The presented convolutional model is very generic, and some variations of it have also been explored:
Positional embeddings. As proposed by Gehring et al. (2017), the positional information of each input position is used to give the network a notion of the order of the input data.
Transpose convolution layers instead of the upsampling layer followed by convolutions without padding, to transition from 8 features to 12 features.
2D convolutions instead of 1D convolutions. The 64x8 matrix that is created with all the input position embeddings can also be considered as a one-channel image. 2D convolutions can then be applied to this artificial image. However, 2D convolutions increase the number of channels. Therefore, the final convolutional layer needs to decrease the number of channels to one so that the final fully connected layer that computes the future positions can be applied. 2D convolutions have the advantage that they process multiple features over multiple timesteps, while 1D convolutions process only one feature over multiple timesteps. The detailed architecture of the best 2D convolutional model can be found in Figure 4.
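For the transpose convolution variation, the transition from 8 to 12 features can be illustrated with the usual output-length formula of a transposed 1D convolution (dilation 1, no output padding). The kernel sizes below are assumptions picked only because they produce the required lengths; the sizes used in the experiments are not stated here.

```python
def conv_transpose1d_output_length(length: int, kernel_size: int, stride: int = 1, padding: int = 0) -> int:
    """Number of output features of a transposed 1D convolution (dilation 1, no output padding)."""
    return (length - 1) * stride - 2 * padding + kernel_size


# One layer with kernel size 5 maps 8 features directly to 12.
single_layer = conv_transpose1d_output_length(8, kernel_size=5)

# Two layers with kernel size 3 reach the same length: 8 -> 10 -> 12.
stacked = 8
for _ in range(2):
    stacked = conv_transpose1d_output_length(stacked, kernel_size=3)
```

With stride 1 and no padding, each transposed layer adds k - 1 features, which mirrors the reduction caused by an unpadded ordinary convolution.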
3.4 Recurrent baselines
To compare against the results obtained using the convolutional model, two RNN baselines have been implemented. The first is a simple LSTM. This model embeds one position into a 64-length feature vector with a fully connected layer; this vector is fed to the LSTM cell, and then two fully connected layers transform the output of the cell into the next position. The second baseline is an Encoder-Decoder (shortened to Enc-Dec in the tables) that uses LSTM cells both in the encoder and in the decoder. The encoder has the same architecture as the LSTM baseline except for the two output fully connected layers, which are missing, while the decoder has exactly the same architecture as the LSTM baseline.
3.5 Addition of social information
In addition to past trajectory, social information can be used as input to the network. We analyzed three simple ways to represent social information, which use the occupancies of nearby pedestrians in the space. These techniques are:
A square occupancy grid, introduced in Social LSTM Alahi et al. (2016).
A circular occupancy map, introduced in SS-LSTM Xue et al. (2018).
An angular pedestrian grid, introduced in Pfeiffer et al. (2018). In this technique the angular space around a pedestrian is divided in a number of intervals and then the closest pedestrian in each direction, within a certain range, is computed.
A visual example of these techniques can be seen in Figure 5.
The square occupancy grid is represented with a square matrix whose number of rows and columns equals the number of cells on each side. The circular occupancy map is represented with a matrix whose size depends on the number of circles. The angular pedestrian grid is represented by a vector whose length is 360 divided by the number of degrees that each element of the vector represents. Social information which is not already in vector form is flattened to be used as an input to the models.
Social information is integrated into the convolutional model and into the Encoder-Decoder baseline. Both models require minimal modifications: the social information is embedded by another fully connected layer, and the resulting social feature vector is then added to the position feature vector. This new vector represents position and social information for that timestep and it is then fed to the rest of the network. It is important to note that social information is available only during observation (therefore, in the Encoder-Decoder baseline the encoder processes both social and position information, while the decoder only processes position information).
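To make the square occupancy grid more concrete, the sketch below builds one for a single pedestrian and flattens it into a vector. It assumes a grid centred on the pedestrian and binary cells (1 if at least one neighbour falls inside); the defaults of 10 cells per side and 0.5 m per cell are the values reported in the implementation details. The function names and sample positions are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def square_occupancy_grid(center: Point, neighbours: List[Point],
                          cells_per_side: int = 10, cell_size: float = 0.5) -> List[List[int]]:
    """Binary grid centred on `center`; a cell is 1 if a neighbour lies inside it."""
    half = cells_per_side * cell_size / 2.0
    grid = [[0] * cells_per_side for _ in range(cells_per_side)]
    cx, cy = center
    for nx, ny in neighbours:
        dx, dy = nx - cx, ny - cy
        if -half <= dx < half and -half <= dy < half:
            # Clamp to guard against floating-point rounding at the upper border.
            col = min(int((dx + half) // cell_size), cells_per_side - 1)
            row = min(int((dy + half) // cell_size), cells_per_side - 1)
            grid[row][col] = 1
    return grid


def flatten(grid: List[List[int]]) -> List[int]:
    """Turn the grid into a vector so it can be fed to a fully connected layer."""
    return [value for row in grid for value in row]


# Hypothetical scene: two neighbours inside the 5 m x 5 m window, one outside.
grid = square_occupancy_grid((0.0, 0.0), [(0.1, 0.1), (-2.4, -2.4), (3.0, 0.0)])
social_vector = flatten(grid)
```

With these defaults the flattened vector has 100 elements, which would then be embedded and added to the position feature vector as described above.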
In this section we describe the datasets used along with the evaluation metrics. Then, we present the experimental results obtained by training the proposed model and baselines. Finally, a comparison with literature results is also shown.
The ETH Pellegrini et al. (2009) and UCY Lerner et al. (2007) datasets are two publicly available datasets widely used in the literature. Jointly they contain five scenes, two from ETH (named ETH and Hotel), and three from UCY (named Univ, Zara1 and Zara2). In total, they contain more than 1600 pedestrian trajectories, with pedestrian positions annotated every 0.4 seconds. Training and testing follow a leave-one-out cross-validation approach: a model is trained on four scenes and tested on the fifth, and this procedure is repeated five times, once for each scene. Since these two datasets are mainly used jointly, from now onward the two datasets together will be referred to as the ETH-UCY dataset. Data was downloaded from the Social GAN repository Gupta (2018), except for the ETH scene, for which the original dataset was used Pellegrini et al. (2009). Figure 6 shows a scene from the UCY dataset.
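The leave-one-out procedure over the five scenes can be sketched as follows. The scene names come from the text, while the function name is hypothetical and the actual training and evaluation steps are omitted.

```python
from typing import Iterator, List, Tuple

SCENES = ["ETH", "Hotel", "Univ", "Zara1", "Zara2"]


def leave_one_out_splits(scenes: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training scenes, held-out test scene) for every scene in turn."""
    for held_out in scenes:
        train = [scene for scene in scenes if scene != held_out]
        yield train, held_out


splits = list(leave_one_out_splits(SCENES))
for train_scenes, test_scene in splits:
    # A model would be trained on `train_scenes` and evaluated on `test_scene` here.
    print(f"train on {train_scenes}, test on {test_scene}")
```

The per-scene columns in the result tables correspond to the held-out scene of each of the five runs.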
A more recent dataset is the Trajectory Forecasting Benchmark (also known as TrajNet) Sadeghian et al. (2018). It is a curated collection of datasets, comprising more than 8000 pedestrian trajectories in total. It merges the ETH, UCY, Stanford Drone Dataset Robicquet et al. (2016) and PETS2009 Ferryman and Shahrokni (2009) datasets. The Stanford Drone Dataset contributes the majority of the pedestrian tracks. One frame is annotated with pedestrian positions every 0.4 seconds. The data has already been split into training and test sets by the authors, and for the test set only the observed positions are available. The test error can be computed only by submitting the obtained predictions to the official dataset site Sadeghian et al. (2018), where a leaderboard is also present. A scene from the Stanford Drone Dataset can be seen in Figure 7.
It is common practice in the literature to set the number of observed positions to 8 and the number of predicted positions to 12. Works that do this include Alahi et al. (2016) Gupta et al. (2018) Sadeghian et al. (2019) Nikhil and Morris (2018) and many others. Thus, for the sake of comparing the developed models with other architectures, the same setting is used.
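Predicted trajectories are evaluated with two metrics. The first (and most important) metric used is the Average Displacement Error (ADE), which was introduced in Pellegrini et al. (2009). The ADE is the error between the predicted points and the ground truth points from $T_{obs}$ to $T_{pred}-1$ (the first and last predicted time steps), averaged over all predicted time steps and all pedestrians. The ADE formula is the following:
$$\mathrm{ADE} = \frac{1}{n\,(T_{pred}-T_{obs})} \sum_{i=1}^{n} \sum_{t=T_{obs}}^{T_{pred}-1} \left\| \hat{Y}_t^i - Y_t^i \right\|$$
The number of pedestrians is $n$, the predicted coordinates for pedestrian $i$ at time $t$ are $\hat{Y}_t^i$, the real future positions are $Y_t^i$ and $\|\cdot\|$ is the Euclidean distance.
The second metric used is the Final Displacement Error (FDE), which was also introduced in Pellegrini et al. (2009). The FDE is the error between the predicted position and the real position at the last predicted time step, $T_{pred}-1$. The FDE formula is the following:
$$\mathrm{FDE} = \frac{1}{n} \sum_{i=1}^{n} \left\| \hat{Y}_{T_{pred}-1}^i - Y_{T_{pred}-1}^i \right\|$$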
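As an illustration, both metrics can be computed with a few lines of plain Python. This is a minimal sketch that assumes every pedestrian has the same number of predicted positions; the function names and the toy data are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Trajectories = List[List[Point]]  # one list of predicted-horizon points per pedestrian


def ade(predicted: Trajectories, ground_truth: Trajectories) -> float:
    """Average Euclidean distance over all pedestrians and all predicted time steps."""
    total, count = 0.0, 0
    for pred, real in zip(predicted, ground_truth):
        for (px, py), (rx, ry) in zip(pred, real):
            total += math.hypot(px - rx, py - ry)
            count += 1
    return total / count


def fde(predicted: Trajectories, ground_truth: Trajectories) -> float:
    """Average Euclidean distance at the last predicted time step."""
    errors = [math.hypot(pred[-1][0] - real[-1][0], pred[-1][1] - real[-1][1])
              for pred, real in zip(predicted, ground_truth)]
    return sum(errors) / len(errors)


# Toy example with two pedestrians and two predicted steps each (meters).
predicted = [[(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 2.0)]]
ground_truth = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]

example_ade = ade(predicted, ground_truth)  # distances 0, 1, 0, 2 -> 0.75
example_fde = fde(predicted, ground_truth)  # final distances 1 and 2 -> 1.5
```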
4.3 Implementation details
For the ETH-UCY dataset, each network was trained for 60 epochs, with a learning rate of 0.005 and a step scheduler with gamma 0.5 and step 17. For the TrajNet dataset, each network was trained for 250 epochs, with a learning rate of 0.005 and a step scheduler with gamma 0.75 and step 35. The optimizer used was Adam. The loss used was the ADE. For the baselines, the LSTM cell size was 128, and the output dimensions of the two output fully connected layers were 64 and 2 respectively. The basic 1D convolutional model has the same number of layers as the 2D model in Figure 4. The differences lie in the number of channels, which is 64 for each layer, and the absence of batch normalization. For the Gaussian noise, the standard deviation is set to 0.05 and the mean to 0. For the mirroring, there is a 25% probability of mirroring a sample on one axis and a 50% probability of not applying any mirroring at all. For the social occupancy information, grid results are obtained using 10 cells per side, each cell with a side of 0.5m. Occupancy circle results are obtained using 12 circles 0.5m apart from each other. Angular pedestrian grid results are obtained using 8 degrees per element.
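The learning-rate schedules described above can be reproduced with a small helper. This sketch assumes the usual step-decay convention, in which the learning rate is multiplied by gamma once every `step_size` epochs (the behaviour of PyTorch's `StepLR`, for example); the text does not state which implementation was used, and the function name is hypothetical.

```python
def step_lr(initial_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate at a given (0-indexed) epoch under step decay."""
    return initial_lr * gamma ** (epoch // step_size)


# ETH-UCY setting: 60 epochs, gamma 0.5, step 17.
eth_ucy_schedule = [step_lr(0.005, 0.5, 17, epoch) for epoch in range(60)]

# TrajNet setting: 250 epochs, gamma 0.75, step 35.
trajnet_schedule = [step_lr(0.005, 0.75, 35, epoch) for epoch in range(250)]
```

Under this convention the ETH-UCY learning rate is halved three times during training, at epochs 17, 34 and 51.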
4.4 Results data pre-processing
Results obtained by training the 1D convolutional model with different coordinate normalization approaches can be found in Table 1. The best coordinate normalization is the one in which the origin is in the last observation point, since it achieves the lowest average ADE, although it is not the best in every individual scene. A possible explanation is that the last observation point is the most important one, since it is the most recent. Therefore, if the origin is placed in that position, the whole trajectory is seen through the lens of the most important point and thus the network better understands the whole trajectory.
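Table 1: ADE / FDE (in meters) of the 1D convolutional model with different coordinate normalizations.
| Coordinates | ETH | Hotel | Univ | Zara1 | Zara2 | Average |
|---|---|---|---|---|---|---|
| Absolute | 1.165 / 1.910 | 10.693 / 11.323 | 0.727 / 1.489 | 0.443 / 0.920 | 0.389 / 0.797 | 2.684 / 3.288 |
| Origin in the first observation point | 0.731 / 1.393 | 0.513 / 1.006 | 0.704 / 1.405 | 0.432 / 0.923 | 0.330 / 0.693 | 0.542 / 1.084 |
| Origin in the last observation point | 0.694 / 1.381 | 0.568 / 1.241 | 0.667 / 1.371 | 0.411 / 0.893 | 0.324 / 0.694 | 0.533 / 1.116 |
| Relative | 0.791 / 1.492 | 0.533 / 1.107 | 0.699 / 1.366 | 0.403 / 0.864 | 0.327 / 0.696 | 0.550 / 1.105 |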
Regarding the data augmentation techniques, their effects are shown in Table 2. Adding Gaussian noise is effective, but an even lower average error is achieved by applying random rotations to the whole trajectory. However, when random rotations and mirroring are applied together, results are, on average, very similar to those obtained using only random rotations. If random rotations and Gaussian noise are applied together, the lowest average error is achieved, even if in some scenes the error actually increases. Thus, it is possible to conclude that adding Gaussian noise with mean 0 to every point and applying random rotations to the whole trajectory is beneficial.
It is important to note that the same conclusions, both for coordinate normalization and for data augmentation techniques, can be drawn when training the recurrent baselines. This suggests that these findings are applicable to a multitude of architectures and not only to convolutional models.
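Table 2: ADE / FDE (in meters) of the 1D convolutional model with different data augmentation techniques.
| Data augmentation | ETH | Hotel | Univ | Zara1 | Zara2 | Average |
|---|---|---|---|---|---|---|
| Gaussian noise | 0.592 / 1.220 | 0.445 / 1.011 | 0.669 / 1.375 | 0.424 / 0.903 | 0.337 / 0.720 | 0.493 / 1.046 |
| Random rotations | 0.668 / 1.296 | 0.318 / 0.603 | 0.576 / 1.210 | 0.471 / 1.046 | 0.349 / 0.763 | 0.476 / 0.983 |
| Random rotations and mirroring | 0.702 / 1.390 | 0.298 / 0.596 | 0.571 / 1.211 | 0.452 / 0.999 | 0.330 / 0.726 | 0.471 / 0.984 |
| Random rotations and Gaussian noise | 0.605 / 1.190 | 0.264 / 0.509 | 0.588 / 1.241 | 0.521 / 1.095 | 0.351 / 0.755 | 0.466 / 0.958 |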
4.5 Results convolutional model variations
Results obtained with different convolutional model variations are shown in Table 3. These results suggest that models with a bigger kernel size are able to generate more refined predictions, since the 1D convolutional model with kernel size 7 obtains better results than the same model with kernel size 3. The intuition behind why a bigger kernel size might be better is that the more information a kernel can process, the better it can interpret complex behaviours in the trajectory. This idea still applies when the 1D convolutional model is compared with the 2D convolutional model. In the former, the kernel looks at the same feature over multiple timesteps. In the latter, instead, the kernel looks at multiple features over multiple timesteps, and thus it processes more information and generates better predictions. However, this intuition has diminishing returns: experiments with the 2D convolutional model using kernel size 7 generated slightly worse results compared to the same 2D model with kernel size 5.
Regarding other convolutional model variations, using positional embedding and transpose convolutions proved to be ineffective. Moreover, adding residual connections also did not improve results, since the optimal number of convolutional layers is quite limited (7, as Figure 4 shows) and thus residual connections are not needed.
Table 4 offers a comparison between the baselines and the proposed convolutional models. The 1D convolutional model is able to outperform the recurrent baselines only when using a bigger kernel size, while the best model is the 2D convolutional one. Thus, we can conclude that it is indeed possible to develop a convolutional model capable of outperforming recurrent models in pedestrian trajectory prediction. However, it is interesting to note that the difference in average error between the recurrent baselines and the convolutional models is not large.
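Table 3: ADE / FDE (in meters) of the convolutional model variations.
| Model | ETH | Hotel | Univ | Zara1 | Zara2 | Average |
|---|---|---|---|---|---|---|
| 1D convolutional model, kernel size 3 | 0.605 / 1.190 | 0.264 / 0.509 | 0.588 / 1.241 | 0.521 / 1.095 | 0.351 / 0.755 | 0.466 / 0.958 |
| 1D convolutional model, kernel size 7 | 0.560 / 1.149 | 0.246 / 0.472 | 0.590 / 1.249 | 0.478 / 1.046 | 0.346 / 0.737 | 0.444 / 0.931 |
|  | 0.568 / 1.125 | 0.248 / 0.467 | 0.594 / 1.257 | 0.459 / 0.990 | 0.369 / 0.789 | 0.447 / 0.926 |
|  | 0.606 / 1.197 | 0.267 / 0.517 | 0.595 / 1.254 | 0.451 / 0.989 | 0.356 / 0.762 | 0.455 / 0.944 |
|  | 0.560 / 1.121 | 0.245 / 0.470 | 0.589 / 1.251 | 0.516 / 1.073 | 0.349 / 0.741 | 0.452 / 0.931 |
| 2D convolutional model | 0.559 / 1.114 | 0.240 / 0.464 | 0.581 / 1.225 | 0.456 / 0.993 | 0.347 / 0.751 | 0.436 / 0.909 |

Table 4: ADE / FDE (in meters) of the recurrent baselines and the proposed convolutional models.
| Model | ETH | Hotel | Univ | Zara1 | Zara2 | Average |
|---|---|---|---|---|---|---|
| LSTM baseline | 0.581 / 1.168 | 0.259 / 0.503 | 0.578 / 1.214 | 0.463 / 1.022 | 0.347 / 0.748 | 0.446 / 0.936 |
| Enc-Dec baseline | 0.585 / 1.170 | 0.246 / 0.491 | 0.589 / 1.245 | 0.467 / 1.023 | 0.360 / 0.737 | 0.449 / 0.938 |
| 1D convolutional model, kernel size 7 | 0.560 / 1.149 | 0.246 / 0.472 | 0.590 / 1.249 | 0.478 / 1.046 | 0.346 / 0.737 | 0.444 / 0.931 |
| 2D convolutional model | 0.559 / 1.114 | 0.240 / 0.464 | 0.581 / 1.225 | 0.456 / 0.993 | 0.347 / 0.751 | 0.436 / 0.909 |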
4.6 Results using social information
Results of Table 5, in which the 2D convolutional model is trained with social information, are unexpected: the addition of social information proved to be ineffective on the ETH-UCY dataset.
Similar results are also obtained with the Encoder-Decoder baseline: architectures that use the proposed social occupancy information methods are not able to outperform the same architectures without social information.
This is indicated by the fact that networks with social information obtain results very similar to those of networks without it, as if occupancy information were not relevant. Upon further investigation, it was found that the average gradient flow in the social information embedding weights of the networks was around 50-100 times smaller than the average gradient flow in the position embedding weights. This might suggest that for the network there is very little correlation between the real future trajectory and social information, and thus this kind of information is almost ignored. An example of the gradient flow in the network can be found in Figure 8.
Results on the addition of social information are to be considered mainly as an exploratory analysis. Much more can be done (and has been done) to include social information as input to a model in pedestrian trajectory prediction. What our results show is that the specific approach to use occupancy information that we tested, in combination with the presented architectures, failed to improve results on the ETH-UCY dataset.
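Table 5: ADE / FDE (in meters) of the 2D convolutional model trained with social information.
| Social information | ETH | Hotel | Univ | Zara1 | Zara2 | Average |
|---|---|---|---|---|---|---|
|  | 0.558 / 1.118 | 0.233 / 0.455 | 0.604 / 1.269 | 0.464 / 1.005 | 0.342 / 0.740 | 0.440 / 0.915 |
|  | 0.561 / 1.122 | 0.235 / 0.447 | 0.590 / 1.240 | 0.461 / 0.991 | 0.348 / 0.746 | 0.439 / 0.910 |
|  | 0.567 / 1.109 | 0.235 / 0.449 | 0.589 / 1.231 | 0.464 / 0.997 | 0.337 / 0.719 | 0.438 / 0.901 |
| None (2D convolutional model) | 0.559 / 1.114 | 0.240 / 0.464 | 0.581 / 1.225 | 0.456 / 0.993 | 0.347 / 0.751 | 0.436 / 0.909 |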
4.7 Comparison with literature on the ETH-UCY dataset
The following models from literature have been chosen to do a comparison with the results obtained on the ETH-UCY dataset:
A simple LSTM, trained by Gupta et al. (2018);
Convolutional Neural Networks for trajectory prediction (shortened to CNN in the table) Nikhil and Morris (2018), the convolutional model developed by Nikhil and Morris;
Social-GAN Gupta et al. (2018), a generative model that uses social information;
SoPhie Sadeghian et al. (2019), a generative model that uses both social and image information;
Stochastic Trajectory Prediction with Social Graph Network (Stochastic GNN) Zhang et al. (2019a), a generative model that uses social information and GNN;
MCNET Cheng et al. (2020), a generative model based on a CVAE that uses both social and image information;
Conditional Generative Neural System (CGNS) Li et al. (2019), a generative model based on a CVAE that uses both social and image information;
Social-BiGAT Kosaraju et al. (2019), a generative model that uses both social and image information;
SR-LSTM Zhang et al. (2019b), a model based on the state refinement of the LSTM cells of all the pedestrians in the scene to account for social interaction;
Social Spatio-Temporal Graph Convolutional Neural Network (STGCNN) Mohamed et al. (2020), a generative model that uses social information and GNN;
STGAT Huang et al. (2019), a generative model that uses social information and GNN.
The result comparison can be found in Table 6. There, the 2D convolutional model achieves the lowest error on the ETH dataset and an average error on the whole ETH-UCY dataset comparable to the STGAT and STGCNN models. On the UCY dataset, however, other models surpass the 2D convolutional model. This might be due to the fact that in the ETH dataset the pedestrian density is lower, while in the UCY dataset there are more pedestrians per scene and thus social interaction, which is not taken into account by the 2D convolutional model, is more important. The recurrent baselines also achieve a very low error, especially if the baseline LSTM is compared to the LSTM trained by Gupta et al. (2018), thanks to the employed data pre-processing techniques.
Table 6: Comparison with literature on the ETH-UCY dataset (ADE / FDE, in meters).
| Model | ETH | Hotel | Univ | Zara1 | Zara2 | Average |
|---|---|---|---|---|---|---|
| Linear (as reported by Gupta et al. (2018)) | 1.33 / 2.94 | 0.39 / 0.72 | 0.82 / 1.59 | 0.62 / 1.21 | 0.77 / 1.48 | 0.79 / 1.59 |
| Social LSTM (as reported by Gupta et al. (2018)) | 1.09 / 2.35 | 0.79 / 1.76 | 0.67 / 1.40 | 0.47 / 1.00 | 0.56 / 1.17 | 0.72 / 1.54 |
| LSTM (from Gupta et al. (2018)) | 1.09 / 2.41 | 0.86 / 1.91 | 0.61 / 1.31 | 0.41 / 0.88 | 0.52 / 1.11 | 0.70 / 1.52 |
| CNN Nikhil and Morris (2018) | 1.04 / 2.07 | 0.59 / 1.17 | 0.57 / 1.21 | 0.43 / 0.90 | 0.34 / 0.75 | 0.59 / 1.22 |
| Social GAN Gupta et al. (2018) | 0.81 / 1.52 | 0.72 / 1.61 | 0.60 / 1.29 | 0.34 / 0.69 | 0.42 / 0.84 | 0.58 / 1.18 |
| SoPhie Sadeghian et al. (2019) | 0.70 / 1.43 | 0.76 / 1.67 | 0.54 / 1.24 | 0.30 / 0.63 | 0.38 / 0.78 | 0.54 / 1.15 |
| Stochastic GNN Zhang et al. (2019a) | 0.75 / 1.63 | 0.63 / 1.01 | 0.48 / 1.08 | 0.30 / 0.65 | 0.26 / 0.57 | 0.49 / 1.01 |
| MCNET Cheng et al. (2020) | 0.75 / 1.61 | 0.37 / 0.68 | 0.58 / 1.18 | 0.33 / 0.65 | 0.23 / 0.49 | 0.49 / 0.98 |
| CGNS Li et al. (2019) | 0.62 / 1.40 | 0.70 / 0.93 | 0.48 / 1.22 | 0.32 / 0.59 | 0.35 / 0.71 | 0.49 / 0.97 |
| Social-BiGAT Kosaraju et al. (2019) | 0.69 / 1.29 | 0.48 / 1.01 | 0.55 / 1.32 | 0.30 / 0.62 | 0.36 / 0.75 | 0.48 / 1.00 |
| SR-LSTM Zhang et al. (2019b) | 0.63 / 1.25 | 0.37 / 0.74 | 0.51 / 1.10 | 0.41 / 0.90 | 0.32 / 0.70 | 0.45 / 0.94 |
| Social STGCNN Mohamed et al. (2020) | 0.64 / 1.11 | 0.49 / 0.85 | 0.44 / 0.79 | 0.34 / 0.53 | 0.30 / 0.48 | 0.44 / 0.75 |
| STGAT Huang et al. (2019) | 0.65 / 1.12 | 0.35 / 0.66 | 0.52 / 1.10 | 0.34 / 0.69 | 0.29 / 0.60 | 0.43 / 0.83 |
| LSTM baseline | 0.581 / 1.168 | 0.259 / 0.503 | 0.578 / 1.214 | 0.463 / 1.022 | 0.346 / 0.748 | 0.446 / 0.936 |
| Enc-Dec baseline | 0.585 / 1.170 | 0.246 / 0.491 | 0.589 / 1.245 | 0.467 / 1.023 | 0.360 / 0.771 | 0.449 / 0.938 |
| 1D convolutional model, kernel size 7 | 0.560 / 1.149 | 0.246 / 0.472 | 0.590 / 1.249 | 0.478 / 1.046 | 0.346 / 0.737 | 0.444 / 0.931 |
| 2D convolutional model | 0.559 / 1.114 | 0.240 / 0.464 | 0.581 / 1.225 | 0.456 / 0.993 | 0.347 / 0.751 | 0.436 / 0.909 |
4.8 Comparison with literature on the TrajNet dataset
The following models from the literature have been chosen for comparison with the results obtained on the TrajNet dataset:
Social LSTM Alahi et al. (2016), results are taken from the TrajNet site;
Social GAN Gupta et al. (2018), results are taken directly from the TrajNet site;
Location-Velocity Attention Xue et al. (2019), a model that uses location and velocity in two different LSTMs with an attention layer between the two cells; the results are taken directly from the paper;
SR-LSTM Zhang et al. (2019b), the results are taken directly from the TrajNet site;
The result comparison can be found in Table 7. The 2D convolutional model achieves state-of-the-art performance on the TrajNet dataset, making it the best model on the biggest publicly available dataset for pedestrian trajectory prediction. The LSTM baseline also achieves a very low error, lower than RED V2, reaffirming the importance of data pre-processing techniques. Finally, the analyzed techniques for modelling social interaction proved to be ineffective on the TrajNet data as well (results using a circular occupancy map are omitted from Table 7 because they are very similar to those of the square occupancy grid). In fact, both the 2D convolutional model and the Encoder-Decoder baseline outperform their variants that use social information.
|Model||ADE||FDE|
|Social LSTM Alahi et al. (2016)||0.675||2.098|
|Social GAN Gupta et al. (2018)||0.561||2.107|
|Location-Velocity Attention Xue et al. (2019)||0.438||1.449|
|Social Forces Helbing and Molnár (1995) (from Becker et al. (2018))||0.371||1.266|
|SR-LSTM Zhang et al. (2019b)||0.370||1.261|
|Enc-Dec baseline with square occupancy grid||0.369||1.231|
|Convolutional 2D model with angular pedestrian grid||0.366||1.223|
|Convolutional 1D model||0.365||1.220|
|Enc-Dec baseline with angular pedestrian grid||0.364||1.218|
|Convolutional 2D model with square occupancy grid||0.360||1.215|
|RED V2 Becker et al. (2018)||0.359||1.207|
|Convolutional 2D model||0.352||1.192|
The ADE and the FDE are not the only aspects that can be taken into consideration when evaluating a pedestrian trajectory prediction model. Other characteristics are the computational time and the number of hyperparameters. Moreover, no model is always accurate and for future improvements it is also important to understand when the proposed architecture fails.
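As a reminder of what the two metrics measure, the following is a minimal sketch that assumes the usual definitions: ADE is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and FDE is that distance at the last predicted time step. The function name `displacement_errors` and the toy trajectories are illustrative only and are not taken from this work.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for one trajectory given as lists of (x, y) points."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Euclidean distance between prediction and ground truth at every time step.
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(distances) / len(distances)  # average over all predicted steps
    fde = distances[-1]                    # error at the final predicted step
    return ade, fde


# Toy example: the prediction goes straight while the pedestrian drifts upward.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(predicted, ground_truth)
print(f"ADE={ade:.3f}, FDE={fde:.3f}")
```

In this toy case the per-step distances are 0, 1 and 2, so the FDE is twice the ADE, which shows why a model can look acceptable on average while still missing the final position by a wide margin.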
5.1 Convolutional model and recurrent models comparison
Analyzing the recurrent baselines and the convolutional model beyond their quantitative results, three main differences have emerged. The first is computation time. As can be seen in Table 8, the convolutional model is more than three times faster than the Encoder-Decoder baseline and more than four times faster than the LSTM baseline at test time. These results are also valid during training time. Thus, the convolutional model is not only more accurate but also more efficient than the recurrent baselines.
|Model||batch size=1||batch size=32|
|Convolutional 2D model (155k parameters)||0.0033s per element||0.00017s per element|
|(106k parameters)||per element||per element|
|(208k parameters)||per element||per element|
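A per-element inference time like the one reported for the two batch sizes could be measured along the following lines. This is only a sketch under stated assumptions: `dummy_predict` is a hypothetical stand-in for a trained model, the 8 observed and 12 predicted positions are assumed values, and the measurement procedure actually used for Table 8 is not described in this section.

```python
import time


def seconds_per_element(predict, batch, repeats=100):
    """Average wall-clock time per trajectory when predicting one batch."""
    predict(batch)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        predict(batch)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(batch))


def dummy_predict(batch):
    # Hypothetical stand-in for a model: repeats the last observed position
    # for 12 future steps (assumed prediction horizon).
    return [[trajectory[-1]] * 12 for trajectory in batch]


def make_batch(batch_size, observed_steps=8):
    # Synthetic straight-line trajectories with an assumed 8 observed positions.
    return [[(0.1 * t, 0.0) for t in range(observed_steps)]
            for _ in range(batch_size)]


for batch_size in (1, 32):
    t = seconds_per_element(dummy_predict, make_batch(batch_size))
    print(f"batch size={batch_size}: {t:.2e}s per element")
```

Dividing by the batch size is what makes the two columns comparable. For a model running on a GPU, the device would also have to be synchronized before each clock reading, otherwise only the kernel launch time is measured.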
The second difference between the recurrent models and the convolutional model is the number of hyperparameters. The LSTM and Encoder-Decoder baselines have a very small number of hyperparameters (embedding size, hidden state length and the dimension of the output fully connected layers). The convolutional model, in contrast, has a larger number of hyperparameters (embedding size, number of layers, number of channels for each layer and kernel size for each layer). Therefore, the convolutional model requires more hyperparameter tuning than the recurrent models.
The third difference is flexibility. The recurrent models can be trained to observe, for example, 6 positions and predict the next 16 without any change in the architecture. This is not true of the convolutional model, in which some adjustments must be made, probably to the upsampling layer and to the convolutional layers without padding.
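The reason for this rigidity can be made concrete with the length arithmetic of unpadded convolutions: with stride 1, a kernel of size k shortens a sequence of length n to n - k + 1. The sketch below uses a hypothetical configuration (an upsampling factor of 2 followed by two unpadded layers with kernel size 3), chosen only so that 8 observed positions yield 12 predicted ones; it is not the exact layer layout of the proposed model.

```python
def unpadded_conv_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding."""
    out = length - kernel_size + 1
    if out < 1:
        raise ValueError("kernel is longer than the input sequence")
    return out


def output_length(observed_steps, upsample_factor, kernel_sizes):
    """Sequence length after upsampling followed by unpadded convolutions."""
    length = observed_steps * upsample_factor
    for kernel_size in kernel_sizes:
        length = unpadded_conv_length(length, kernel_size)
    return length


# Hypothetical configuration tuned for 8 observed -> 12 predicted positions.
print(output_length(8, 2, [3, 3]))  # 8 -> 16 -> 14 -> 12
# The same layers applied to 6 observed positions give 8 outputs, not 16.
print(output_length(6, 2, [3, 3]))  # 6 -> 12 -> 10 -> 8
```

Because the output length is fixed by the upsampling factor and the kernel sizes, a different observation or prediction horizon requires changing those layers, whereas a recurrent decoder can simply be unrolled for more steps.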
We can therefore conclude that the convolutional model is more efficient and accurate than the recurrent baselines, but it is less flexible and requires more hyperparameter tuning.
5.2 Failure cases
In some applications of pedestrian trajectory prediction, such as autonomous driving, it is important to have not only a small average error but also a small maximum error. How well the proposed 2D convolutional model satisfies this constraint can be seen by looking at the distribution of the Average Displacement Error in Figure 9. There, the prediction error distribution resembles a Gaussian curve with a long tail. Analyzing the poor predictions in the long tail, we found three scenarios in which the prediction error is consistently high:
Sharp turns. In this case, the typical scenario is the following: a person is going straight and then makes a 90-degree turn because the road either turns or forks. An example of such behaviour can be seen in Figure 10. In scenarios like this, it is reasonable to assume that only models including spatial information can predict the turn reliably. What models that do not include spatial information can learn is to adapt quickly to sharp changes in trajectory, as shown in Figure 10.
Pedestrians stopping. In this case, it is often difficult to understand the reasons for this kind of behaviour: a person could stop to look at shop windows, to check before crossing the street, to greet friends, or simply to wait for someone else. Spatial information could help in some of these scenarios, but not in all.
Pedestrians that resume walking after stopping. This kind of behaviour follows the previous one, and it is even more difficult to predict. If a person is standing still, it is very difficult to determine the exact moment when they will resume moving. The safest assumption is that the pedestrian will remain still, which leads to a very high error if the network observation ends a few moments before the person starts walking.
Analyzing these three scenarios, it is possible to affirm that the inclusion of spatial information could be very effective in reducing the instances in which the error is very high. Consequently, the inclusion of spatial information in the convolutional model appears to be a promising direction for future work.
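The long-tail cases discussed above can be isolated for inspection by thresholding the per-trajectory ADE at a high quantile. The following is a minimal sketch: the function name `long_tail_indices`, the 0.9 quantile and the error values are illustrative assumptions and do not reproduce the analysis behind Figure 9.

```python
def long_tail_indices(errors, quantile=0.95):
    """Return (indices, threshold) of errors at or above the given quantile."""
    if not errors:
        raise ValueError("errors must not be empty")
    ordered = sorted(errors)
    # Index of the quantile in the sorted list, clamped to the last element.
    cutoff = min(int(quantile * len(ordered)), len(ordered) - 1)
    threshold = ordered[cutoff]
    indices = [i for i, e in enumerate(errors) if e >= threshold]
    return indices, threshold


# Made-up per-trajectory ADE values with one clear outlier at index 9.
ade_per_trajectory = [0.20, 0.30, 0.25, 0.40, 0.35, 0.30, 0.28, 0.33, 0.31, 2.10]
tail, threshold = long_tail_indices(ade_per_trajectory, quantile=0.9)
print(tail, threshold)
```

The returned indices identify the trajectories to inspect by hand, for instance to check whether they correspond to sharp turns, stops, or pedestrians who resume walking.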
In this work, we first compared various data pre-processing techniques for pedestrian trajectory prediction. We found that the best combination is obtained by using coordinates with the origin at the last observation point as data normalization and by applying Gaussian noise and random rotations as data augmentation. This solution proved to be effective in multiple architectures, both convolutional and recurrent, which suggests that these findings are general.
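The pre-processing combination summarized above can be sketched as follows. Several details are assumptions made only for illustration: the rotation angle is drawn uniformly from the full circle, noise is added to every point of the trajectory, augmentation is applied before normalization so that the last observed point lands exactly on the origin, and the noise standard deviation of 0.05 is a hypothetical value.

```python
import math
import random


def augment(observed, future, noise_std, rng):
    """Rotate the whole trajectory by a random angle and add Gaussian noise."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    def transform(points):
        result = []
        for x, y in points:
            rx = cos_a * x - sin_a * y  # rotation about the coordinate origin
            ry = sin_a * x + cos_a * y
            result.append((rx + rng.gauss(0.0, noise_std),
                           ry + rng.gauss(0.0, noise_std)))
        return result

    return transform(observed), transform(future)


def to_last_observation_frame(observed, future):
    """Shift all coordinates so that the last observed point is the origin."""
    origin_x, origin_y = observed[-1]

    def shift(points):
        return [(x - origin_x, y - origin_y) for x, y in points]

    return shift(observed), shift(future)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
observed = [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
future = [(4.0, 1.0), (5.0, 1.0)]
obs_aug, fut_aug = augment(observed, future, noise_std=0.05, rng=rng)
obs_norm, fut_norm = to_last_observation_frame(obs_aug, fut_aug)
print(obs_norm[-1])  # the last observed point is now the origin
```

The same random angle is applied to the observed and future parts, so the shape of the trajectory is preserved and only its orientation changes.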
We also proposed a new convolutional model for pedestrian trajectory prediction that uses 2D convolution. This new model is able to outperform the recurrent baselines, both in average error and in computational time, and it achieves state-of-the-art results on the ETH and TrajNet datasets.
As an additional exploratory analysis, we also presented empirical results on the inclusion of social occupancy information. Our results suggest that the inclusion of social occupancy information does not reduce the prediction error.
Accompanying these quantitative results, a comparison between the convolutional and recurrent models was presented, together with an analysis of the most common failure scenarios in the predictions.
This work is the result of Simone Zamboni’s master's thesis project carried out at SCANIA Autonomous Transport Systems. We thank the industry partner, SCANIA, and the university partner, KTH, for their support.
- Social LSTM: Human trajectory prediction in crowded spaces. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 961–971. Cited by: §1, §2, item 1, item 3, item 1, §4.2, Table 6, Table 7.
- Optimal reciprocal collision avoidance for multiple non-holonomic robots. In Distributed Autonomous Robotic Systems, Vol. 83, pp. 203–216. Cited by: §2.
- Social ways: Learning multi-modal distributions of pedestrian trajectories with GANs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Cited by: §2.
- Convolutional image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition., pp. 5561–5570. Cited by: §1, §3.3.
- Neural machine translation by jointly learning to align and translate. In ArXiv, Vol. arXiv:1409.0473. Cited by: §2.
- Context-aware trajectory prediction. In 24th International Conference on Pattern Recognition (ICPR), pp. 1941–1946. Cited by: §2.
- An evaluation of trajectory prediction approaches and notes on the TrajNet benchmark. In arXiv preprint, Vol. arXiv:1805.07663. Cited by: item 4, item 6, Table 7.
- MCENET: Multi-context encoder network for homogeneous agent trajectory prediction in mixed traffic. In arXiv preprint, Vol. arXiv/2002.05966. Cited by: §2, item 8, Table 6.
- Soft + hardwired attention: An LSTM framework for human trajectory prediction and abnormal event detection. In Neural networks : the official journal of the International Neural Network Society, Vol. 108, pp. 466–478. Cited by: §1, §2.
- PETS2009: Dataset and challenge. In Twelfth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, pp. 1–6. Cited by: §4.1.
- Convolutional sequence to sequence learning. In arXiv preprint, Vol. arXiv/1705.03122. Cited by: §1, item 1, §3.3.
- Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680. Cited by: §2.
- Social GAN: Socially acceptable trajectories with generative adversarial networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2255–2264. Cited by: §1, §2, item 1, item 2, item 3, item 5, item 2, §4.2, §4.7, Table 6, Table 7.
- Social GAN repository. Note: https://github.com/agrimgupta92/sgan Cited by: §4.1.
- Situation-aware pedestrian trajectory prediction with spatio-temporal attention model. In 24th Computer Vision Winter Workshop, Cited by: §2.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: item 3.
- Social force model for pedestrian dynamics. In Physical Review E, Vol. 51, pp. 4282–4286. Cited by: §1, §2, item 4, Table 7.
- Long Short-term Memory. In Neural computation, Vol. 9, pp. 1735–80. Cited by: §2.
- STGAT: Modeling spatial-temporal interactions for human trajectory prediction. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2, item 13, Table 6.
- The trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2375–2384. Cited by: §2.
- BRVO: Predicting pedestrian trajectories using velocity-space reasoning. In The International Journal of Robotics Research, Vol. 34, pp. 201–217. Cited by: §1, §2.
- Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. In Advances in Neural Information Processing Systems 32, pp. 137–146. Cited by: §2, item 10, Table 6.
- Crowds by Example. In Computer Graphics Forum, Vol. 26, pp. 655–664. Cited by: Figure 6, §4.1.
- Conditional generative neural system for probabilistic trajectory prediction. In arXiv, Vol. arXiv:1905.01631. Cited by: §2, item 9, Table 6.
- Social and scene-aware trajectory prediction in crowded spaces. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Cited by: §2.
- Path predictions using object attributes and semantic environment. In 14th International Conference on Computer Vision Theory and Applications, pp. 19–26. Cited by: §2.
- Social-stgcnn: A social spatio-temporal graph convolutional neural network for human trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14424–14432. Cited by: §2, item 12, Table 6.
- Convolutional neural networks for trajectory prediction. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1, §3.3, item 4, §4.2, Table 6.
- You’ll never walk alone: Modeling social behavior for multi-target tracking. In IEEE 12th International Conference on Computer Vision, pp. 261–268. Cited by: §4.1, §4.2.
- A data-driven model for interaction-aware pedestrian motion prediction in object cluttered environments. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §1, §2, item 3.
- Learning social etiquette: Human trajectory understanding in crowded scenes. In ECCV 2016, Vol. 9912, pp. 549–565. Cited by: Figure 7, §4.1.
- Trajnet official website. Note: http://trajnet.stanford.edu/ Cited by: §4.1.
- TrajNet: Towards a benchmark for human trajectory prediction. In arXiv preprint, Cited by: §4.1.
- SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1349–1358. Cited by: §1, §2, §3.3, item 6, §4.2, Table 6.
- Socially-aware graph convolutional network for human trajectory prediction. In IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), pp. 325–333. Cited by: §2.
- Reciprocal n-Body collision avoidance. In Robotics Research, pp. 3–19. Cited by: §2.
- Social attention: Modeling attention in human crowds. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–7. Cited by: §1, §2.
- GD-GAN: Generative adversarial networks for trajectory prediction and group detection in crowds. In ACCV 2018: 14th Asian Conference on Computer Vision, pp. 314–330. Cited by: §2.
- Encoding crowd interaction with deep neural network for pedestrian trajectory prediction. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5275–5284. Cited by: §2.
- SS-LSTM: A hierarchical LSTM model for pedestrian trajectory prediction. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1186–1194. Cited by: §2, item 2.
- Location-velocity attention for pedestrian trajectory prediction. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 2038–2047. Cited by: §2, item 3, Table 7.
- Stochastic trajectory prediction with social graph network. In arXiv preprint, Vol. arXiv:1907.10233. Cited by: §2, item 7, Table 6.
- SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12085–12094. Cited by: item 11, item 5, Table 6, Table 7.