Forecasting the future locations of humans is essential for developing safe autonomous agents that interact with people. Autonomous systems such as household and factory robots, security systems and self-driving cars need to know where people will be, given their previous movement, in order to plan safe paths.
Predicting people’s future movement is challenging because the target location of a person cannot be known. The static and dynamic layout of the environment, be it a road network or an indoor map, complicates the forecasting. Static obstacles place restrictions beyond their narrow spatial extent (for example, pedestrians tend not to walk between closely parked cars). With dynamic obstacles like other pedestrians, bicycles or moving cars, one must not only minimize direct future collisions but also take into account hidden social rules (keeping to the right, respecting personal space). This becomes more complex since each of the dynamic obstacles might be heading to one of multiple unknown target locations. Consequently, predicting all the possible and plausible future trajectories is a multi-modal problem. Respecting the hidden rules and addressing these challenges is key for systems that are not just accurate but also safe.
Computationally tractable generative models have led to a great number of methods [10, 24, 17] that model the multi-modality of future trajectory prediction. While these models focus on producing accurate future trajectories, they do not explicitly take into account the safety of the trajectories in terms of collisions. As a result, the generated paths might be unsafe, in that they collide with other road users, or even infeasible, i.e., they go through static scene elements like buildings, cars or trees. Consequently, these trajectories cannot easily be trusted in practice.
In this work we argue that one should take safety into account when designing a model to predict future trajectories. The problem is that future trajectories are generated online and cannot be predetermined. Thus, applying standard differentiable supervised learning is not straightforward. Even if one could minimize collisions with static elements and the ground truth trajectories, the model must learn to minimize collisions with the generated, and therefore unseen, future trajectories. Thus, in the absence of a clear learning objective, we rely on Generative Adversarial Networks (GANs) for the generative modelling and cast the learning within the Inverse Reinforcement Learning framework. Inverse Reinforcement Learning guides the model towards optimizing an unknown reward function and shares many similarities with GANs. Here, our unknown reward function accounts for learning safe and “collision-free” future trajectories that have not even been generated yet.
This work has three contributions: (a) to the best of our knowledge, we are the first to propose generating “safe” future trajectories that minimize collisions with static and dynamic elements in the scene; (b) the generated trajectories are not only safer but also more accurate, because the model learns to prune out unlikely trajectories and to generalize better. In fact, we show that the few remaining collisions in the trajectories strongly correlate with the ground truth, indicating that the model also learns to anticipate collisions; and (c) we rely on an attention mechanism over spatial relations in the scene to propose a compact representation for modelling the interaction among all agents. The representation, which we refer to as the attentive scene representation, is invariant to the number of people in the scene and is thus scalable.
2 Related work
Future trajectory prediction. Predicting future trajectories of road users has recently gained increased attention in the literature. In their pioneering work, Alahi et al. defined future trajectory prediction as a time series modelling problem and proposed SocialLSTM, which aggregates information of multiple agents in a single recurrent model. While SocialLSTM is a deterministic model, Gupta et al. proposed SocialGAN to learn to generate multi-modal future trajectories. To make sure that novel future trajectories are sampled, the encoder and decoder were trained using a Conditional Generative Adversarial Network [7, 19] learning framework, leading to impressive results.
Extending SocialGAN, SoPhie incorporates semantic visual features extracted by deep networks, combined with an attention mechanism. DESIRE also relies on semantic scene context and introduces a ranking and refinement module to order the randomly sampled trajectories by their likelihood. Further, R2P2 is a non-deterministic model trained with a modified cross-entropy objective to balance between the distribution and the demonstrations. SocialAttention connects multiple RNNs in a graph structure to efficiently deal with the varying number of people in a scene.
Future trajectory prediction models beyond deep learning have also been explored. The Social Force model is one of the first models of pedestrian trajectory prediction that was successfully applied in the real world. Interacting Gaussian Process (GP) based methods model the trajectory of each person as a GP, with interaction potential terms that couple each person’s GP. An extensive array of comparisons based on the TrajNet benchmark is also available.
Here, we adopt a modified Conditional Generative Adversarial Network framework [7, 19] to generate novel future trajectories from observed past ones. Compared to [10, 24, 17], our focus is to generate trajectories that are not just accurate but also lead to minimal collisions and are thus safe. Safe trajectories are different from trajectories that try to imitate the ground truth, as the latter may lead to implausible paths, e.g., pedestrians going through walls.
Generative Adversarial Networks + Reinforcement Learning. To model safe trajectories we propose a model that combines Generative Adversarial Networks (GANs) and Inverse Reinforcement Learning. GANs are generative models that rely on the min-max optimization of two competing sub-models, the generator and the discriminator. As standard GANs do not have an encoder, Conditional GANs [7, 19] include one, which can be used to generate novel samples given observations. Finn et al. show that GANs resemble Inverse Reinforcement Learning in that the discriminator learns the cost function and the generator represents the policy. Inspired by [4, 9], who showed the potential of using external simulators as critic objectives, we propose to use a critic network to model collision minimization by Inverse Reinforcement Learning. To prevent mode collapse, an additional auto-encoding loss is considered.
Our goal is a model that predicts accurate, diverse and safe, i.e., collision-free, future trajectories for all relevant road users in a driving scene. We consider the past trajectories of all road users participating in the scene jointly. Each trajectory is a sequence of vectors over the past consecutive time steps, and each element is a 2D coordinate displacement vector for one agent. The ground truth future trajectories and the predicted trajectories likewise contain 2D displacement vectors as elements and cover the time span of the prediction. To reduce notational clutter, we drop subscripts and superscripts when they can be inferred from the context.
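Because trajectories are expressed as displacement vectors rather than absolute coordinates, it can help to see the conversion in both directions. The following is a minimal sketch; the function names `to_displacements` and `to_positions` are illustrative assumptions and do not come from the paper.

```python
from itertools import accumulate

def to_displacements(positions):
    """Convert absolute 2D positions [(x, y), ...] into per-step displacement vectors."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]

def to_positions(start, displacements):
    """Integrate displacements back into absolute positions, starting from `start`."""
    return list(accumulate(
        displacements,
        lambda p, d: (p[0] + d[0], p[1] + d[1]),
        initial=start,
    ))

track = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.5)]
deltas = to_displacements(track)           # one vector per consecutive pair of steps
restored = to_positions(track[0], deltas)  # recovers the original track
```

A trajectory with n positions yields n - 1 displacements, so the starting position has to be kept separately in order to recover absolute coordinates.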
3.2 Generating Trajectories with SafeCritic
The conditional GAN of SafeCritic comprises two modules, a Generator and a Discriminator, as seen on the left of Figure 1. The generator of the conditional GAN is composed of two sub-networks.
The first sub-network is a parameterized encoder that receives as input the past trajectories of all road users and maps them to a common embedding space. The purpose of the encoder is to aggregate the current state of all road users jointly. The aggregated state of all road users is then passed to the second sub-network, a parameterized decoder, together with the dynamic and static aggregated features detailed in Section 3.4 and a random noise vector for sampling novel trajectories. The purpose of the decoder is to generate novel future coordinates for all road users simultaneously.
The generator is followed by the discriminator, which is implemented as a third parameterized sub-network. The discriminator receives as inputs the past trajectories, as well as the ground truth future trajectories and the predicted trajectories. Being adversaries, the generator and the discriminator are trained in an alternating manner: the generator generates gradually more realistic trajectories, while the discriminator classifies which trajectories are fake, thus motivating the generator to become better.
3.3 Critic for Obtaining Safe Future Trajectories
While the generator and the discriminator modules are trained to generate plausible future trajectories, they make no provisions to avoid possible collisions in the predicted future trajectories. To this end, we incorporate a Critic module responsible for minimizing the static and dynamic future collisions, as shown on the right of Figure 3.
A natural objective for the Critic module is a supervised loss that minimizes the number of future dynamic and static collisions. Unfortunately, integrating such an objective into the model is not straightforward. The number of non-colliding trajectories in existing data sets is overwhelmingly higher than the number of colliding ones. Moreover, having a generative model means that the generated trajectories will not necessarily imitate the ground truth ones.
To this end, we rely on the off-policy actor-critic framework to define the objective of the Critic. The actor is the decoder of the trajectory generator, and its action is a displacement. The Critic is then a parameterized sub-network that predicts how likely an imminent collision is, given the state and action of the agent. The action value is the score for the future action causing a collision. Another choice would have been to simply return the reward signal to the generator. However, in that case we would need to rely on the high-variance gradients returned by REINFORCE.
Generator. For training SafeCritic, we must take into account the opposing goals of the generator and the discriminator, as well as the regularizing effect of the Critic. Following standard GAN procedure, we arrive at a min-max training objective for the generator and the discriminator. As the discriminator quickly learns to tell the real and generated samples apart, training the generator is harder than training the discriminator, leading to slow convergence or even failure.
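To make the min-max objective concrete, the sketch below writes out the two sides of the standard GAN objective for a single real and a single generated sample. It assumes the discriminator outputs a probability in (0, 1); the function names are illustrative, and the exact objective used for SafeCritic may differ.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) towards 1 and D(fake) towards 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Min-max form: the generator minimizes log(1 - D(fake))."""
    return math.log(1.0 - d_fake)

# A discriminator that separates real from generated samples has a low loss ...
confident = discriminator_loss(d_real=0.9, d_fake=0.1)
# ... while one that cannot tell them apart has a higher loss.
confused = discriminator_loss(d_real=0.5, d_fake=0.5)
```

When the discriminator becomes very confident, `d_fake` approaches 0 and `generator_loss` flattens out near 0, which is one way to see why the generator is the harder of the two to train.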
Inspired by prior work, we introduce an additional auto-encoding loss between the ground truth and the best generated sample. An additional benefit of the auto-encoding loss is that it encourages the model to avoid trivial solutions and mode collapse, as well as to increase the diversity of the generated future trajectories.
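The formula for the auto-encoding loss is not given in the text. One plausible form, assumed here purely for illustration, draws several samples and penalizes only the one closest to the ground truth, so the remaining samples are free to cover other modes. The names `l2_error` and `best_of_k_loss` are hypothetical.

```python
def l2_error(pred, truth):
    """Sum of squared coordinate differences between two trajectories."""
    return sum(
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred, truth)
    )

def best_of_k_loss(samples, truth):
    """Return the smallest L2 error among the sampled trajectories (assumed form)."""
    return min(l2_error(s, truth) for s in samples)

truth = [(1.0, 0.0), (2.0, 0.0)]
samples = [
    [(1.0, 1.0), (2.0, 2.0)],   # drifts away from the ground truth
    [(1.0, 0.0), (2.0, 1.0)],   # stays close to the ground truth
]
loss = best_of_k_loss(samples, truth)  # only the closer sample is penalized
```

Because the gradient flows only through the best sample, the other samples are not pulled towards a single mean trajectory, which is consistent with the diversity argument above.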
Critic. For the training of the Critic we employ a version of the Deep Deterministic Policy Gradient (DDPG) algorithm, more specifically the DDPG variant that uses a sequence of observations. Regarding the training objective of the Critic sub-network, we simply opt for minimizing the squared error between the collision score predicted by the Critic and a collision indicator. The indicator marks the presence or absence of a collision and can be computed by any collision checking algorithm. It is important to note that, following [1, 24, 10], all future predictions by the generator and the Critic are performed sequentially, one step at a time. The Critic takes as input the generated trajectory up to the current time step, while the collision checker computes the indicator from the generated trajectory at the next time step.
The maximum reward is attained when the future trajectories cause the minimum number of collisions. In the end, the Critic sub-network acts as a regularizer to the rest of the network, guiding the generator to generate novel trajectories that cause minimal collisions.
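The timing of the Critic objective, where a score at the current step is compared with a collision at the next step, can be sketched as follows. The distance-based checker `collides`, its 0.10 threshold and the hand-written scores are assumptions for illustration; in the model the scores would come from the Critic sub-network, and any collision checking algorithm could be substituted.

```python
import math

def collides(pos_a, pos_b, threshold=0.10):
    """Hypothetical collision checker: two agents collide when closer than `threshold`."""
    return math.dist(pos_a, pos_b) < threshold

def critic_loss(scores, agent_traj, other_traj, threshold=0.10):
    """Mean squared error between the score at step t and the collision at step t + 1."""
    errors = []
    for t, score in enumerate(scores):
        hit = collides(agent_traj[t + 1], other_traj[t + 1], threshold)
        target = 1.0 if hit else 0.0
        errors.append((score - target) ** 2)
    return sum(errors) / len(errors)

agent = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
other = [(2.0, 0.0), (1.5, 0.0), (1.05, 0.0)]  # the two agents meet at the last step
scores = [0.0, 1.0]  # stand-in Critic outputs for steps 0 and 1
loss = critic_loss(scores, agent, other)
```

Here the stand-in scores anticipate the collision exactly, so the loss is zero; swapping the two scores yields the largest possible error for this pair of trajectories.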
3.4 Attentive Scene Representation
To generate novel, accurate, diverse and safe future trajectories, a good representation of the scene as well as the rest of the agents in the scene is required. What is more, this representation must attend to the relevant elements in the vicinity of every agent.
To this end, we propose the following tensor representations for capturing the possible dynamic interactions between the agents themselves, as well as the static interactions between the agents and the environment; see Fig. 1. Both representations follow the same design. We center a grid around each agent, with indices running over the height and width of the grid. We then check whether a given grid location is occupied by either another agent or a static element. If the grid location is occupied by another agent, we set the respective entry of the dynamic representation to the hidden state of that agent. Similarly, for static features, if the grid location is occupied by a static element, we set the respective entry of the static representation to the one-hot vector representation of that static element.
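The grid construction described above can be sketched in a few lines. The grid size, cell width and hidden-state dimension below are assumed values, and `build_grid` is a hypothetical helper; the static variant would store one-hot vectors for scene elements instead of hidden states.

```python
def build_grid(center, neighbours, size=4, cell=1.0, dim=2):
    """Place each neighbour's hidden state into a size x size grid centred on `center`.

    `neighbours` maps a 2D position to a hidden-state vector of length `dim`.
    Empty cells stay zero and neighbours outside the grid are ignored. If two
    neighbours share a cell, the later one overwrites the earlier one, which is
    a simplification made for this sketch.
    """
    grid = [[[0.0] * dim for _ in range(size)] for _ in range(size)]
    half = size * cell / 2.0
    for (x, y), hidden in neighbours.items():
        col = int((x - center[0] + half) // cell)
        row = int((y - center[1] + half) // cell)
        if 0 <= row < size and 0 <= col < size:
            grid[row][col] = list(hidden)
    return grid

neighbours = {
    (0.5, 0.5): [0.3, -0.1],   # lands inside the grid
    (9.0, 9.0): [1.0, 1.0],    # too far away, ignored
}
grid = build_grid(center=(0.0, 0.0), neighbours=neighbours)
```

The resulting tensor has a fixed shape regardless of how many agents are present, which is what makes the representation independent of the number of people in the scene.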
Last, we employ an attention mechanism to prioritize certain elements in the latent state representations of each agent. The attention is applied spatially, such that specific coordinates on the grid receive higher importance than others.
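The text does not specify how the spatial attention weights are computed. As one possible instance, the sketch below uses soft dot-product attention over the flattened grid cells; the query vector, the scoring function and the name `spatial_attention` are all assumptions rather than the mechanism used in SafeCritic.

```python
import math

def spatial_attention(cells, query):
    """Soft attention: weight each grid cell by softmax(dot(cell, query))."""
    scores = [sum(c * q for c, q in zip(cell, query)) for cell in cells]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(query)
    context = [
        sum(w * cell[d] for w, cell in zip(weights, cells))
        for d in range(dim)
    ]
    return weights, context

cells = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]  # flattened grid entries
weights, context = spatial_attention(cells, query=[1.0, 0.0])
```

The weights sum to one and the context vector has the same length as a single cell entry, so the attended summary stays compact no matter how many grid cells are occupied.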
For the encoder we opt for a Long Short-Term Memory (LSTM) network. The observed trajectories are first embedded by a multilayer perceptron (MLP) in order to obtain fixed-length vectors from the displacement coordinates of the pedestrians in the scene at all time steps. The LSTM then updates its hidden states sequentially. The last hidden state of the LSTM is concatenated with the static and dynamic attentive representations and the noise input, and then passed to the decoder. For the decoder we also rely on an LSTM network, which does not share weights with the encoder. The LSTM decoder generates the hidden states of each person for the predicted time steps. The final hidden state is converted into a displacement vector by a learnable transformation matrix and added onto the previous prediction. The Critic takes the predictions for an agent, together with the state that agent was in, and assigns a reward for each time step. In this way, it learns to predict the likelihood of a collision at the next time step, which depends on the motion of the agent itself but also on that of the other agents. If there are multiple approaching neighbors, the chances of a collision are higher; since we want to reward avoiding collisions especially in these difficult situations, we do not normalize over the number of agents. We implement the Critic as an LSTM followed by an MLP and train it with the mean squared error on the number of collisions.
We evaluate our method on two state-of-the-art datasets used for future trajectory prediction of pedestrians.
Data. UCY has 3 videos, recorded from a bird’s eye point of view in 2 different outdoor locations. It contains annotated positions for pedestrians only. The average length of a video is 361 seconds and it has approximately 250 people per video. Following prior work, we use the leave-one-out approach during training. The Stanford Drone Dataset (SDD) contains annotated positions of various types of agents that navigate inside a university campus. On average, there are 104 agents per video. We use the standard train and test splits provided by TrajNet.
Last, to evaluate the capacity of the methods to avoid collisions, we measure the number of collisions with dynamic and static elements. The number of collisions is computed with the Kronecker delta function and the indicator function, evaluated over pairs of two different agents.
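Since the formula for the collision count did not survive in the text, the following sketch shows one straightforward way to count dynamic collisions: every pair of distinct agents is compared at every time step against a distance threshold. The function name and the default threshold of 0.10 are assumptions, and collisions with static elements would need an additional obstacle check that is not shown.

```python
import math
from itertools import combinations

def count_collisions(trajectories, threshold=0.10):
    """Count (pair, time step) events where two different agents are within `threshold`."""
    total = 0
    for traj_a, traj_b in combinations(trajectories, 2):
        for pos_a, pos_b in zip(traj_a, traj_b):
            if math.dist(pos_a, pos_b) < threshold:
                total += 1
    return total

scene = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.05, 0.0)],   # comes within 0.05 of the first agent at the last step
    [(5.0, 5.0), (6.0, 6.0)],    # stays far away from everyone
]
collisions = count_collisions(scene)
```

Each unordered pair is visited once, so a single close encounter between two agents at one time step contributes exactly one collision to the count.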
Implementation details. Following standard protocol, we observe trajectories for 3.2 seconds and then predict the next 4.8 seconds. The encoder and decoder LSTMs have an embedding size of 32 each. We use batch normalization and Adam optimization, with separate learning rates for the generator and for the Critic and the discriminator. Following prior work, we bound and normalize distances between agents and the boundary points within a neighborhood.
The baselines further include a linear regressor that estimates its parameters by minimizing the least-squares error, and the Social Force model.
4.1 Evaluating attentive scene representation features
First, we evaluate whether the proposed attentive scene representations allow for better generation of future trajectories. In Table 1 we show results for trajectory generation when we turn the proposed static and dynamic pooling on and off, while also turning off the Critic module to minimize external influences. Following dataset conventions, errors are measured in meters for UCY and in pixels for SDD. We observe that when switching on the proposed attentive scene representations we obtain lower errors in terms of mADE and mFDE for both UCY and SDD. We conclude that the attentive scene representations allow for more accurate future trajectory generation.
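The error metrics are not defined in the text. Assuming that mADE and mFDE denote the minimum average and final displacement errors over a set of sampled trajectories, as is common in this literature, they can be sketched as follows; the helper names are illustrative.

```python
import math

def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all time steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)

def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last time step."""
    return math.dist(pred[-1], truth[-1])

def min_over_samples(metric, samples, truth):
    """Report the lowest error among the sampled trajectories."""
    return min(metric(s, truth) for s in samples)

truth = [(1.0, 0.0), (2.0, 0.0)]
samples = [
    [(1.0, 3.0), (2.0, 4.0)],
    [(1.0, 0.0), (2.0, 1.0)],
]
made = min_over_samples(ade, samples, truth)
mfde = min_over_samples(fde, samples, truth)
```

Both values inherit the units of the input coordinates, which is why the errors are reported in meters for UCY and in pixels for SDD.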
4.2 Evaluating the safety of SafeCritic
Next, we examine how safe the generated trajectories are in terms of the number of collisions caused. We report results in Table 2. A collision is defined with a threshold of 0.10 m between two locations. The videos in SDD are less crowded and there are no people within a proximity of 0.10 m. We make two observations. First, under all settings and data sets SafeCritic generates three times fewer collisions than SocialGAN, SoPhie and a linear regressor. Second, the number of collisions generated by SafeCritic is quite close to the number actually observed in the ground truth trajectories. This is important because collisions may actually happen in real life, and thus a model should also generate collisions when these are likely to occur.
4.3 Evaluating the accuracy of SafeCritic
Next, we examine how accurate the generated future trajectories are. We report results in Table 3 for UCY and Table 4 for SDD. We observe that SafeCritic generates more accurate future trajectories than all other baselines on UCY.
Importantly, SafeCritic comfortably outperforms all methods on the much larger and more complex SDD dataset. We also evaluate the diversity of the generated trajectories in Fig. 2. We conclude that SafeCritic predicts future trajectories that are not only accurate, but also cause far fewer collisions and are more diverse.
4.4 Qualitative results
Last, we present some qualitative results for SafeCritic. In Fig. 2 we plot various generated trajectories at a specific moment in time, examining whether mode collapse is happening. We observe that SafeCritic does not suffer from mode collapse, as also quantified earlier.
Further, in Fig. 3 we show that the generated future trajectories are nicely sampled such that the attended agents and static elements are avoided and no collisions are generated.
Last, in Fig. 4 we show how the predicted locations of SafeCritic do not collide. The second row shows how multiple sampled trajectories of SafeCritic move around a lawn.
In this work we introduced SafeCritic, a model that combines adversarial training with a reinforcement learning based reward network to predict trajectories that are both accurate and safe. A novel way of handling static and dynamic features in both the generator and the reward network makes it possible to discard unsafe trajectories. Our proposed collision metric enables the evaluation of the safety aspect of predicted trajectories. We show that SafeCritic outperforms baseline models both in accuracy and in the number of collisions. Furthermore, our experiments show that a convex combination of a trajectory’s ‘safe’ and ‘real’ aspects balances the two well for the final prediction.
[1] (2016) Social LSTM: human trajectory prediction in crowded spaces.
[2] (1984) Neuron-like adaptive elements that can solve difficult learning control-problems. Behavioural Processes 9 (1).
[3] (2018) An evaluation of trajectory prediction approaches and notes on the TrajNet benchmark. arXiv preprint arXiv:1805.07663.
[4] (2018) MolGAN: an implicit generative model for small molecular graphs. arXiv preprint arXiv:1805.11973.
[5] (2004) Real-time collision detection. CRC Press.
[6] A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. CoRR abs/1611.03852.
[7] Conditional generative adversarial nets for convolutional face generation. Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester 2014 (5), pp. 2.
[8] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
[9] (2017) Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843.
[10] (2018) Social GAN: socially acceptable trajectories with generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455.
[12] (1995) Social force model for pedestrian dynamics. Physical Review E 51 (5), pp. 4282.
[13] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[14] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[15] (2000) Actor-critic algorithms. In Advances in Neural Information Processing Systems, pp. 1008–1014.
[16] (2014) Learning an image-based motion context for multiple people tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3542–3549.
[17] (2017) DESIRE: distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 336–345.
[18] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[19] (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[20] (2000) Algorithms for inverse reinforcement learning. In ICML.
[21] (2018) R2P2: a reparameterized pushforward policy for diverse, precise generative path forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 772–788.
[22] (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In European Conference on Computer Vision, pp. 549–565.
[23] (2018) TrajNet: towards a benchmark for human trajectory prediction. arXiv preprint.
[24] (2018) SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. arXiv preprint arXiv:1806.01482.
[25] (2017) CAR-Net: clairvoyant attentive recurrent network. arXiv preprint arXiv:1711.10061.
[26] (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
[27] (2008) Real-time navigation of independent agents using adaptive roadmaps. In ACM SIGGRAPH 2008 classes, pp. 56.
[28] (1998) Introduction to reinforcement learning. Vol. 135, MIT Press, Cambridge.
[29] (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pp. 1057–1063.
[30] (1984) Temporal credit assignment in reinforcement learning.
[31] (2010) Unfreezing the robot: navigation in dense, interacting crowds. In Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ International Conference on, pp. 797–803.
[32] Social attention: modeling attention in human crowds. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–7.