Forecasting the motion of pedestrians in crowds is essential for autonomous systems like self-driving cars and social robots that will potentially co-exist with humans. To successfully predict how humans navigate in crowds, a forecasting model needs to tackle three crucial challenges:
(1) Modelling social interactions: the model should learn how the trajectory of one person affects another person;
(2) Physically acceptable outputs: the model predictions should be physically acceptable, i.e., not undergo collisions;
(3) Multimodality: given the history, the model needs to be able to output all futures without missing any mode.
The objective of multimodal trajectory forecasting is to learn a generative model over future trajectories. Generative adversarial networks (GANs) [Goodfellow2014GenerativeAN] are a popular choice of generative model for trajectory forecasting as they can effectively capture all possible future modes by mapping samples from a given noise distribution to samples in the real data distribution. Gupta et al. [Gupta2018SocialGS] proposed Social GAN (SGAN), a GAN with social mechanisms, to learn human interactions and output multimodal trajectories. Following the success of SGAN, recent works [Kosaraju2019SocialBiGATMT, Sadeghian2018SoPhieAA, Zhao2019MultiAgentTF, Amirian2019SocialWL] have proposed improved GAN architectures to better model human interactions in crowds. Indeed, these designs have been successful in reducing distance-based metrics on real-world datasets [Kosaraju2019SocialBiGATMT]. However, we discover that they fail to model social interactions, i.e., the models output colliding trajectories.
The failure to output collision-free trajectories can be attributed to the fact that current discriminator designs do not fully model human-human interactions; hence, they are incapable of differentiating real trajectory data from fake data. Only when the discriminator can differentiate real data from fake data is its supervised signal meaningful for teaching the generator. To tackle this issue, we propose two architectural changes to the SGAN design: (1) spatio-temporal interaction modelling to better discriminate between real and generated trajectories; (2) a transformer-based discriminator design to strengthen the sequence modelling capability and better guide the generator training. Equipped with these structural changes, our proposed architecture, SGANv2, learns to better model the underlying etiquette of human motion, as evidenced by reduced collisions.
To further reduce the prediction collisions, SGANv2 leverages the trained discriminator even at test time. In particular, we perform collaborative sampling [Liu2019CollaborativeGS] between the generator and discriminator at test-time to guide the unsafe trajectories sampled from the generator. Additionally, we empirically demonstrate that collaborative sampling not only helps to refine trajectories but also has the potential to prevent mode collapse, a phenomenon where the generator fails to capture all modes in the output distribution.
We empirically validate the efficacy of SGANv2 in outputting socially compliant predictions on both synthetic and real-world trajectory datasets. First, we shed light on the shortcomings of the current metric commonly used to measure the multimodal performance, namely Top-20 ADE/FDE [Gupta2018SocialGS]. Specifically, we demonstrate that a simple predictor that outputs uniformly spaced predictions performs at par with the state-of-the-art methods when evaluated using only Top-20 ADE/FDE. To counter this limitation, we propose an alternate evaluation scheme to better measure the socially-compliant multimodal performance of a model. We demonstrate that SGANv2 outperforms competitive baselines on both synthetic and real-world trajectory datasets under the new evaluation scheme. Finally, we demonstrate the ability of collaborative sampling to prevent mode collapse on the recently released Forking Paths [liang2020garden] dataset. Our main contributions are:
We propose SGANv2, an improved SGAN architecture that incorporates spatio-temporal interaction modelling in both the generator and the discriminator. Moreover, our transformer-based discriminator better guides the learning process of the generator.
We demonstrate the efficacy of collaborative sampling between the generator and discriminator at test-time to reduce prediction collisions and prevent mode collapse in trajectory forecasting.
II Related Work
Human trajectory forecasting in crowds has been an active area of research [SocialForce, Alahi2016SocialLH, Li2020SocialWaGDATIT, Huang2019STGATMS, Mohamed2020SocialSTGCNNAS, Zhu2019StarNetPT, Giuliari2020TransformerNF, Yu2020SpatioTemporalGT, Su2022TrajectoryFB, Zhang2019SRLSTMSR, Kothari2021InterpretableSA, KothariAdversarialLF, Liu2021SocialNC, Daniel2021PECNetAD, saadatnejad_sattack, Liu2022CausalMotionRepresentations] for various applications like autonomous systems [WaymoSafety, UberSafety, Chen2019CrowdRobotIC, Rasouli2020AutonomousVT] and advanced surveillance [Mehran2009AbnormalCB]. In this section, we review model designs that learn social interactions and output socially compliant multimodal outputs. Table I provides a high-level overview of how SGANv2 architecture differs from selected generative model-based designs.
Spatio-temporal interaction modelling. The seminal work of Social LSTM [Alahi2016SocialLH] proposed to learn spatial interactions in a data-driven manner with a novel social pooling layer. Following the success of Social LSTM, various designs of data-driven interaction modules have been proposed [Pfeiffer2017ADM, Shi2019PedestrianTP, Bisagno2018GroupLG, Gupta2018SocialGS, Zhang2019SRLSTMSR, Zhu2019StarNetPT, Ivanovic2018TheTP, Liang2019PeekingIT, Tordeux2019PredictionOP, Ma2016AnAI, Hasan2018MXLSTMMT, Mohamed2020SocialSTGCNNAS, Li2020EvolveGraphHM, Yu2020SpatioTemporalGT] to effectively model interactions in crowds. For a detailed taxonomy on the designs of interaction modules, one can refer to Kothari et al. [Kothari2020HumanTF]. In this work, we highlight the importance of modelling both the spatial and temporal nature of social interactions.
Architectures that model dynamics of entities in spatio-temporal tasks have been well-studied. Structural-RNN [Jain2016StructuralRNNDL], a specialized RNN design, proposed to model dynamics in spatio-temporal tasks like human-object interaction and driver maneuver anticipation. Specific to motion forecasting, several works consider the temporal evolution of spatial human interactions using recurrent mechanisms [Vemula2017SocialAM, Huang2019STGATMS, Li2020EvolveGraphHM], graph convolutional networks [Mohamed2020SocialSTGCNNAS, Sun2020RecursiveSB] as well as transformers [Yu2020SpatioTemporalGT]. However, many recent works advocated performing spatial interaction modelling only at the end of observation [Gupta2018SocialGS, Kosaraju2019SocialBiGATMT], as this strategy did not impact the distance-based metrics and saved computational time. In this work, we study the importance of spatio-temporal interaction modelling from the perspective of reducing the collisions in model outputs.
Multimodal forecasting. Neural networks trained using the L2 loss are condemned to output the average of all possible outcomes. To tackle this, one line of work proposes loss variants [GuzmnRivera2012MultipleCL, Rupprecht2016LearningIA, Makansi2019OvercomingLO, Huang2019STGATMS] capable of handling multiple hypotheses. However, these variants fail to penalize low-quality predictions, e.g., samples that are far away from the ground truth or undergo collisions. Thus, training using these variants can result in high-diversity but low-quality predictions.
Another line of work utilizes generative models [Lee2017DESIREDF, Ivanovic2018TheTP, Gupta2018SocialGS, Amirian2019SocialWL, Kosaraju2019SocialBiGATMT, Huang2021STIGANMP], with Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) being the most popular, to model the future trajectory distribution. VAE models in trajectory forecasting [Lee2017DESIREDF, Ivanovic2018TheTP] employ a loss objective based on different variants of the Euclidean distance. Such a formulation leads to low-quality samples, especially when the predictions are uncertain [Dosovitskiy2016GeneratingIW]. On the other hand, the discriminator of the GAN framework acts as a learned loss function that naturally penalizes low-quality samples under the adversarial training objective, i.e., a penalty is incurred on the generator if a sample does not look real [Goodfellow2014GenerativeAN]. Thus, we choose GANs as our generative model as they can effectively produce diverse and high-quality modes by transforming samples from a noise distribution into samples in the real data distribution.
GANs in trajectory forecasting. SGAN [Gupta2018SocialGS] used an LSTM encoder-decoder with social mechanisms within the GAN framework [goodfellow_generative_2014] to perform multimodal forecasting. Following the success of SGAN, various GAN-based architectures have been proposed to better model multimodality in crowds [Li2019WhichWA, Kosaraju2019SocialBiGATMT, Amirian2019SocialWL] as well as on roads [Roy2019VehicleTP, Jin2022AGS]. Li [Li2019WhichWA] proposed to infer the latent decisions of the agents to model multimodality. Kosaraju et al. [Kosaraju2019SocialBiGATMT] introduced two discriminators: a local discriminator for the local pedestrian trajectories, similar to [Amirian2019SocialWL, Gupta2018SocialGS], and a global discriminator that accounts for the spatial interactions. All these works exhibit two common design choices: (1) they do not perform spatio-temporal interaction modelling within the discriminator, and (2) they utilize a recurrent LSTM-based discriminator.
It is crucial to equip the discriminator with the ability to model spatio-temporal interactions. Therefore, SGANv2 performs spatio-temporal interaction modelling within the discriminator, along with the generator. Transformers [Vaswani2017AttentionIA] have been shown to outperform RNNs in almost all sequence modelling tasks, including trajectory forecasting [Giuliari2020TransformerNF, Yu2020ImprovedOI]. Therefore, we design our discriminator using the transformer and demonstrate that it better guides the generator training. Giuliari et al. [Giuliari2020TransformerNF] do not take into account social interactions leading to high collisions in the outputs. The spatio-temporal transformer design of STAR [Yu2020SpatioTemporalGT] is most closely related to the design of our discriminator. However, as discussed above, their loss training objective can fail to effectively model multimodality. Further, in contrast to previous transformer and GAN-based works, SGANv2 performs test-time refinement that leads to further collision reduction, discussed next.
Test-time refinement refers to the task of refining model predictions at test time. Lee et al. [Lee2017DESIREDF] propose an inverse-optimal-control-based module to refine the predicted trajectories. Sun et al. [Sun2020ReciprocalLN] refine trajectories using a reciprocal network that reconstructs input trajectories given the predictions. However, they rely on the strong assumption that both forward and backward trajectories follow identical rules of human motion. We propose to refine trajectories by performing collaborative sampling between the trained generator and discriminator [Liu2019CollaborativeGS]. This technique provides theoretical guarantees with respect to moving the generator distribution closer to the real distribution.
Mode collapse is the phenomenon where the generator distribution fails to capture all modes of the target distribution. SGAN collapses to a single mode of behavior. Social Ways [Amirian2019DataDrivenCS] utilizes InfoGAN to overcome this issue, albeit on a toy dataset. We empirically show that the collaborative sampling technique in SGANv2 overcomes mode collapse on the more diverse Forking Paths dataset [liang2020garden].
Modelling human trajectories using generative adversarial networks (GANs) has the potential to learn the underlying etiquette of human motion and output realistic multimodal predictions. Indeed, recent GAN-based trajectory forecasting models have been successful in reducing distance-based metrics; however, they suffer from high prediction collisions. In this section, we present SGANv2, an improvement over the SGAN architecture that outputs safety-compliant predictions. At a high level, we propose three structural changes: (1) spatio-temporal interaction modelling within the discriminator and generator to better understand social interactions; (2) a transformer-based discriminator to better guide the generator; (3) a collaborative sampling mechanism between the generator and discriminator to refine colliding trajectories at test time. Our proposed changes are generic and can be employed on top of any existing GAN-based architecture.
III-A Problem Definition
Given a scene, we receive as input the trajectories of all people within the scene, denoted by $\mathbf{X} = \{X_1, X_2, \dots, X_n\}$, where $n$ is the number of people in the scene. The trajectory of a person $i$ is defined as $X_i = (x_i^t, y_i^t)$ for time $t = 1, \dots, T_{obs}$, and the future ground-truth trajectory is defined as $Y_i = (x_i^t, y_i^t)$ for time $t = T_{obs}+1, \dots, T_{pred}$. The objective is to accurately and simultaneously forecast the future trajectories of all people $\hat{\mathbf{Y}} = \{\hat{Y}_1, \dots, \hat{Y}_n\}$, where $\hat{Y}_i$ is used to denote the predicted trajectory of person $i$. The velocity of pedestrian $i$ at time-step $t$ is denoted by $v_i^t$.
III-B Generative Adversarial Networks
GANs consist of two neural networks, namely the generator $G$ and the discriminator $D$, which are trained together in tandem. The objective of $D$ is to correctly identify whether a sample belongs to the real data distribution or is generated by the generator. The objective of $G$ is to produce realistic samples that can fool the discriminator. $G$ takes as input a noise vector $z$ sampled from a given noise distribution $p(z)$ and transforms it into a real-looking sample. $D$ outputs a probability score indicating whether a sample comes from the generator distribution $p_G$ or the real data distribution $p_{data}$. Training GANs is essentially a minimax game between the generator and the discriminator:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].$$
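For illustration, the two sides of this minimax game can be sketched in a few lines of NumPy. The sketch below computes the discriminator loss and the non-saturating generator loss (a common practical variant) directly from the discriminator's probability outputs; it is a toy illustration, not the training code of any specific model:

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """Compute GAN losses from the discriminator's probability outputs.

    d_real: D's scores on real samples (should be pushed towards 1).
    d_fake: D's scores on generated samples (D pushes them towards 0,
            while G tries to push them towards 1).
    """
    # Discriminator: maximize log D(x) + log(1 - D(G(z)))  ->  minimize the negation
    d_loss = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    # Generator (non-saturating variant): maximize log D(G(z))
    g_loss = -np.mean(np.log(d_fake + eps))
    return d_loss, g_loss
```

In practice, the generator and discriminator are updated alternately with these two losses.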
III-C Interaction Modelling Designs
Modelling social interactions is key to outputting safe and accurate future trajectories. In this work, we argue that current works do not sufficiently model interactions between agents within both the generator and the discriminator, leading to a large number of prediction collisions. Here, we differentiate between the notions of performing spatial interaction modelling and performing spatio-temporal interaction modelling. An architectural design is said to perform spatial interaction modelling if it models the interaction between pedestrians at a single time-step only. For instance, SGAN performs spatial interaction modelling within the generator as it encodes the neighbourhood information only once, at the end of the observation. On the other hand, an architectural design is said to perform spatio-temporal interaction modelling if it performs spatial interaction modelling at every time-step (from $t=1$ to $T_{obs}$) and the temporal evolution of the interactions is captured using a sequence encoding mechanism, e.g., an LSTM or a transformer. We empirically demonstrate that spatio-temporal interaction modelling within both the generator and the discriminator is essential to output safer trajectories.
We now describe our proposed model design in detail (see Fig. 2). Our architecture consists of three key components: the Spatial Interaction embedding Module (SIM), the Generator (G), and the Discriminator (D). SIM is responsible for spatial interaction modelling, while G and D perform temporal modelling. Thus, G and D in conjunction with SIM perform spatio-temporal interaction modelling (STIM). In particular, SIM performs motion embedding and spatial interaction embedding for each pedestrian at each time-step. G encodes the embedded sequence through time and outputs multimodal predictions using an LSTM encoder-decoder framework. D, modelled using a transformer [Vaswani2017AttentionIA], takes as input the entire sequence comprising the observed trajectory $X_i$ and the future prediction $\hat{Y}_i$ (or the ground truth $Y_i$), and classifies it as real/fake.
Spatial Interaction Embedding Module. One important characteristic that differentiates human motion forecasting from other sequence prediction tasks is the presence of social interactions: the trajectory of a person is affected by the other people in their vicinity. SIM performs the task of encoding human motion and human-human interactions in the spatial domain at a particular time-step. We embed the velocity of pedestrian $i$ at time $t$ using a single-layer MLP to get the motion embedding vector $e_i^t$, given as:

$$e_i^t = \phi(v_i^t; W_e),$$

where $\phi$ is the embedding function with weights $W_e$.
The design of SIM is flexible and it can utilize any spatial interaction module proposed in the literature (e.g., [Kothari2020HumanTF, Kosaraju2019SocialBiGATMT]). It embeds the spatial configuration of the scene and outputs the interaction embedding $a_i^t$ for pedestrian $i$ at time-step $t$. We then concatenate the motion embedding with the spatial interaction embedding, i.e., $[e_i^t; a_i^t]$, and provide the concatenated embedding to the G (or the D). The input embedding is constructed using the ground-truth observations from $t = 1, \dots, T_{obs}$, and generator predictions from $t = T_{obs}+1, \dots, T_{pred}$.
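A minimal sketch of this embedding step may help; the dimensions, the ReLU nonlinearity, and the random placeholder standing in for a SIM output below are illustrative assumptions, not the exact configuration used in SGANv2:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_embed(v, W, b):
    """Single-layer MLP motion embedding e_i^t = phi(v_i^t; W_e), ReLU assumed."""
    return np.maximum(0.0, v @ W + b)

# hypothetical dimensions: 2-D velocity -> 16-D motion embedding
W_e, b_e = rng.standard_normal((2, 16)), np.zeros(16)

v_it = np.array([0.5, -0.1])          # velocity of pedestrian i at time t
e_motion = mlp_embed(v_it, W_e, b_e)  # motion embedding e_i^t

# placeholder for the interaction embedding a_i^t from any SIM design
e_spatial = rng.standard_normal(16)

# concatenated embedding [e_i^t ; a_i^t] fed to G (or D) at each time-step
e_input = np.concatenate([e_motion, e_spatial])
```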
Generator. Within the generator, the encoder LSTM encodes the input embedding sequence provided by the SIM. The encoder LSTM helps to model the temporal evolution of spatial interactions in the form of the following recurrence:

$$h_i^t = LSTM_{enc}\big(h_i^{t-1}, [e_i^t; a_i^t]; W_{enc}\big),$$

where $h_i^t$ denotes the hidden state of pedestrian $i$ at time $t$, and $W_{enc}$ are the learned weights of the encoder LSTM.
The output of the LSTM encoder for each pedestrian at the end of the observation period represents his/her observed scene representation. Similar to SGAN, we utilize this representation to condition our GAN for prediction. In other words, SGANv2 takes as input noise $z$ and the observed scene representation to produce future trajectories that are conditioned on the past observations. The decoder hidden state of each pedestrian is initialized with the final hidden state of the encoder LSTM. The input noise $z$ is concatenated with the inputs of the decoder LSTM, resulting in the following recurrence for the decoder LSTM:

$$h_i^t = LSTM_{dec}\big(h_i^{t-1}, [e_i^t; a_i^t; z]; W_{dec}\big),$$

where $W_{dec}$ are the weights of the decoder LSTM.
The decoder hidden state at time-step $t$ of pedestrian $i$ is then used to predict the next velocity at time-step $t+1$. Similar to Alahi et al. [Alahi2016SocialLH], we model the next velocity as a bivariate Gaussian distribution parametrized by the mean $\mu_i^{t+1}$, standard deviation $\sigma_i^{t+1}$, and correlation coefficient $\rho_i^{t+1}$:

$$[\mu_i^{t+1}, \sigma_i^{t+1}, \rho_i^{t+1}] = \psi(h_i^t; W_{\psi}),$$

where $\psi$ is an MLP and $W_{\psi}$ is learned.
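For concreteness, sampling a next velocity from such a bivariate Gaussian can be sketched as follows (a minimal NumPy illustration; the parameter values are arbitrary):

```python
import numpy as np

def sample_bivariate(mu, sigma, rho, rng):
    """Sample a 2-D velocity from N(mu, Sigma), where Sigma is built from
    the per-axis standard deviations sigma and correlation coefficient rho."""
    cov = np.array([
        [sigma[0] ** 2,               rho * sigma[0] * sigma[1]],
        [rho * sigma[0] * sigma[1],   sigma[1] ** 2],
    ])
    return rng.multivariate_normal(mu, cov)

rng = np.random.default_rng(0)
v_next = sample_bivariate(np.array([1.0, 2.0]), np.array([0.1, 0.2]), 0.5, rng)
```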
Discriminator. The social interactions between humans evolve with time. Therefore, we design our discriminator to perform spatio-temporal interaction modelling. Also, in recent times, transformers [Vaswani2017AttentionIA] have become the de-facto model for modelling temporal sequences, replacing recurrent architectures [Giuliari2020TransformerNF, Yu2020SpatioTemporalGT]. Therefore, we design the discriminator as a transformer to perform the temporal sequence modelling of the output provided by SIM.
The discriminator takes as input $[X_i; Y_i]$ or $[X_i; \hat{Y}_i]$ and classifies it as real/fake. The discriminator has its own SIM, which provides the spatial interaction embedding for each pedestrian at each time-step in the input sequence. Instead of passing them through an LSTM (as in the generator), we stack these embedded vectors together to form an embedded sequence for each pedestrian (similar to the embedded sequence obtained after embedding word tokens in natural language processing [Vaswani2017AttentionIA]):

$$S_i = \big[[e_i^1; a_i^1], [e_i^2; a_i^2], \dots, [e_i^{T_{pred}}; a_i^{T_{pred}}]\big].$$
This sequence is given as input to the encoder of the transformer proposed in [Vaswani2017AttentionIA]. The ability of transformers to capture the temporal correlations within the spatial interaction embeddings lies mainly in the self-attention module. Within the attention module, each element of the sequence is decomposed into a query (Q), key (K) and value (V). The matrix of outputs is computed using the following equation [Vaswani2017AttentionIA]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the SIM embedding. The output of the attention layer is normalized and passed through a feedforward layer to obtain the latent representation of the input sequence, denoted by $H_i$:

$$H_i = \bar{A}_i \, W_f,$$

where the weights $W_f$ are learned and $\bar{A}_i$ denotes the normalized representation of the output of the attention module. We utilize the last element of $H_i$ as the representation of the input sequence. This embedding is scored using an MLP to determine if the sequence is real or fake.
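The attention computation above amounts to a few matrix operations. A minimal NumPy sketch (single head, no masking) is:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: (T, d_k) matrices for a sequence of length T.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (T, T) pairwise similarities
    return softmax(scores, axis=-1) @ V
```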
As mentioned earlier, SGANv2 is a conditional GAN model. It takes as input a noise vector $z$, sampled from $p(z)$, and outputs future trajectories conditioned on the past observations $\mathbf{X}$. We found the least-squares training objective [Mao2017LeastSG] to be effective in training SGANv2:

$$\min_D \; \tfrac{1}{2}\,\mathbb{E}_{Y}\big[(D(X, Y) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z}\big[D(X, G(X, z))^2\big],$$

$$\min_G \; \tfrac{1}{2}\,\mathbb{E}_{z}\big[(D(X, G(X, z)) - 1)^2\big].$$
Additionally, we utilize the variety loss [Gupta2018SocialGS] to further encourage the network to produce diverse samples. For each scene, we generate $k$ output predictions by randomly sampling $z$ and penalize only the prediction closest to the ground truth based on the L2 distance.
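The variety (best-of-$k$) loss can be sketched as follows; this is a minimal NumPy version for a single scene, with illustrative shapes:

```python
import numpy as np

def variety_loss(preds, gt):
    """Variety loss: L2 error of the sampled future closest to the ground
    truth; only that best sample is penalized (and receives a gradient).

    preds: (k, T, 2) array of k sampled future trajectories.
    gt:    (T, 2) ground-truth future trajectory.
    """
    # per-sample average displacement error over the prediction horizon
    errs = np.linalg.norm(preds - gt[None], axis=-1).mean(axis=-1)  # (k,)
    return errs.min()
```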
Following the strategy in [Kothari2020HumanTF], the generator predicts only the trajectory of the pedestrian of interest in each scene and uses the ground-truth future of neighbours during training. During test time, we predict the trajectories of all the pedestrians simultaneously in the scene. All the learnable weights are shared between all pedestrians in the scene.
III-F Collaborative Sampling in GANs
The common practice in GANs is to sample from the generator and discard the discriminator at test time. However, our trained discriminator has knowledge of the social etiquette of human motion. We can utilize this knowledge to refine the bad predictions proposed by the generator. We define a prediction as bad if the pedestrian of interest undergoes a collision in the model prediction. We propose to refine such trajectories by performing collaborative sampling [Liu2019CollaborativeGS] between the generator and the discriminator, as demonstrated in Fig. 3.
Table II (excerpt): Top-3 ADE/FDE, Top-20 ADE/FDE, and collision rate (Col, %) of the Uniform Predictor (UP) on the five ETH/UCY splits (listed in the standard ETH, Hotel, Univ, Zara1, Zara2 order).

| Dataset | Top 3 | Top 20 | Col |
| --- | --- | --- | --- |
| ETH | 1.1/2.2 | 0.6/0.9 | 3.3 |
| Hotel | 0.5/0.9 | 0.2/0.4 | 5.1 |
| Univ | 0.6/1.3 | 0.3/0.6 | 15.7 |
| Zara1 | 0.5/1.0 | 0.3/0.6 | 4.7 |
| Zara2 | 0.4/0.8 | 0.2/0.4 | 7.5 |
To summarize collaborative sampling for the case of trajectory forecasting, our goal is to refine the generator prediction using gradients from the discriminator, without updating the parameters of the generator. We leverage the gradient information provided by the discriminator to continuously refine the generator predictions of the pedestrian of interest through the following iterative update:

$$\hat{Y}^{(m+1)} = \hat{Y}^{(m)} - \lambda \,\nabla_{\hat{Y}^{(m)}} \mathcal{L}_G\big(D(\hat{Y}^{(m)})\big),$$

where $m$ is the iteration number, $\lambda$ is the stepsize, and $\mathcal{L}_G$ is the loss of the generator in Eq. 9. The authors of [Liu2019CollaborativeGS] demonstrate that, under mild assumptions, the above iterative process theoretically shifts the learned generator distribution towards the real distribution. The trajectories are updated until either the discriminator score exceeds a defined threshold or the maximum number of iterations is reached.
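The refinement loop can be sketched as below. The quadratic toy discriminator is purely illustrative (in SGANv2 the trained transformer discriminator supplies the score and its gradient), and the sketch equivalently ascends the discriminator score rather than descending the generator loss:

```python
import numpy as np

def refine(y_hat, d_grad, d_score, step=0.01, max_iters=5, thresh=0.5):
    """Collaborative-sampling-style refinement: nudge the predicted
    trajectory y_hat along the discriminator's gradient until its realism
    score clears a threshold or the iteration budget is spent."""
    y = y_hat.copy()
    for _ in range(max_iters):
        if d_score(y) > thresh:
            break
        y = y + step * d_grad(y)   # ascend the discriminator score
    return y

# toy differentiable "discriminator": scores trajectories higher the
# closer they are to a collision-free target (an assumption for this demo)
target = np.array([1.0, 1.0])
d_score = lambda y: 1.0 / (1.0 + np.sum((y - target) ** 2))
d_grad = lambda y: -2.0 * (y - target) / (1.0 + np.sum((y - target) ** 2)) ** 2
```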
In this section, we highlight the ability of SGANv2 to output socially-compliant multimodal futures. We evaluate the performance of our architecture against several state-of-the-art methods on the ETH/UCY datasets [Pellegrini2009YoullNW, Lerner2007CrowdsBE] and on the interaction-centric TrajNet++ benchmark [Kothari2020HumanTF]. Additionally, we highlight the potential of collaborative sampling to prevent mode collapse on the Forking Paths [liang2020garden] dataset. We evaluate two variants of our model against various baselines:
SGANv2 w/o CS: Our GAN architecture comprising a transformer-based discriminator that performs spatio-temporal interaction modelling.
SGANv2: Our complete GAN architecture in combination with collaborative sampling at test-time.
IV-A Evaluation Metrics
Top-k Average Displacement Error (ADE): the average distance between the ground truth and the closest prediction (out of $k$ samples) over all predicted time steps.
Top-k Final Displacement Error (FDE): the distance between the final destination of the closest prediction (out of $k$ samples) and the ground-truth final destination at the end of the prediction period $T_{pred}$.
Prediction collision (Col) [Kothari2020HumanTF]: the percentage of collisions between the primary pedestrian and the neighbors in the forecasted future scene.
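The displacement metrics can be sketched for a single pedestrian as follows (a minimal NumPy version; the collision check is omitted for brevity):

```python
import numpy as np

def top_k_ade_fde(preds, gt):
    """Top-k ADE/FDE: errors of the sample (out of k) closest to the
    ground truth, where "closest" is measured by average displacement.

    preds: (k, T, 2) array of k sampled future trajectories.
    gt:    (T, 2) ground-truth future trajectory.
    Returns (ADE, FDE) of the best sample.
    """
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # (k, T) per-step errors
    best = dists.mean(axis=-1).argmin()                # best-of-k sample index
    return dists[best].mean(), dists[best, -1]
```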
IV-B Limitations of the Current Multimodal Evaluation Scheme
Current multimodal forecasting works utilize metrics that measure model performance at the individual level, such as the Top-$k$ ADE/FDE [Gupta2018SocialGS, Huang2019STGATMS]. This metric evaluates the quality of the predicted distribution per pedestrian and does not measure the interaction between different pedestrians. Further, the value of $k$ is typically very high ($k=20$ being most common). Almost all recent works [Gupta2018SocialGS, Kosaraju2019SocialBiGATMT, Daniel2021PECNetAD, Huang2019STGATMS, Giuliari2020TransformerNF] in human trajectory forecasting utilize the Top-20 ADE/FDE metric [Gupta2018SocialGS] to quantify multimodal performance. We argue that measuring multimodal performance based solely on this metric can be misleading.
The Top-20 ADE/FDE metric can be easily cheated by predicting a high-entropy distribution that covers all the space but is not precise [Eghbalzadeh2017LikelihoodEF]. We empirically validate this claim by comparing state-of-the-art baselines against a simple hand-crafted uniform predictor (UP). UP takes as input the last observed velocity of each pedestrian and outputs 20 uniformly spread trajectories (see Fig. 4), using the combinations of 5 relative direction profiles (in degrees relative to the current direction of motion) and 4 relative speed profiles (factors of the current speed).
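A sketch of such a uniform predictor is below; the specific angle and speed profiles are illustrative assumptions, as the exact values are not stated here:

```python
import numpy as np

def uniform_predictor(last_vel, horizon=12,
                      angles_deg=(-60, -30, 0, 30, 60),
                      speed_factors=(0.5, 1.0, 1.5, 2.0)):
    """Hand-crafted uniform predictor (UP): 5 relative directions x
    4 relative speeds = 20 constant-velocity futures, relative to the
    pedestrian's last observed velocity.

    Returns (20, horizon, 2) displacements from the last observed position.
    """
    speed = np.linalg.norm(last_vel)
    heading = np.arctan2(last_vel[1], last_vel[0])
    preds = []
    for a in np.deg2rad(angles_deg):
        for f in speed_factors:
            v = f * speed * np.array([np.cos(heading + a), np.sin(heading + a)])
            # constant velocity rolled out over the prediction horizon
            preds.append(np.cumsum(np.tile(v, (horizon, 1)), axis=0))
    return np.stack(preds)
```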
Table II compares the performance of recent state-of-the-art methods [Giuliari2020TransformerNF, Huang2019STGATMS, Mohamed2020SocialSTGCNNAS] and UP on the ETH/UCY datasets. Observing the Top-20 metric only, UP seems to perform better than (or at par with) the state-of-the-art baselines. If we note the prediction collisions, however, it is apparent that UP is not a good multimodal predictor. This corroborates our conjecture that a high-entropy distribution can easily cheat the Top-20 metric, leading to incorrect conclusions.
IV-C Multimodal Evaluation Scheme
To counter the above issues with the current multimodal evaluation strategy, we propose to set $k$ to a lower value in our experiments, as a lower $k$ is a better proxy for likelihood estimation for implicit generative models [Eghbalzadeh2017LikelihoodEF]. Specific to our problem, we will demonstrate that when $k$ is low ($k=3$), the uniform predictor, due to its lack of social interaction modelling, performs poorly compared to interaction-based baselines [Huang2019STGATMS, Mohamed2020SocialSTGCNNAS]. Further, to measure the interaction-modelling capability, we focus on the percentage of collisions between the primary pedestrian and the neighbors in the forecasted future scene.
IV-D Synthetic Experiments
We first demonstrate the efficacy of our proposed architectural changes in SGANv2 compared to other generative model designs in the TrajNet++ synthetic setup. We observe that SGANv2 greatly improves upon the Top-3 ADE/FDE metric with a lower collision rate compared to training a model using only the variety loss (see Table III).
Table III (excerpt): performance on TrajNet++ synthetic.

| Method | Top 3 | Col |
| --- | --- | --- |
| SGANv2 w/o CS [Ours] | 0.2/0.4 | 1.9 |
Table IV (excerpt): Top-3 ADE/FDE and collision rate (Col, %) of SGANv2 w/o CS [Ours] per ETH/UCY split (listed in the standard ETH, Hotel, Univ, Zara1, Zara2 order).

| Dataset | Top 3 | Col |
| --- | --- | --- |
| ETH | 1.0/1.9 | 1.7 |
| Hotel | 0.4/0.7 | 1.4 |
| Univ | 0.6/1.3 | 11.5 |
| Zara1 | 0.4/0.8 | 2.1 |
| Zara2 | 0.3/0.7 | 3.6 |
Next, we utilize the collaborative sampling technique to refine trajectories that undergo collisions at test time. The trained discriminator provides feedback to the colliding samples, which helps to reduce collisions. For each colliding prediction, we perform 5 refinement iterations with stepsize 0.01. We observe that this scheme reduces the collision rate by 70%. The first row of Fig. 5 illustrates the ability of collaborative sampling to refine predictions in the synthetic scenario.
Table V (excerpt): performance on the TrajNet++ real-world benchmark.

| Method | Top 3 | Col |
| --- | --- | --- |
| SGANv2 w/o CS [Ours] | 0.5/1.0 | 3.1 |
IV-E Real-World Experiments
Next, we evaluate the performance of the SGANv2 architecture on the real-world ETH/UCY datasets and the TrajNet++ benchmark. For ETH/UCY, we observe trajectories for 8 time steps (3.2 seconds) and predict the next 12 time steps (4.8 seconds). For TrajNet++, we observe trajectories for 9 time steps (3.6 seconds) and predict the next 12 time steps (4.8 seconds).
Table IV provides the quantitative evaluation of various baselines and state-of-the-art forecasting methods on the ETH/UCY dataset. We observe that SGANv2 outputs safer predictions in comparison to competitive baselines without compromising on prediction accuracy. Our Top-3 ADE/FDE is on par with (if not better than) state-of-the-art methods, while our collision rate is significantly reduced thanks to spatio-temporal interaction modelling. It is further interesting to note that Trajectory Transformer [Giuliari2020TransformerNF] and the simple uniform predictor (UP), which performed best on Top-20 ADE/FDE in Table II, are not among the top-performing methods when evaluated on the stricter Top-3 ADE/FDE. Next, we benchmark on TrajNet++, with its interaction-centric scenes and a standardized evaluator that provides a more objective comparison [Kothari2020HumanTF].
Table V compares SGANv2 against other competitive baselines on the TrajNet++ real-world benchmark. The first part of Table V reports simple baselines and the top-3 official submissions on AICrowd made by different works in the literature [Liu2021SocialNC, Daniel2021PECNetAD, Kothari2020HumanTF]. SGANv2 performs at par with the top-ranked PECNet [Daniel2021PECNetAD] on the Top-3 evaluation while having 3x fewer collisions, demonstrating that spatio-temporal interaction modelling is key to outputting safer trajectories (note that PECNet performs spatial interaction modelling only once, at the end of the observation). Additionally, we utilize the open-source implementations of three additional state-of-the-art methods and evaluate them on the TrajNet++ benchmark. Compared to these competing baselines, SGANv2 improves upon the Top-3 ADE/FDE metric by 10% and the collision metric by 40%.
We perform collaborative sampling to refine trajectories that undergo collisions in the real-world datasets. For each colliding prediction, we perform 5 refinement iterations with stepsize 0.01. We observe that this procedure reduces the collision rate by 30% on both ETH/UCY and TrajNet++. The trained discriminator understands human social interactions and provides feedback to the bad samples, consequently helping to reduce collisions. The second row of Fig. 5 illustrates a few real-world scenarios where collaborative sampling refines generator predictions that undergo collisions. In conclusion, we observe that SGANv2 beats competitive baselines in generating socially compliant trajectories without compromising on the distance-based metrics.
IV-F Ablation: Interaction Modelling
In Table VI, we empirically demonstrate that modelling interactions is the key to reducing prediction collisions. We consider the performance of different variants of our proposed SGANv2 architecture based on the interaction modelling schemes within the generator and discriminator. It is apparent that modelling interaction within both the generator and discriminator is necessary to output safe multimodal trajectories.
Table VI: Ablation of interaction modelling within the generator (G) and discriminator (D); Top-3 ADE/FDE and collision rate (Col, %) on TrajNet++ Synth and TrajNet++ Real.

| G interactions | D interactions | Synth Top 3 | Synth Col | Real Top 3 | Real Col |
| --- | --- | --- | --- | --- | --- |
| ✗ | ✗ | 0.3 / 0.5 | 18.3 | 0.5 / 1.1 | 9.6 |
| ✓ | ✗ | 0.2 / 0.4 | 4.1 | 0.5 / 1.0 | 3.9 |
| ✓ | ✓ | 0.2 / 0.4 | 2.9 | 0.5 / 1.0 | 3.1 |
IV-G Multimodal Analysis
In this final experiment, we demonstrate the potential of collaborative sampling to prevent mode collapse in trajectory generation. We utilize the sample scene 'Zara01' from the Forking Paths dataset. We choose this scene as the multimodal futures in 'Zara01' are affected only by social interactions, not by physical obstacles, making it an ideal test ground for the multimodal performance of forecasting models. In this experiment, we observe trajectories for 8 time steps (3.2 seconds) and predict the next 13 time steps (5.2 seconds).
Fig. 6 qualitatively illustrates the performance of a GAN model trained using the variety loss [Rupprecht2016LearningIA, Gupta2018SocialGS] and other GAN objectives on the chosen scene. As there are 4 dominant modes in the scene, we chose the number of samples for the variety loss accordingly. The model trained using the variety loss (Fig. 5(b)) ends up learning a uniform distribution, i.e., high diversity and low quality: the variety loss penalizes only the sample closest to the ground truth, so there is no penalty on the bad samples during training. SGAN training [Gupta2018SocialGS] (Fig. 5(c)) results in mode collapse, i.e., low diversity and high quality, as standard GAN training is highly unstable. Social Ways [Amirian2019SocialWL] proposed the InfoGAN objective [chen_infogan:_2016] to mitigate the mode collapse issue. InfoGAN improves upon SGAN; however, it still fails to cover all the modes (Fig. 5(d)).
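The failure mode of the variety (best-of-k) loss can be made concrete with a short sketch; array shapes and values are illustrative, and the temporal mean used here is the standard ADE:

```python
import numpy as np

def variety_loss(gt, samples):
    """Best-of-k (variety) loss sketch: compute the average displacement
    error (ADE) of each of the k samples and keep only the minimum, so
    the k-1 samples farther from the ground truth receive no penalty.
    gt: (T, 2) ground-truth future; samples: (k, T, 2) predictions."""
    ade = np.linalg.norm(samples - gt[None], axis=-1).mean(axis=-1)  # (k,)
    return ade.min()

gt = np.zeros((12, 2))
good = np.full((12, 2), 0.1)   # sample close to the ground truth
bad = np.full((12, 2), 5.0)    # a poor sample, yet entirely unpenalised
loss = variety_loss(gt, np.stack([good, bad]))
```

Because the loss is identical with or without the bad sample, a generator trained this way is free to spread its remaining samples over unrealistic regions.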
Empirically, we found that training SGANv2 with the gradient penalty objective (Fig. 5(e)), proposed in [arjovsky_wasserstein_2017], provides better mode coverage than InfoGAN, but the resulting distribution is still not accurate. As shown in Fig. 5(f), our proposed collaborative sampling at test time helps to improve the accuracy of the SGANv2 predictions, recovering modes with low coverage; the trained discriminator guides the generated samples towards these modes. Thus, collaborative sampling is not only effective in refining trajectories at test time, but can also help to prevent mode collapse.
IV-H Key Attributes
We now analyze the performance of the key SGANv2 design choices in the TrajNet++ synthetic setup. In the synthetic setup, we have access to the goals of each agent, allowing us to calculate Distance-to-Goal (Dist2Goal) [Ma2017ForecastingID], defined as the L2 distance between the predicted final destination and the goal of the agent.
Rationale behind Distance to Goal: It is possible that the generator predicts a socially-acceptable mode that does not correspond to the ground-truth mode (see Fig. 7). If we calculate the ADE/FDE with respect to the ground-truth for such a predicted mode (that differs from ground-truth), the numbers will be high, misleading us to incorrectly conclude that the generator did not learn the underlying task of trajectory forecasting. However, if the predicted destination is close to the goal of the agent, then one can assert that a different but socially acceptable mode has been predicted. The Col metric will help to validate that no collisions take place. Thus, Dist2Goal in combination with the Col metric helps to validate that a predicted mode is socially plausible.
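Both evaluation quantities have simple definitions; the sketch below spells them out (the collision threshold value is illustrative, not the TrajNet++ default):

```python
import numpy as np

def dist2goal(pred, goal):
    """Dist2Goal: L2 distance between the predicted final position
    and the agent's goal. pred: (T, 2); goal: (2,)."""
    return float(np.linalg.norm(pred[-1] - goal))

def collides(pred_a, pred_b, threshold=0.1):
    """Col metric sketch: two synchronous trajectories collide if any
    pair of positions at the same time step is closer than `threshold`
    metres (the threshold value here is illustrative)."""
    return bool((np.linalg.norm(pred_a - pred_b, axis=1) < threshold).any())

# A straight-line prediction that walks from (0, 1) to (0, 12).
straight = np.stack([np.zeros(12), np.arange(1.0, 13.0)], axis=1)
```

A prediction whose final position coincides with the goal scores a Dist2Goal of zero even if it followed a different (but socially acceptable) mode than the ground truth.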
Table VII quantifies the performance of various GAN architectures trained without variety loss [Rupprecht2016LearningIA]. SGAN [Gupta2018SocialGS] performs the worst on the Col metric as the discriminator does not perform any interaction modelling, thereby not possessing the ability to learn the concept of collision avoidance. Only if the discriminator learns the collision avoidance property, can we expect it to teach the generator to output collision-free trajectories. The global discriminator of S-BiGAT [Kosaraju2019SocialBiGATMT] performs spatial interaction modelling only once, at the end of prediction. Thus, the global discriminator is able to reason about interactions spatially but cannot model the temporal evolution of the same. SGANv2 equipped with spatio-temporal interaction modelling results in near-zero prediction collision. It is apparent that spatio-temporal interaction modelling within the discriminator plays a significant role in teaching the generator the concept of collision avoidance.
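The distinction between the two discriminator designs can be sketched schematically. In this toy sketch, a mean over agents stands in for the actual spatial interaction module, and a mean over time stands in for the LSTM/transformer sequence encoder; neither is the real architecture.

```python
import numpy as np

def spatial_pool(frame):
    """Stand-in for a spatial interaction module (SIM): summarise all
    agent positions in one frame. frame: (N, 2)."""
    return frame.mean(axis=0)

def global_discriminator_features(scene):
    """Global-discriminator sketch (S-BiGAT-style): interactions are
    pooled only once, at the final prediction step. scene: (T, N, 2)."""
    return spatial_pool(scene[-1])

def spatiotemporal_features(scene):
    """Spatio-temporal sketch (SGANv2-style): interactions are pooled
    at every time step, and the sequence of summaries is then encoded
    (a temporal mean stands in for the sequence encoder)."""
    per_step = np.stack([spatial_pool(f) for f in scene])  # (T, 2)
    return per_step.mean(axis=0)

# Two agents drifting together over three steps: the global variant
# sees only the final configuration, the spatio-temporal variant sees
# the whole evolution.
scene = np.array([[[0., 0.], [0., 0.]],
                  [[1., 0.], [1., 0.]],
                  [[2., 0.], [2., 0.]]])
```

The point of the sketch is that the global variant is blind to how the interaction unfolded over time, which is exactly the information needed to detect an impending collision.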
We now justify the design choice of sequence modelling within the discriminator using the Dist2Goal metric. We compare against an additional variant of our proposed architecture: SGANv2-L, an SGANv2 with an LSTM discriminator. SGANv2-L exhibits stopping behavior, indicated by its high Dist2Goal value on the test set; in other words, SGANv2-L outputs collision-free trajectories, but the predictions fail to move towards the goal of the primary agent. On the other hand, SGANv2 outputs collision-free trajectories with a lower Dist2Goal, almost matching the ground-truth Dist2Goal value. In conclusion, SGANv2 outputs socially acceptable trajectories when compared to other GAN-based designs.
IV-I Computational Time
Speed is crucial for a method to be deployed in real-world settings such as autonomous vehicles, where accurate predictions of pedestrian behavior are required in real time. We compare the inference-time computational cost of our method against baseline unimodal LSTMs with and without interaction modelling. All run times have been benchmarked on a single NVIDIA 2080 Ti GPU. We report the run time per scene, averaged over all the scenes in the TrajNet++ real-world benchmark.
| LSTM | D-LSTM | SGANv2 w/o CS | SGANv2 |
The runtimes of D-LSTM and SGANv2 without collaborative sampling are similar as the multiple future predictions in the latter case can be generated in parallel, albeit at the cost of additional memory complexity. The relatively higher computational time of collaborative sampling corresponds to the sample refinement process based on the gradients from the discriminator. Nevertheless, the absolute computational time of collaborative sampling (77ms per scene) is suitable for real-time applications like autonomous systems.
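As a rough illustration of how such per-scene run times can be measured (function and variable names are placeholders, not our actual benchmark code):

```python
import time

def per_scene_runtime_ms(predict_fn, scenes, warmup=3):
    """Timing sketch: average wall-clock inference time per scene in
    milliseconds, after a few warm-up calls (e.g. to absorb one-off
    GPU kernel-launch and allocation costs)."""
    for scene in scenes[:warmup]:
        predict_fn(scene)           # warm-up passes, not timed
    start = time.perf_counter()
    for scene in scenes:
        predict_fn(scene)
    return 1000.0 * (time.perf_counter() - start) / len(scenes)

# Placeholder model: an identity "prediction" over dummy scenes.
dummy_scenes = [list(range(10)) for _ in range(20)]
elapsed_ms = per_scene_runtime_ms(lambda s: s, dummy_scenes)
```

Using a monotonic high-resolution clock and averaging over many scenes keeps the measurement robust to per-call jitter.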
IV-J Implementation Details
The generator and the discriminator have their own spatial interaction embedding modules (SIM). Each pedestrian has its own encoder and decoder.
The velocity of each pedestrian is embedded into a 16-dimensional vector. The hidden-state dimension of the encoder LSTM and decoder LSTM of the generator is 64. The dimension of the interaction vector of both the generator and the discriminator is fixed to 64. We utilize the Directional-Grid [Kothari2020HumanTF] interaction module with a grid of size and a resolution of meters. For the LSTM discriminator, the hidden-state dimension is set to 64. For the transformer-based discriminator, we stack N=4 encoder layers. The dimensions of the query, key, and value vectors are fixed to 64. The dimension of the feedforward hidden layer within each encoder layer is set to 64. We train using the Adam optimizer [Kingma2015AdamAM] with a learning rate of 0.0003 for the generator and 0.001 for the discriminator for 50 epochs. The ratio of generator steps to discriminator steps for both the LSTM discriminator and the transformer-based discriminator is 2:1. For the synthetic data experiments, we have access to the goals of each pedestrian; the direction to the goal is embedded into a 16-dimensional vector. The batch size is fixed to 32.
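The 2:1 generator-to-discriminator step ratio amounts to the following alternation. This is only a scheduling sketch; the real training loop additionally handles losses, optimizers, and data loading.

```python
def gan_update_schedule(n_iters, g_steps=2, d_steps=1):
    """Sketch of alternating GAN updates with a fixed generator-to-
    discriminator step ratio (2:1 in our setup): cycle through d_steps
    discriminator updates followed by g_steps generator updates."""
    cycle = ['D'] * d_steps + ['G'] * g_steps
    return [cycle[i % len(cycle)] for i in range(n_iters)]
```

For example, six iterations yield the pattern D, G, G, D, G, G, so the generator receives twice as many updates as the discriminator.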
The velocity of each pedestrian is embedded into a 32-dimensional vector. The hidden-state dimension of the encoder LSTM and decoder LSTM of the generator is 128. The dimension of the interaction vector of both the generator and the discriminator is fixed to 256. We utilize the Directional-Grid [Kothari2020HumanTF] interaction module with a grid of size and a resolution of meters. For the LSTM discriminator, the hidden-state dimension is set to 128. The ratio of generator steps to discriminator steps is 2:1. For the transformer-based discriminator, we stack N=2 encoder layers (see Fig. 2 of the main text). The dimensions of the query, key, and value vectors are fixed to 128. The dimension of the feedforward hidden layer within each encoder layer is set to 1024. We train using the Adam optimizer [Kingma2015AdamAM] with a learning rate of 0.001 for both the generator and the discriminator for 25 epochs, with a learning-rate scheduler of step size 10. The batch size is fixed to 32. The weight of the variety loss is set to 0.2.
The velocity of each pedestrian is embedded into a 16-dimensional vector. The hidden-state dimension of the encoder LSTM and decoder LSTM of the generator is 32. We train using the Adam optimizer [Kingma2015AdamAM] with a learning rate of 0.0003 for the generator and 0.001 for the discriminator.
We presented SGANv2, an improved SGAN architecture equipped with two crucial architectural changes in order to output safety-compliant trajectories. First, SGANv2 incorporates spatio-temporal interaction modelling that helps capture the subtle nuances of human interactions. Second, the transformer-based discriminator better guides the generator learning process. Furthermore, the collaborative sampling strategy leverages the trained discriminator at test time to identify and refine the socially-unacceptable trajectories output by the generator. We empirically demonstrated the strength of SGANv2 in reducing model collisions without compromising the distance-based metrics. We additionally highlighted the potential of collaborative sampling to overcome mode collapse in a challenging multimodal scenario.
Our work aims at expanding the current horizon of trajectory forecasting models for real-world applications where human lives are at risk, such as social robots or autonomous vehicles. Accuracy, safety, and robustness are all mandatory. Over the past years, researchers have focused their evaluation on distance-based metrics. Yet, if we compare methods on the safety-critical "collision" metric, we observe differences in performance above 50%. Hence, we believe more focus should be placed on this metric, and that methods should be developed that aim for zero collisions.