I. Introduction
Forecasting the motion of pedestrians in crowds is essential for autonomous systems like self-driving cars and social robots that will potentially coexist with humans. To successfully predict how humans navigate in crowds, a forecasting model needs to tackle three crucial challenges:
(1) Modelling social interactions: the model should learn how the trajectory of one person affects another person;
(2) Physically acceptable outputs: the model predictions should be physically acceptable, i.e., not undergo collisions;
(3) Multimodality: given the history, the model needs to be able to output all futures without missing any mode.
The objective of multimodal trajectory forecasting is to learn a generative model over future trajectories. Generative adversarial networks (GANs) [Goodfellow2014GenerativeAN] are a popular choice of generative model for trajectory forecasting, as they can effectively capture all possible future modes by mapping samples from a given noise distribution to samples in the real data distribution. Gupta et al. [Gupta2018SocialGS] proposed Social GAN (SGAN), a GAN with social mechanisms, to learn human interactions and output multimodal trajectories. Following the success of SGAN, recent works [Kosaraju2019SocialBiGATMT, Sadeghian2018SoPhieAA, Zhao2019MultiAgentTF, Amirian2019SocialWL] have proposed improved GAN architectures to better model human interactions in crowds. Indeed, these designs have been successful in reducing the distance-based metrics on real-world datasets [Kosaraju2019SocialBiGATMT]. However, we discover that they fail to model social interactions, i.e., the models output colliding trajectories.
Method  Generative Model  ST Interaction (G)  Multimodal  ST Interaction (D)  Discriminator  Test-time Refinement
S-LSTM [Alahi2016SocialLH]  –  ✓  ✗  –  –  ✗  
DESIRE [Lee2017DESIREDF]  VAE  ✓  ✓  –  –  ✓  
Trajectron [Ivanovic2018TheTP]  VAE  ✓  ✓  –  –  ✗  
SGAN [Gupta2018SocialGS]  GAN  ✗  ✓  ✗  RNN  ✗  
S-BiGAT [Kosaraju2019SocialBiGATMT]  GAN  ✓  ✓  ✗  RNN  ✗  
SGANv2 [Ours]  GAN  ✓  ✓  ✓  Transformer  ✓ 
The failure to output collision-free trajectories can be attributed to the fact that current discriminator designs do not fully model human-human interactions; hence, they are incapable of differentiating real trajectory data from fake data. Only when the discriminator is capable of differentiating real data from fake data can its supervisory signal meaningfully teach the generator. To tackle this issue, we propose two architectural changes to the SGAN design: (1) spatio-temporal interaction modelling, to better discriminate between real and generated trajectories, and (2) a transformer-based discriminator design, to strengthen the sequence modelling capability and better guide the generator training. Equipped with these structural changes, our proposed architecture, SGANv2, learns to better model the underlying etiquette of human motion, as evidenced by reduced collisions.
To further reduce prediction collisions, SGANv2 leverages the trained discriminator even at test time. In particular, we perform collaborative sampling [Liu2019CollaborativeGS] between the generator and discriminator at test-time to guide the unsafe trajectories sampled from the generator. Additionally, we empirically demonstrate that collaborative sampling not only helps to refine trajectories but also has the potential to prevent mode collapse, a phenomenon where the generator fails to capture all modes of the output distribution.
We empirically validate the efficacy of SGANv2 in outputting socially-compliant predictions on both synthetic and real-world trajectory datasets. First, we shed light on the shortcomings of the metric commonly used to measure multimodal performance, namely Top-20 ADE/FDE [Gupta2018SocialGS]. Specifically, we demonstrate that a simple predictor that outputs uniformly spaced predictions performs on par with state-of-the-art methods when evaluated using only Top-20 ADE/FDE. To counter this limitation, we propose an alternate evaluation scheme to better measure the socially-compliant multimodal performance of a model. We demonstrate that SGANv2 outperforms competitive baselines on both synthetic and real-world trajectory datasets under the new evaluation scheme. Finally, we demonstrate the ability of collaborative sampling to prevent mode collapse on the recently released Forking Paths dataset [liang2020garden]. Our main contributions are:


We propose SGANv2, an improved SGAN architecture that incorporates spatio-temporal interaction modelling in both the generator and the discriminator. Moreover, our transformer-based discriminator better guides the learning process of the generator.

We demonstrate the efficacy of collaborative sampling between the generator and discriminator at test-time to reduce prediction collisions and prevent mode collapse in trajectory forecasting.
II. Related Work
Human trajectory forecasting in crowds has been an active area of research [SocialForce, Alahi2016SocialLH, Li2020SocialWaGDATIT, Huang2019STGATMS, Mohamed2020SocialSTGCNNAS, Zhu2019StarNetPT, Giuliari2020TransformerNF, Yu2020SpatioTemporalGT, Su2022TrajectoryFB, Zhang2019SRLSTMSR, Kothari2021InterpretableSA, KothariAdversarialLF, Liu2021SocialNC, Daniel2021PECNetAD, saadatnejad_sattack, Liu2022CausalMotionRepresentations] for various applications like autonomous systems [WaymoSafety, UberSafety, Chen2019CrowdRobotIC, Rasouli2020AutonomousVT] and advanced surveillance [Mehran2009AbnormalCB]. In this section, we review model designs that learn social interactions and output socially-compliant multimodal predictions. Table I provides a high-level overview of how the SGANv2 architecture differs from selected generative model-based designs.
Spatio-temporal interaction modelling. The seminal work of Social LSTM [Alahi2016SocialLH] proposed to learn spatial interactions in a data-driven manner with a novel social pooling layer. Following its success, various designs of data-driven interaction modules have been proposed [Pfeiffer2017ADM, Shi2019PedestrianTP, Bisagno2018GroupLG, Gupta2018SocialGS, Zhang2019SRLSTMSR, Zhu2019StarNetPT, Ivanovic2018TheTP, Liang2019PeekingIT, Tordeux2019PredictionOP, Ma2016AnAI, Hasan2018MXLSTMMT, Mohamed2020SocialSTGCNNAS, Li2020EvolveGraphHM, Yu2020SpatioTemporalGT] to effectively model interactions in crowds. For a detailed taxonomy of the designs of interaction modules, one can refer to Kothari et al. [Kothari2020HumanTF]. In this work, we highlight the importance of modelling both the spatial and the temporal nature of social interactions.
Architectures that model the dynamics of entities in spatio-temporal tasks have been well studied. Structural-RNN [Jain2016StructuralRNNDL], a specialized RNN design, models dynamics in spatio-temporal tasks like human-object interaction and driver maneuver anticipation. Specific to motion forecasting, several works consider the temporal evolution of spatial human interactions using recurrent mechanisms [Vemula2017SocialAM, Huang2019STGATMS, Li2020EvolveGraphHM], graph convolutional networks [Mohamed2020SocialSTGCNNAS, Sun2020RecursiveSB] as well as transformers [Yu2020SpatioTemporalGT]. However, many recent works advocated performing spatial interaction modelling only at the end of the observation [Gupta2018SocialGS, Kosaraju2019SocialBiGATMT], as this strategy did not impact the distance-based metrics and saved computational time. In this work, we study the importance of spatio-temporal interaction modelling from the perspective of reducing collisions in the model outputs.
Multimodal forecasting. Neural networks trained using the L2 loss are condemned to output the average of all possible outcomes. To tackle this, one line of work proposes loss variants [GuzmnRivera2012MultipleCL, Rupprecht2016LearningIA, Makansi2019OvercomingLO, Huang2019STGATMS] capable of handling multiple hypotheses. However, these variants fail to penalize low-quality predictions, e.g., samples that are far away from the ground truth or that undergo collisions. Thus, training using these variants can result in high-diversity but low-quality predictions.
Another line of work utilizes generative models [Lee2017DESIREDF, Ivanovic2018TheTP, Gupta2018SocialGS, Amirian2019SocialWL, Kosaraju2019SocialBiGATMT, Huang2021STIGANMP], with Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) being the most popular, to model the future trajectory distribution. VAE models in trajectory forecasting [Lee2017DESIREDF, Ivanovic2018TheTP] employ a loss objective based on different variants of the Euclidean distance. Such a formulation leads to low-quality samples, especially when the predictions are uncertain [Dosovitskiy2016GeneratingIW]. On the other hand, the discriminator of the GAN framework acts as a learned loss function that naturally penalizes low-quality samples under the adversarial training objective, i.e., a penalty is incurred on the generator if a sample does not look real [Goodfellow2014GenerativeAN]. Thus, we choose GANs as our generative model, as they can effectively produce diverse and high-quality modes by transforming samples from a noise distribution to samples in the real data distribution.

GANs in trajectory forecasting. SGAN [Gupta2018SocialGS] used an LSTM encoder-decoder with social mechanisms within the GAN framework [goodfellow_generative_2014] to perform multimodal forecasting. Following the success of SGAN, various GAN-based architectures have been proposed to better model multimodality in crowds [Li2019WhichWA, Kosaraju2019SocialBiGATMT, Amirian2019SocialWL] as well as on roads [Roy2019VehicleTP, Jin2022AGS]. Li [Li2019WhichWA] proposed to infer the latent decisions of the agents to model multimodality. Kosaraju et al. [Kosaraju2019SocialBiGATMT] introduced two discriminators: a local discriminator for the local pedestrian trajectories, similar to [Amirian2019SocialWL, Gupta2018SocialGS], and a global discriminator that accounts for the spatial interactions. All these works exhibit two common design choices: (1) they do not perform spatio-temporal interaction modelling within the discriminator, and (2) they utilize a recurrent LSTM-based discriminator.
It is crucial to equip the discriminator with the ability to model spatio-temporal interactions; therefore, SGANv2 performs spatio-temporal interaction modelling within the discriminator as well as the generator. Transformers [Vaswani2017AttentionIA] have been shown to outperform RNNs in almost all sequence modelling tasks, including trajectory forecasting [Giuliari2020TransformerNF, Yu2020ImprovedOI]. We therefore design our discriminator using a transformer and demonstrate that it better guides the generator training. Giuliari et al. [Giuliari2020TransformerNF] do not take social interactions into account, leading to high collisions in their outputs. The spatio-temporal transformer design of STAR [Yu2020SpatioTemporalGT] is most closely related to the design of our discriminator; however, as discussed above, its L2 training objective can fail to effectively model multimodality. Further, in contrast to previous transformer- and GAN-based works, SGANv2 performs test-time refinement that leads to further collision reduction, discussed next.
Test-time refinement refers to the task of refining model predictions at test-time. Lee et al. [Lee2017DESIREDF] propose an inverse-optimal-control-based module to refine the predicted trajectories. Sun et al. [Sun2020ReciprocalLN] refine trajectories using a reciprocal network that reconstructs input trajectories given the predictions. However, they rely on the strong assumption that both forward and backward trajectories follow identical rules of human motion. We propose to refine trajectories by performing collaborative sampling between the trained generator and discriminator [Liu2019CollaborativeGS]. This technique provides theoretical guarantees with respect to moving the generator distribution closer to the real distribution.
Mode collapse is the phenomenon where the generator distribution fails to capture all modes of the target distribution. SGAN collapses to a single mode of behaviour. Social Ways [Amirian2019DataDrivenCS] utilizes InfoGAN to overcome this issue, albeit on a toy dataset. We empirically show that the collaborative sampling technique in SGANv2 overcomes mode collapse on the more diverse Forking Paths dataset [liang2020garden].
III. Method
Modelling human trajectories using generative adversarial networks (GANs) has the potential to learn the underlying etiquette of human motion and output realistic multimodal predictions. Indeed, recent GAN-based trajectory forecasting models have been successful in reducing distance-based metrics; however, they suffer from high prediction collisions. In this section, we present SGANv2, an improvement over the SGAN architecture that outputs safety-compliant predictions. At a high level, we propose three structural changes: (1) spatio-temporal interaction modelling within the discriminator and the generator to better understand social interactions, (2) a transformer-based discriminator to better guide the generator, and (3) a collaborative sampling mechanism between the generator and the discriminator to refine colliding trajectories at test-time. Our proposed changes are generic and can be employed on top of any existing GAN-based architecture.
III-A. Problem Definition
Given a scene, we receive as input the trajectories of all people within the scene, denoted by $X = \{X_1, X_2, \dots, X_n\}$, where $n$ is the number of people in the scene. The trajectory of a person $i$ is defined as $X_i = (x_i^t, y_i^t)$ for time $t = 1, \dots, T_{obs}$, and the ground-truth future trajectory is defined as $Y_i = (x_i^t, y_i^t)$ for time $t = T_{obs}+1, \dots, T_{pred}$. The objective is to accurately and simultaneously forecast the future trajectories of all people, $\hat{Y} = \{\hat{Y}_1, \dots, \hat{Y}_n\}$, where $\hat{Y}_i$ denotes the predicted trajectory of person $i$. The velocity of pedestrian $i$ at timestep $t$ is denoted by $v_i^t$.
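As a concrete illustration of the notation, the inputs and outputs can be organized as arrays (a minimal sketch; the shapes and variable names below are ours, not the paper's):

```python
import numpy as np

# Hypothetical scene: n = 3 pedestrians, observed for T_obs = 8 steps,
# with 12 future steps to predict; each position is an (x, y) pair.
n, T_obs, T_fut = 3, 8, 12
rng = np.random.default_rng(0)

X = rng.normal(size=(n, T_obs, 2))   # observed trajectories X_i, t = 1..T_obs
Y = rng.normal(size=(n, T_fut, 2))   # ground-truth futures Y_i, t = T_obs+1..T_pred

# Velocity of pedestrian i at timestep t: difference of consecutive positions.
V = X[:, 1:] - X[:, :-1]             # shape (n, T_obs - 1, 2)
print(V.shape)
```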
III-B. Generative Adversarial Networks
GANs consist of two neural networks, the generator $G$ and the discriminator $D$, which are trained together in tandem. The objective of $D$ is to correctly identify whether a sample belongs to the real data distribution or is generated by the generator. The objective of $G$ is to produce realistic samples that can fool the discriminator. $G$ takes as input a noise vector $z$ sampled from a given noise distribution $p_z$ and transforms it into a real-looking sample $G(z)$. $D$ outputs a probability score indicating whether a sample comes from the generator distribution $p_g$ or the real data distribution $p_{data}$. Training GANs is essentially a minimax game between the generator and the discriminator:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]. \qquad (1)$$
III-C. Interaction Modelling Designs
Modelling social interactions is the key to outputting safe and accurate future trajectories. In this work, we argue that current works do not sufficiently model interactions between agents within both the generator and the discriminator, leading to a large number of prediction collisions. Here, we differentiate between the notions of spatial interaction modelling and spatio-temporal interaction modelling. An architectural design is said to perform spatial interaction modelling if it models the interactions between pedestrians at a single timestep only. For instance, SGAN performs spatial interaction modelling within the generator, as it encodes the neighbourhood information only once, at the end of the observation. On the other hand, an architectural design is said to perform spatio-temporal interaction modelling if it performs spatial interaction modelling at every timestep (from $t = 1$ to $T_{pred}$) and the temporal evolution of the interactions is captured using a sequence encoding mechanism, e.g., an LSTM or a transformer. We empirically demonstrate that spatio-temporal interaction modelling within both the generator and the discriminator is essential to output safer trajectories.
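The distinction can be sketched in code (a toy mean-offset "pooling" stands in for any real interaction module; all names and sizes are ours):

```python
import numpy as np

def spatial_pool(positions):
    """Toy interaction embedding: each pedestrian's offset from the crowd
    centroid at ONE timestep (a stand-in for any real pooling design)."""
    return positions - positions.mean(axis=0)          # (n, 2)

rng = np.random.default_rng(1)
traj = rng.normal(size=(5, 8, 2))                      # 5 pedestrians, 8 steps

# Spatial interaction modelling (SGAN-style): pool ONCE, at the
# end of the observation period.
spatial_only = spatial_pool(traj[:, -1])               # (5, 2)

# Spatio-temporal interaction modelling: pool at EVERY timestep; a
# sequence encoder (LSTM/transformer) then consumes the per-step embeddings.
spatio_temporal = np.stack(
    [spatial_pool(traj[:, t]) for t in range(8)], axis=1)  # (5, 8, 2)

print(spatial_only.shape, spatio_temporal.shape)
```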
III-D. SGANv2
We now describe our proposed model design in detail (see Fig. 2). Our architecture consists of three key components: the Spatial Interaction embedding Module (SIM), the Generator (G) and the Discriminator (D). SIM is responsible for spatial interaction modelling, while G and D perform temporal modelling. Thus, G and D, in conjunction with SIM, perform spatio-temporal interaction modelling (STIM). In particular, SIM performs motion embedding and spatial interaction embedding for each pedestrian at each timestep. G encodes the embedded sequence through time and outputs multimodal predictions using an LSTM encoder-decoder framework. D, modelled using a transformer [Vaswani2017AttentionIA], takes as input the entire sequence comprising the observed trajectory and the future prediction (or ground truth), and classifies it as real/fake.
Spatial Interaction Embedding Module. One important characteristic that differentiates human motion forecasting from other sequence prediction tasks is the presence of social interactions: the trajectory of a person is affected by the other people in their vicinity. SIM performs the task of encoding human motion and human-human interactions in the spatial domain at a particular timestep. We embed the velocity of pedestrian $i$ at time $t$ using a single-layer MLP to obtain the motion embedding vector $e_i^t$:

$$e_i^t = \phi(v_i^t; W_e), \qquad (2)$$

where $\phi$ is the embedding function with weights $W_e$.
The design of SIM is flexible: it can utilize any spatial interaction module proposed in the literature (e.g., [Kothari2020HumanTF, Kosaraju2019SocialBiGATMT]). It embeds the spatial configuration of the scene and outputs the interaction embedding $a_i^t$ for pedestrian $i$ at timestep $t$. We then concatenate the motion embedding with the spatial interaction embedding, i.e., $c_i^t = [e_i^t; a_i^t]$, and provide the concatenated embedding to the G (or the D). The input embedding is constructed using the ground-truth observations for $t = 1, \dots, T_{obs}$ and the generator predictions for $t = T_{obs}+1, \dots, T_{pred}$.
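A minimal sketch of SIM's per-timestep computation (the embedding sizes and the ReLU non-linearity are our choices, and the interaction embedding is faked with random values rather than produced by a real interaction module):

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, d_a = 16, 32                       # motion / interaction embedding sizes
W_e = rng.normal(scale=0.1, size=(2, d_e))
b_e = np.zeros(d_e)

def motion_embedding(v):
    """Single-layer MLP e_i^t = phi(v_i^t; W_e), here with a ReLU."""
    return np.maximum(0.0, v @ W_e + b_e)

v_t = np.array([0.3, -0.1])             # velocity of pedestrian i at time t
a_t = rng.normal(size=d_a)              # placeholder interaction embedding

e_t = motion_embedding(v_t)
c_t = np.concatenate([e_t, a_t])        # input to G (or D) at timestep t
print(c_t.shape)
```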
Generator. Within the generator, the encoder LSTM encodes the input embedding sequence provided by the SIM. The encoder LSTM models the temporal evolution of the spatial interactions through the following recurrence:

$$h_i^t = \mathrm{LSTM}(h_i^{t-1}, c_i^t; W_{enc}), \qquad (3)$$

where $h_i^t$ denotes the hidden state of pedestrian $i$ at time $t$ and $W_{enc}$ are the learned weights of the encoder LSTM.
The output of the LSTM encoder for each pedestrian at the end of the observation period represents his/her observed scene representation. Similar to SGAN, we utilize this representation to condition our GAN for prediction. In other words, SGANv2 takes as input noise $z$ and the observed scene representation to produce future trajectories that are conditioned on the past observations. The decoder hidden state of each pedestrian is initialized with the final hidden state of the encoder LSTM. The input noise $z$ is concatenated with the inputs of the decoder LSTM, resulting in the following recurrence:

$$s_i^t = \mathrm{LSTM}(s_i^{t-1}, [c_i^t; z]; W_{dec}), \qquad (4)$$

where $W_{dec}$ are the weights of the decoder LSTM.
The decoder hidden state $s_i^t$ of pedestrian $i$ at timestep $t$ is then used to predict the velocity at the next timestep $t+1$. Similar to Alahi et al. [Alahi2016SocialLH], we model the next velocity as a bivariate Gaussian distribution parametrized by the mean $\mu_i^{t+1}$, standard deviation $\sigma_i^{t+1}$ and correlation coefficient $\rho_i^{t+1}$:

$$[\mu_i^{t+1}, \sigma_i^{t+1}, \rho_i^{t+1}] = \psi(s_i^t; W_{out}), \qquad (5)$$

where $\psi$ is an MLP whose weights $W_{out}$ are learned.
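Sampling the next velocity from the predicted bivariate Gaussian can be sketched as follows (the parameter values are illustrative, standing in for the MLP's outputs):

```python
import numpy as np

def sample_velocity(mu, sigma, rho, rng):
    """Draw v^{t+1} from a bivariate Gaussian given mean, std. dev. and
    correlation coefficient (the parametrization of Eq. (5))."""
    cov = np.array([[sigma[0] ** 2,             rho * sigma[0] * sigma[1]],
                    [rho * sigma[0] * sigma[1], sigma[1] ** 2]])
    return rng.multivariate_normal(mu, cov)

rng = np.random.default_rng(0)
mu = np.array([0.5, -0.2])     # predicted mean velocity
sigma = np.array([0.1, 0.1])   # predicted standard deviations
rho = 0.3                      # predicted correlation coefficient
v_next = sample_velocity(mu, sigma, rho, rng)
print(v_next.shape)
```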
Discriminator. The social interactions between humans evolve with time; therefore, we design our discriminator to perform spatio-temporal interaction modelling. Moreover, transformers [Vaswani2017AttentionIA] have become the de facto model for temporal sequences, replacing recurrent architectures [Giuliari2020TransformerNF, Yu2020SpatioTemporalGT]. We therefore design the discriminator as a transformer that performs temporal sequence modelling of the output provided by SIM.
The discriminator takes as input either $[X_i; \hat{Y}_i]$ or $[X_i; Y_i]$ and classifies it as real/fake. The discriminator has its own SIM, which provides the spatial interaction embedding for each pedestrian at each timestep in the input sequence. Instead of passing these through an LSTM (as in the generator), we stack the embedded vectors together to form an embedded sequence for each pedestrian (similar to an embedded sequence obtained after embedding word tokens in natural language processing [Vaswani2017AttentionIA]):

$$C_i = [c_i^1, c_i^2, \dots, c_i^{T_{pred}}]. \qquad (6)$$
This sequence is given as input to the encoder of the transformer proposed in [Vaswani2017AttentionIA]. The ability of the transformer to capture the temporal correlations within the spatial interaction embeddings lies mainly in its self-attention module. Within the attention module, each element of the sequence is decomposed into a query (Q), key (K) and value (V). The matrix of outputs is computed as [Vaswani2017AttentionIA]:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad (7)$$
where $d_k$ is the dimension of the SIM embedding $c_i^t$. The output of the attention layer is normalized and passed through a feed-forward layer to obtain the latent representation of the input sequence, denoted by $L_i$:

$$L_i = \max(0, \bar{A} W_1 + b_1)\, W_2 + b_2, \qquad (8)$$

where the weights $W_1, b_1, W_2, b_2$ are learned and $\bar{A}$ denotes the normalized output of the attention module. We utilize the last element of $L_i$ as the representation of the input sequence. This embedding is scored using an MLP to determine whether the sequence is real or fake.
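The scaled dot-product attention of Eq. (7) can be sketched in NumPy (single head, no masking; the sequence length and embedding dimension are our choices):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T) attention weights
    return weights @ V

rng = np.random.default_rng(0)
T, d_k = 20, 16                                # sequence length, embedding dim
E = rng.normal(size=(T, d_k))                  # embedded sequence C_i
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d_k, d_k)) for _ in range(3))
out = attention(E @ Wq, E @ Wk, E @ Wv)
print(out.shape)
```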
III-E. Training
As mentioned earlier, SGANv2 is a conditional GAN model. It takes as input a noise vector $z$, sampled from $p_z$, and outputs future trajectories conditioned on the past observations $X$. We found the least-squares training objective [Mao2017LeastSG] to be effective in training SGANv2:

$$\mathcal{L}_G = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[(D(X, G(X, z)) - 1)^2\big], \qquad (9)$$

$$\mathcal{L}_D = \tfrac{1}{2}\,\mathbb{E}_{Y \sim p_{data}}\big[(D(X, Y) - 1)^2\big] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[D(X, G(X, z))^2\big]. \qquad (10)$$
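A minimal numerical sketch of these least-squares objectives (plain NumPy; scalar scores stand in for the discriminator's outputs, and all values are illustrative):

```python
import numpy as np

def g_loss(d_fake):
    # Least-squares generator loss: scores on generated futures pushed towards 1.
    return 0.5 * np.mean((d_fake - 1.0) ** 2)

def d_loss(d_real, d_fake):
    # Least-squares discriminator loss: scores on real futures pushed
    # towards 1, scores on generated futures pushed towards 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

d_real = np.array([0.9, 0.8])   # hypothetical discriminator scores on real data
d_fake = np.array([0.2, 0.1])   # hypothetical scores on generated data
print(round(d_loss(d_real, d_fake), 6))   # small: D already separates well
print(round(g_loss(d_fake), 6))           # large: G is still easily caught
```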
Additionally, we utilize the variety loss [Gupta2018SocialGS] to further encourage the network to produce diverse samples. For each scene, we generate $k$ output predictions by randomly sampling $z$ and penalize only the prediction closest to the ground truth in terms of the L2 distance:

$$\mathcal{L}_{variety} = \min_{k} \big\lVert Y_i - \hat{Y}_i^{(k)} \big\rVert_2. \qquad (11)$$
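The variety loss can be sketched as follows (a NumPy toy that uses the per-step average distance as the closeness criterion; shapes and noise scales are ours):

```python
import numpy as np

def variety_loss(y_true, y_preds):
    """Penalize only the closest of k sampled futures (Eq. (11) style).
    y_true: (T, 2) ground truth; y_preds: (k, T, 2) samples for different z."""
    dists = np.linalg.norm(y_preds - y_true, axis=-1).mean(axis=-1)  # (k,) ADEs
    return dists.min()

rng = np.random.default_rng(0)
y_true = rng.normal(size=(12, 2))
# Three samples perturbed with increasing noise: the least-noisy one
# should dominate the loss.
noise = rng.normal(size=(3, 12, 2)) * np.array([0.1, 0.5, 1.0])[:, None, None]
y_preds = y_true + noise
print(variety_loss(y_true, y_preds))
```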
Following the strategy in [Kothari2020HumanTF], the generator predicts only the trajectory of the pedestrian of interest in each scene and uses the groundtruth future of neighbours during training. During test time, we predict the trajectories of all the pedestrians simultaneously in the scene. All the learnable weights are shared between all pedestrians in the scene.
III-F. Collaborative Sampling in GANs
The common practice in GANs is to sample from the generator and discard the discriminator at test time. However, our trained discriminator has knowledge of the social etiquette of human motion. We can utilize this knowledge to refine bad predictions proposed by the generator, where we define a prediction as bad if the pedestrian of interest undergoes a collision in the model prediction. We propose to refine such trajectories by performing collaborative sampling [Liu2019CollaborativeGS] between the generator and the discriminator, as illustrated in Fig. 3.
Model  ETH  HOTEL  UNIV  ZARA1  ZARA2  
Top-3  Top-20  Col  Top-3  Top-20  Col  Top-3  Top-20  Col  Top-3  Top-20  Col  Top-3  Top-20  Col  
Transformer [Giuliari2020TransformerNF]  1.0/1.9  0.6/0.9  5.8  0.5/0.9  0.3/0.5  8.2  2.3/4.2  0.8/1.3  10.9  0.5/1.0  0.3/0.4  7.1  0.4/0.8  0.2/0.3  11.3 
STGAT [Huang2019STGATMS]  0.9/1.8  0.7/1.2  1.7  0.7/1.4  0.5/1.0  4.2  0.6/1.2  0.3/0.7  13.9  0.4/0.9  0.2/0.4  3.9  0.4/0.7  0.2/0.4  6.9 
Social-STGCNN [Mohamed2020SocialSTGCNNAS]  1.0/1.8  0.7/1.2  6.7  0.4/0.8  0.3/0.6  10.4  0.7/1.3  0.5/0.8  25.0  0.5/0.9  0.3/0.5  12.1  0.4/0.8  0.3/0.5  19.4 
Uniform Predictor (UP)  1.1/2.2  0.6/0.9  3.3  0.5/0.9  0.2/0.4  5.1  0.6/1.3  0.3/0.6  15.7  0.5/1.0  0.3/0.6  4.7  0.4/0.8  0.2/0.4  7.5 
SGANv2 [Ours]  1.0/1.9  0.7/1.2  1.0  0.4/0.7  0.3/0.5  1.2  0.6/1.3  0.5/0.8  8.3  0.4/0.8  0.3/0.6  1.3  0.3/0.7  0.3/0.5  2.2 
To summarize collaborative sampling for the case of trajectory forecasting: our goal is to refine the generator prediction using gradients from the discriminator, without updating the parameters of the generator. We leverage the gradient information provided by the discriminator to continuously refine the generator predictions of the pedestrian of interest through the following iterative update:

$$\hat{Y}^{(n+1)} = \hat{Y}^{(n)} - \lambda \, \nabla_{\hat{Y}^{(n)}} \mathcal{L}_G\big(D(X, \hat{Y}^{(n)})\big), \qquad (12)$$

where $n$ is the iteration number, $\lambda$ is the step size, and $\mathcal{L}_G$ is the loss of the generator in Eq. 9. The authors of [Liu2019CollaborativeGS] demonstrate that, under mild assumptions, this iterative process provably shifts the learned generator distribution towards the real distribution. The trajectories are updated until either the discriminator score rises above a defined threshold or the maximum number of iterations is reached.
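The refinement loop can be sketched with a toy differentiable score standing in for the trained discriminator (the actual method descends the generator loss using the discriminator's gradients; here we equivalently ascend a scalar realism score, and every name and value below is illustrative):

```python
import numpy as np

def refine(y_hat, d_score, d_grad, step=0.5, iters=5, threshold=0.9):
    """Collaborative-sampling-style refinement: nudge the prediction along
    the discriminator's gradient until its score exceeds a threshold or
    the iteration budget runs out (generator parameters untouched)."""
    for _ in range(iters):
        if d_score(y_hat) > threshold:
            break
        y_hat = y_hat + step * d_grad(y_hat)   # ascend the realism score
    return y_hat

# Toy 'discriminator': scores a 1-D prediction by closeness to the real mode 1.0.
d_score = lambda y: np.exp(-np.sum((y - 1.0) ** 2))
d_grad = lambda y: -2.0 * (y - 1.0) * d_score(y)   # analytic gradient of score

y0 = np.array([0.0])                     # a 'bad' (unrealistic) prediction
y_refined = refine(y0, d_score, d_grad)
print(d_score(y0), d_score(y_refined))   # refinement raises the realism score
```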
IV. Experiments
In this section, we highlight the ability of SGANv2 to output socially-compliant multimodal futures. We evaluate the performance of our architecture against several state-of-the-art methods on the ETH/UCY datasets [Pellegrini2009YoullNW, Lerner2007CrowdsBE] and on the interaction-centric TrajNet++ benchmark [Kothari2020HumanTF]. Additionally, we highlight the potential of collaborative sampling to prevent mode collapse on the Forking Paths [liang2020garden] dataset. We evaluate two variants of our model against various baselines:

SGANv2 w/o CS: our GAN architecture comprising a transformer-based discriminator that performs spatio-temporal interaction modelling.

SGANv2: our complete GAN architecture in combination with collaborative sampling at test-time.
IV-A. Evaluation Metrics


Top-k Average Displacement Error (ADE): the average distance between the ground truth and the closest prediction (out of $k$ samples) over all predicted time steps.

Top-k Final Displacement Error (FDE): the distance between the final destination of the closest prediction (out of $k$ samples) and the ground-truth final destination at the end of the prediction period $T_{pred}$.

Prediction collision (Col) [Kothari2020HumanTF]: the percentage of collisions between the primary pedestrian and the neighbours in the forecasted future scene.
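These metrics can be sketched as follows (a NumPy toy; the collision radius and the best-of-$k$ convention for FDE are our simplifications, not the benchmark's exact definitions):

```python
import numpy as np

def top_k_ade_fde(y_true, y_preds):
    """Top-k ADE/FDE: errors of the best of k sampled futures.
    y_true: (T, 2); y_preds: (k, T, 2)."""
    d = np.linalg.norm(y_preds - y_true, axis=-1)   # (k, T) per-step distances
    best = d.mean(axis=1).argmin()                  # sample with lowest ADE
    return d[best].mean(), d[best, -1]

def collides(primary, neighbour, radius=0.1):
    """Prediction collision: any predicted step closer than a comfort radius."""
    return bool((np.linalg.norm(primary - neighbour, axis=-1) < radius).any())

y_true = np.zeros((12, 2))
y_preds = np.stack([np.full((12, 2), 0.2), np.full((12, 2), 1.0)])
ade, fde = top_k_ade_fde(y_true, y_preds)
print(round(ade, 3), round(fde, 3))
print(collides(y_true, np.full((12, 2), 0.05)))   # neighbour too close
```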
IV-B. Limitations of the Current Multimodal Evaluation Scheme
Current multimodal forecasting works utilize metrics that measure model performance at the individual level, such as the Top-$k$ ADE/FDE [Gupta2018SocialGS, Huang2019STGATMS]. This metric evaluates the quality of the predicted distribution per pedestrian and does not measure the interactions between different pedestrians. Further, the value of $k$ used is very high ($k=20$ being the most common). Almost all recent works [Gupta2018SocialGS, Kosaraju2019SocialBiGATMT, Daniel2021PECNetAD, Huang2019STGATMS, Giuliari2020TransformerNF] in human trajectory forecasting utilize the Top-20 ADE/FDE metric [Gupta2018SocialGS] to quantify multimodal performance. We argue that measuring multimodal performance based solely on this metric can be misleading.
The Top-20 ADE/FDE metric can be easily cheated by predicting a high-entropy distribution that covers all the space but is not precise [Eghbalzadeh2017LikelihoodEF]. We empirically validate this claim by comparing state-of-the-art baselines against a simple handcrafted uniform predictor (UP). UP takes as input the last observed velocity of each pedestrian and outputs 20 uniformly spread trajectories (see Fig. 4), obtained by combining 5 different relative direction profiles (in degrees relative to the current direction of motion) with 4 different relative speed profiles (factors of the current speed).
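A sketch of such a uniform predictor (the specific heading angles and speed factors below are our guesses for illustration; the paper does not list the exact profiles here):

```python
import numpy as np

def uniform_predictor(last_velocity, t_fut=12,
                      angles_deg=(-60, -30, 0, 30, 60),
                      speed_factors=(0.6, 0.8, 1.0, 1.2)):
    """20 constant-velocity futures: 5 relative headings x 4 relative speeds.
    The angle/speed values are illustrative assumptions, not the paper's."""
    preds = []
    for a in np.deg2rad(angles_deg):
        R = np.array([[np.cos(a), -np.sin(a)],    # 2-D rotation matrix
                      [np.sin(a),  np.cos(a)]])
        for s in speed_factors:
            v = s * (R @ last_velocity)           # rotated, rescaled velocity
            preds.append(np.cumsum(np.tile(v, (t_fut, 1)), axis=0))
    return np.stack(preds)                        # (20, t_fut, 2)

preds = uniform_predictor(np.array([1.0, 0.0]))
print(preds.shape)
```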
Table II compares the performance of recent state-of-the-art methods [Giuliari2020TransformerNF, Huang2019STGATMS, Mohamed2020SocialSTGCNNAS] and UP on the ETH/UCY datasets. Judging by the Top-20 metric alone, UP appears to perform better than (or on par with) the state-of-the-art baselines. However, once we note the prediction collisions, it is apparent that UP is not a good multimodal predictor. This corroborates our conjecture that a high-entropy distribution can easily cheat the Top-20 metric, leading to incorrect conclusions.
IV-C. Multimodal Evaluation Scheme
To counter the above issues with the current multimodal evaluation strategy, we propose to set $k$ to a lower value in our experiments, as a lower $k$ is a better proxy for likelihood estimation for implicit generative models [Eghbalzadeh2017LikelihoodEF]. Specific to our problem, we will demonstrate that when $k$ is low ($k=3$), the uniform predictor, due to its lack of social interaction modelling, performs poorly compared to interaction-based baselines [Huang2019STGATMS, Mohamed2020SocialSTGCNNAS]. Further, to measure the interaction-modelling capability, we focus on the percentage of collisions between the primary pedestrian and the neighbours in the forecasted future scene.

IV-D. Synthetic Experiments
We first demonstrate the efficacy of our proposed architectural changes in SGANv2 compared to other generative model designs in the TrajNet++ synthetic setup. We observe that SGANv2 greatly improves upon the Top-3 ADE/FDE metric with a lower collision rate compared to training a model using only the variety loss (see Table III).
Method  Top-3  Col 
CV* [Schller2020WhatTC]  0.4/1.0  21.1 
LSTM* [Hochreiter1997LongSM]  0.3/0.6  19.0 
S-LSTM* [Kothari2020HumanTF]  0.2/0.5  2.2 
D-LSTM* [Kothari2020HumanTF]  0.2/0.5  2.2 
CVAE [Lee2017DESIREDF]  0.2/0.5  4.6 
WTA [Rupprecht2016LearningIA]  0.2/0.4  2.4 
SGAN [Gupta2018SocialGS]  0.2/0.4  2.8 
SGANv2 w/o CS [Ours]  0.2/0.4  1.9 
SGANv2 [Ours]  0.2/0.4  0.6 
Model  ETH  HOTEL  UNIV  ZARA1  ZARA2  
Top-3  Col  Top-3  Col  Top-3  Col  Top-3  Col  Top-3  Col  
CV* [Schller2020WhatTC]  1.1/2.3  5.3  0.4/0.8  7.2  0.6/1.4  20.3  0.4/1.0  6.0  0.3/0.7  9.6 
LSTM* [Hochreiter1997LongSM]  1.0/2.1  5.8  0.5/0.9  6.7  0.6/1.3  20.2  0.5/1.0  5.2  0.4/0.8  9.5 
Uniform Predictor  1.1/2.2  3.3  0.5/0.9  5.1  0.6/1.3  15.7  0.5/1.0  4.7  0.4/0.8  7.5 
Transformer [Giuliari2020TransformerNF]  1.0/1.9  5.8  0.5/0.9  8.2  2.3/4.2  10.9  0.5/1.0  7.1  0.4/0.8  11.3 
S-LSTM* [Alahi2016SocialLH]  1.1/2.1  2.2  0.5/0.9  2.5  0.7/1.5  11.8  0.4/0.9  2.7  0.4/0.8  3.7 
CVAE [Lee2017DESIREDF]  1.1/2.2  2.8  0.4/0.8  1.5  0.7/1.5  12.6  0.4/0.9  2.6  0.4/0.8  3.5 
WTA [Rupprecht2016LearningIA]  1.0/1.9  2.5  0.4/0.7  2.3  0.6/1.3  12.7  0.4/0.8  2.2  0.3/0.7  4.1 
SGAN [Gupta2018SocialGS]  1.0/2.0  2.2  0.4/0.7  1.7  0.6/1.3  11.8  0.4/0.8  2.3  0.3/0.7  3.2 
STGAT [Huang2019STGATMS]  0.9/1.8  1.7  0.7/1.4  4.2  0.6/1.2  13.9  0.4/0.9  3.9  0.4/0.7  6.9 
Social-STGCNN [Mohamed2020SocialSTGCNNAS]  1.0/1.8  6.7  0.4/0.8  10.4  0.7/1.3  25.0  0.5/0.9  12.1  0.4/0.8  19.4 
S-BiGAT [Kosaraju2019SocialBiGATMT]  1.0/1.9  3.3  0.4/0.7  1.7  0.6/1.3  11.5  0.4/0.8  2.2  0.3/0.7  3.3 
SGANv2 w/o CS [Ours]  1.0/1.9  1.7  0.4/0.7  1.4  0.6/1.3  11.5  0.4/0.8  2.1  0.3/0.7  3.6 
SGANv2 [Ours]  1.0/1.9  1.0  0.4/0.7  1.2  0.6/1.3  8.3  0.4/0.8  1.3  0.3/0.7  2.2 
Next, we utilize the collaborative sampling technique to refine trajectories that undergo collisions at test-time. The trained discriminator provides feedback on the colliding samples, which helps to reduce collisions. For each colliding prediction, we perform 5 refinement iterations with step size 0.01. We observe that this scheme greatly reduces the collision rate, by 70%. The first row of Fig. 5 illustrates the ability of collaborative sampling to refine predictions in the synthetic scenario.
Method  Top-3  Col 
CV* [Schller2020WhatTC]  0.6/1.3  10.9 
LSTM* [Hochreiter1997LongSM]  0.5/1.2  9.3 
S-LSTM* [Alahi2016SocialLH]  0.5/1.0  4.9 
D-LSTM* [Kothari2020HumanTF]  0.5/1.1  3.9 
CVAE [Lee2017DESIREDF]  0.5/1.1  3.9 
WTA [Rupprecht2016LearningIA]  0.5/1.0  3.5 
SGAN [Gupta2018SocialGS]  0.5/1.0  3.5 
SNCE [Liu2021SocialNC]  0.5/1.1  4.0 
PECNet [Daniel2021PECNetAD]  0.4/0.9  10.7 
Uniform Predictor  0.6/1.2  8.4 
Transformer [Giuliari2020TransformerNF]  0.7/1.3  9.4 
Social-STGCNN [Mohamed2020SocialSTGCNNAS]  0.6/1.1  12.6 
STGAT [Huang2019STGATMS]  0.5/1.1  5.6 
S-BiGAT [Kosaraju2019SocialBiGATMT]  0.5/1.0  3.3 
SGANv2 w/o CS [Ours]  0.5/1.0  3.1 
SGANv2 [Ours]  0.5/1.0  2.3 
IV-E. Real-World Experiments
Next, we evaluate the performance of our SGANv2 architecture on the real-world ETH/UCY datasets and the TrajNet++ benchmark. For ETH/UCY, we observe trajectories for 8 time steps (3.2 seconds) and predict the next 12 time steps (4.8 seconds). For TrajNet++, we observe trajectories for 9 time steps (3.6 seconds) and predict the next 12 time steps (4.8 seconds).
Table IV provides a quantitative evaluation of various baselines and state-of-the-art forecasting methods on the ETH/UCY datasets. We observe that SGANv2 outputs safer predictions than competitive baselines without compromising prediction accuracy: our Top-3 ADE/FDE is on par with (if not better than) state-of-the-art methods, while our collision rate is significantly reduced thanks to spatio-temporal interaction modelling. It is further interesting to note that the Trajectory Transformer [Giuliari2020TransformerNF] and the simple uniform predictor (UP), which performed best on Top-20 ADE/FDE in Table II, are not among the top-performing methods when evaluated on the stricter Top-3 ADE/FDE. Next, we benchmark on TrajNet++, whose interaction-centric scenes and standardized evaluator provide a more objective comparison [Kothari2020HumanTF].
Table V compares SGANv2 against other competitive baselines on the TrajNet++ real-world benchmark. The first part of Table V reports simple baselines and the top-3 official submissions on AICrowd made by different works in the literature [Liu2021SocialNC, Daniel2021PECNetAD, Kothari2020HumanTF]. SGANv2 performs on par with the top-ranked PECNet [Daniel2021PECNetAD] on the Top-3 evaluation while having 3x fewer collisions, demonstrating that spatiotemporal interaction modelling is key to outputting safer trajectories (PECNet performs spatial interaction modelling only once, at the end of observation). Additionally, we utilize the open-source implementations of three additional state-of-the-art methods and evaluate them on the TrajNet++ benchmark. Compared to these competing baselines, SGANv2 improves upon the Top-3 ADE/FDE metric by 10% and the collision metric by 40%.

We perform collaborative sampling to refine trajectories that undergo collision in the real-world datasets. For each colliding prediction, we perform 5 refinement iterations with step-size 0.01. We observe that this procedure reduces the collision rate by 30% on both ETH/UCY and TrajNet++. The trained discriminator understands human social interactions and provides feedback on the bad samples, consequently helping to reduce collisions. The second row of Fig. 5 illustrates a few real-world scenarios where collaborative sampling refines generator predictions that undergo collisions. In conclusion, we observe that SGANv2 beats competitive baselines in generating socially-compliant trajectories without compromising on the distance-based metrics.
IV-F Ablation: Interaction Modelling
In Table VI, we empirically demonstrate that modelling interactions is the key to reducing prediction collisions. We consider the performance of different variants of our proposed SGANv2 architecture based on the interaction modelling schemes within the generator and discriminator. It is apparent that modelling interaction within both the generator and discriminator is necessary to output safe multimodal trajectories.
Gen. interaction | Disc. interaction | TrajNet++ Synth Top-3 | Col | TrajNet++ Real Top-3 | Col
✗ | ✗ | 0.3/0.5 | 18.3 | 0.5/1.1 | 9.6
✓ | ✗ | 0.2/0.4 | 4.1 | 0.5/1.0 | 3.9
✓ | ✓ | 0.2/0.4 | 2.9 | 0.5/1.0 | 3.1
IV-G Multimodal Analysis
In this final experiment, we demonstrate the potential of collaborative sampling to prevent mode collapse in trajectory generation. We utilize the sample scene ‘Zara01’ from the Forking Paths dataset. We choose this scene as the multimodal futures of the ‘Zara01’ scene are affected only by social interactions, and not by physical obstacles; it thus forms the ideal test ground to check the multimodal performance of forecasting models. In this experiment, we observe the trajectories for 8 time steps (3.2 seconds) and predict the next 13 time steps (5.2 seconds).
Fig. 6 qualitatively illustrates the performance of a GAN model trained using the variety loss [Rupprecht2016LearningIA, Gupta2018SocialGS] and other GAN objectives on the chosen scene. As there are 4 dominant modes in the scene, we choose the number of variety-loss samples accordingly. The model trained using the variety loss (Fig. 5(b)) ends up learning a uniform distribution, i.e., high diversity and low quality, as there is no penalty on the bad samples during training: the variety loss only penalizes the sample closest to the ground truth. SGAN training [Gupta2018SocialGS] (Fig. 5(c)) results in mode collapse, i.e., low diversity and high quality, as standard GAN training is highly unstable. Social Ways [Amirian2019SocialWL] proposed the InfoGAN objective [chen_infogan:_2016] to mitigate the mode collapse issue. InfoGAN improves upon SGAN; however, it still fails to cover all the modes (Fig. 5(d)). Empirically, we find that training SGANv2 with the gradient penalty objective (Fig. 5(e)), proposed in [arjovsky_wasserstein_2017], provides better mode coverage compared to InfoGAN, but the resulting distribution is still not accurate. As shown in Fig. 5(f), our proposed collaborative sampling at test-time helps to improve the accuracy of the SGANv2 predictions, recovering modes with low coverage: the trained discriminator guides the generated samples to these modes. Thus, collaborative sampling is not only effective in refining trajectories at test time, but can also help to prevent mode collapse.
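The variety loss discussed above can be sketched in a few lines; only the sample closest to the ground truth receives a penalty, which explains why bad samples go unpunished:

```python
import numpy as np

def variety_loss(preds, gt):
    """k-variety loss [Gupta2018SocialGS]: the L2 loss of only the sample
    closest to the ground truth; the other k-1 samples are unpenalized.

    preds: (k, T, 2) sampled trajectories, gt: (T, 2) ground truth
    """
    per_sample = ((preds - gt) ** 2).sum(axis=-1).mean(axis=-1)  # (k,)
    return per_sample.min()  # only the best sample contributes gradient

gt = np.zeros((12, 2))
preds = np.stack([np.ones((12, 2)) * s for s in (0.5, 0.1, 1.0)])
loss = variety_loss(preds, gt)
```

Since the min selects a single sample, gradients never flow to the remaining samples, so nothing discourages them from being physically or socially implausible.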
IV-H Key Attributes
We now analyze the performance of the key SGANv2 design choices in the TrajNet++ synthetic setup. In the synthetic setup, we have access to the goal of each agent, allowing us to calculate the Distance-to-Goal (Dist2Goal) [Ma2017ForecastingID], defined as the L2 distance between the predicted final destination and the goal of the agent.
Rationale behind Distance-to-Goal: it is possible that the generator predicts a socially-acceptable mode that does not correspond to the ground-truth mode (see Fig. 7). If we calculate the ADE/FDE of such a predicted mode with respect to the ground truth, the numbers will be high, misleading us into incorrectly concluding that the generator did not learn the underlying task of trajectory forecasting. However, if the predicted destination is close to the goal of the agent, one can assert that a different but socially-acceptable mode has been predicted, while the Col metric helps to validate that no collisions take place. Thus, Dist2Goal in combination with the Col metric helps to validate that a predicted mode is socially plausible.
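The two metrics combine naturally in code. A minimal sketch (our function names; the collision threshold of 0.1 m here is illustrative, not the benchmark's exact value):

```python
import numpy as np

def dist2goal(pred, goal):
    """L2 distance between the predicted final destination and the goal."""
    return float(np.linalg.norm(pred[-1] - goal))

def collision_free(pred_a, pred_b, threshold=0.1):
    """True if two predicted trajectories never come within `threshold`
    metres of each other at any common time step (illustrative threshold)."""
    return bool((np.linalg.norm(pred_a - pred_b, axis=-1) > threshold).all())

pred = np.array([[0.0, 0.0], [3.0, 4.0]])
goal = np.array([0.0, 0.0])
a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[1.0, 0.0], [2.0, 1.0]])
```

A low `dist2goal` together with `collision_free` holding against all neighbours indicates a socially plausible mode, even when ADE/FDE against the single ground-truth mode is high.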
Table VII quantifies the performance of various GAN architectures trained without the variety loss [Rupprecht2016LearningIA]. SGAN [Gupta2018SocialGS] performs the worst on the Col metric as its discriminator does not perform any interaction modelling, and therefore cannot learn the concept of collision avoidance. Only if the discriminator learns the collision-avoidance property can we expect it to teach the generator to output collision-free trajectories. The global discriminator of SBiGAT [Kosaraju2019SocialBiGATMT] performs spatial interaction modelling only once, at the end of prediction; it can thus reason about interactions spatially but cannot model their temporal evolution. SGANv2, equipped with spatiotemporal interaction modelling, results in near-zero prediction collisions. It is apparent that spatiotemporal interaction modelling within the discriminator plays a significant role in teaching the generator the concept of collision avoidance.
We now justify the choice of sequence model within the discriminator using the Dist2Goal metric. We compare an additional variant of our proposed architecture: SGANv2-L, an SGANv2 with an LSTM discriminator. SGANv2-L shows stopping behavior, indicated by its high Dist2Goal value on the test set; in other words, it outputs collision-free trajectories, but the predictions fail to move towards the goal of the primary agent. In contrast, SGANv2 outputs collision-free trajectories with a lower Dist2Goal, almost matching the ground-truth Dist2Goal value of 8.6. In conclusion, SGANv2 outputs socially-acceptable trajectories when compared to other GAN-based designs.
Model | Spatiotemporal interaction | Sequence model | Col | Dist2Goal
Ground-truth | – | – | 0.0 | 8.6
SGAN [Gupta2018SocialGS] | ✗ | LSTM | 24.9 | 8.9
SBiGAT [Kosaraju2019SocialBiGATMT] | ✗ | LSTM | 8.4 | 8.9
SGANv2-L | ✓ | LSTM | 0.8 | 8.8
SGANv2 | ✓ | Transformer | 0.2 | 8.6
IV-I Computational Time
Speed is crucial for a method to be deployed in real-world settings like autonomous vehicles, which require accurate predictions of pedestrian behavior in real time. We compare the inference time of our method against baseline unimodal LSTMs with and without interaction modelling. All run times are benchmarked on a single NVIDIA 2080 Ti GPU. We report the run time per scene, averaged over all scenes in the TrajNet++ real-world benchmark.
Model | LSTM | DLSTM | SGANv2 w/o CS | SGANv2
Time per scene | 10 ms | 22 ms | 22 ms | 77 ms
The runtimes of DLSTM and SGANv2 without collaborative sampling are similar, as the multiple future predictions of the latter can be generated in parallel, albeit at the cost of additional memory. The relatively higher computational time with collaborative sampling corresponds to the sample refinement process based on gradients from the discriminator. Nevertheless, the absolute computational time of collaborative sampling (77 ms per scene) remains suitable for real-time applications such as autonomous systems.
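A per-scene timing harness of the kind used above can be sketched as follows (our helper, not the exact benchmarking script; warm-up calls are included since one-off setup costs such as CUDA kernel launches would otherwise skew the average):

```python
import time

def mean_runtime_ms(predict_fn, scenes, warmup=2):
    """Average per-scene inference time in milliseconds.

    predict_fn: callable taking one scene and returning predictions
    scenes:     iterable of scene inputs
    """
    scenes = list(scenes)
    # Warm-up passes: excluded from timing so setup costs are amortized
    for s in scenes[:warmup]:
        predict_fn(s)
    t0 = time.perf_counter()
    for s in scenes:
        predict_fn(s)
    return (time.perf_counter() - t0) / len(scenes) * 1000.0

t = mean_runtime_ms(lambda s: s, range(10))
```

`time.perf_counter` is preferred over `time.time` here because it is monotonic and has the highest available resolution for interval measurement.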
IV-J Implementation Details
The generator and the discriminator each have their own spatial interaction embedding module (SIM). Each pedestrian has their own encoder and decoder.
Synthetic experiments.
The velocity of each pedestrian is embedded into a 16-dimensional vector. The hidden-state dimension of the encoder LSTM and decoder LSTM of the generator is 64. The dimension of the interaction vector of both the generator and discriminator is fixed to 64. We utilize the DirectionalGrid [Kothari2020HumanTF] interaction module with a grid of size and a resolution of meters. For the LSTM discriminator, the hidden-state dimension is set to 64. For the transformer-based discriminator, we stack N=4 encoder layers together. The dimension of the query, key, and value vectors is fixed to 64. The dimension of the feed-forward hidden layer within each encoder layer is set to 64. We train using the ADAM optimizer [Kingma2015AdamAM] with a learning rate of 0.0003 for the generator and 0.001 for the discriminator for 50 epochs. The ratio of generator steps to discriminator steps is 2:1 for both the LSTM and the transformer-based discriminator. For the synthetic data experiments, we have access to the goals of each pedestrian; the direction to the goal is embedded into a 16-dimensional vector. The batch size is fixed to 32.
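The 2:1 generator-to-discriminator step ratio can be realized with a simple alternating schedule. A minimal sketch; only the 2:1 ratio is specified above, while the exact interleaving within a cycle (two generator steps followed by one discriminator step) is our assumption:

```python
def update_schedule(n_steps, gen_per_disc=2):
    """Alternating G/D update order for a gen:disc step ratio of
    `gen_per_disc`:1 (2:1 in our setup). NOTE: the within-cycle ordering
    (G steps first, then one D step) is an illustrative assumption.
    """
    order = []
    while len(order) < n_steps:
        order.extend(['G'] * gen_per_disc + ['D'])
    return order[:n_steps]
```

In the training loop, each 'G' entry would trigger one generator update with its own optimizer (learning rate 0.0003) and each 'D' entry one discriminator update (learning rate 0.001).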
Real-world experiments.
The velocity of each pedestrian is embedded into a 32-dimensional vector. The hidden-state dimension of the encoder LSTM and decoder LSTM of the generator is 128. The dimension of the interaction vector of both the generator and discriminator is fixed to 256. We utilize the DirectionalGrid [Kothari2020HumanTF] interaction module with a grid of size and a resolution of meters. For the LSTM discriminator, the hidden-state dimension is set to 128. The ratio of generator steps to discriminator steps is 2:1. For the transformer-based discriminator, we stack N=2 encoder layers together (see Fig. 2 of the main text). The dimension of the query, key, and value vectors is fixed to 128. The dimension of the feed-forward hidden layer within each encoder layer is set to 1024. We train using the ADAM optimizer [Kingma2015AdamAM] with a learning rate of 0.001 for both the generator and the discriminator for 25 epochs with a learning-rate scheduler of step-size 10. The batch size is fixed to 32. The weight of the variety loss is set to 0.2.
Multimodal Analysis.
The velocity of each pedestrian is embedded into a 16-dimensional vector. The hidden-state dimension of the encoder LSTM and decoder LSTM of the generator is 32. We train using the ADAM optimizer [Kingma2015AdamAM] with a learning rate of 0.0003 for the generator and 0.001 for the discriminator.
V Conclusion
We presented SGANv2, an improved SGAN architecture equipped with two crucial architectural changes in order to output safety-compliant trajectories. First, SGANv2 incorporates spatiotemporal interaction modelling that helps to capture the subtle nuances of human interactions. Second, the transformer-based discriminator better guides the generator learning process. Furthermore, the collaborative sampling strategy leverages the trained discriminator at test-time to identify and refine the socially-unacceptable trajectories output by the generator. We empirically demonstrated the strength of SGANv2 in reducing model collisions without compromising the distance-based metrics. We additionally highlighted the potential of collaborative sampling to overcome mode collapse in a challenging multimodal scenario.
Our work aims at expanding the current horizon of trajectory forecasting models for real-world applications where human lives are at risk, such as social robots or autonomous vehicles. Accuracy, safety, and robustness are all mandatory keywords. Over the past years, researchers have focused their evaluation on distance-based metrics. Yet, if we compare methods on the safety-critical "collision" metric, we observe differences in performance above 50%. Hence, we believe that one should focus more on this metric and develop methods that aim for zero collisions.