Human trajectory prediction plays a crucial role in human-robot interaction systems such as self-driving vehicles and social robots, since humans are omnipresent in their environments. Although significant progress has been achieved over the past few years [salzmann2020trajectron++, mangalam2020not, YuMa2020Spatio, dendorfer2021mg, mangalam2021goals, Pang_2021_CVPR, sun2021three, zhao2021you], predicting the future trajectories of pedestrians remains challenging due to the multi-modality of human motion.
The future trajectories of pedestrians are full of indeterminacy, because humans can change their future motion at will or adjust their movement direction based on the surroundings. Given a history of observed trajectories, there exist many plausible paths that pedestrians could follow in the future. Facing this challenge, most prior research applies generative models that represent multi-modality with a latent variable. For instance, some methods [gupta2018social, Fang_2020_CVPR, kosaraju2019social, Sun_2020_CVPR, sadeghian2019sophie, zhao2019multi, dendorfer2021mg] utilize generative adversarial networks (GANs) to spread the distribution over all possible future trajectories, while other methods [salzmann2020trajectron++, ivanovic2019trajectron, lee2017desire, Chen_2021_ICCV, tang2019multiple, Liu_2021_ICCV] exploit the conditional variational auto-encoder (CVAE) to encode the multi-modal distribution of future trajectories. Despite the remarkable progress, these methods still face inherent limitations, e.g., the training process can be unstable for GANs due to adversarial learning, and CVAEs tend to produce unnatural trajectories.
In this paper, we propose a new trajectory prediction framework, called motion indeterminacy diffusion (MID), to model the indeterminacy of human behavior. Inspired by non-equilibrium thermodynamics, we consider the future positions as particles in a thermodynamic system. The particles (positions) gather and deform into a clear trajectory under low indeterminacy, while they stochastically spread over all walkable areas under high indeterminacy. The process of particles evolving from low indeterminacy to high indeterminacy is defined as the diffusion process. This process can be simulated by gradually adding noise to the trajectory until the path is corrupted into Gaussian noise. The goal of our MID is to reverse this diffusion process by progressively discarding indeterminacy and converting the ambiguous prediction region into a deterministic trajectory. We illustrate the reverse diffusion process of motion indeterminacy in Figure 1. Contrary to other stochastic prediction methods that add a latent noise variable to the trajectory feature to obtain indeterminacy, we explicitly simulate the variation of motion indeterminacy. Our MID learns a Markov chain with parameterized Gaussian transitions to model this reverse diffusion process, and trains it using variational inference conditioned on the observed trajectories. By choosing different lengths of the chain, we can obtain predictions with a flexible degree of indeterminacy that adapts to dynamic environments. Moreover, our method is easier to train than GANs and is capable of producing higher-quality samples than CVAEs.
To be more specific, we encode the history human trajectories and the social interactions as a state embedding via a spatial-temporal graph network. Then, we exploit this state embedding as the condition in the Markov chain to guide the learning of the reverse diffusion process. To model the temporal dependencies in trajectories, we carefully design a Transformer-based architecture as the core network of the MID framework. During training, we optimize the model with the variational lower bound, and during inference, we sample reasonable trajectories by progressively denoising from a noise distribution. Extensive experiments demonstrate that our method accurately forecasts reasonable future trajectories with multi-modality, achieving state-of-the-art results on the Stanford Drone and ETH/UCY datasets. We summarize the main contributions of this paper as follows:
We present a new stochastic trajectory prediction framework with motion indeterminacy diffusion, which gradually discards the indeterminacy to obtain the desired trajectory from ambiguous walkable areas.
We devise a Transformer-based architecture for the proposed framework to capture the temporal dependencies in trajectories.
The proposed method achieves state-of-the-art performance on widely used human trajectory prediction benchmarks and provides a potential direction for balancing the diversity and accuracy of predictions.
2 Related Work
Pedestrian Trajectory Prediction:
Given the observed paths, a human trajectory forecasting system aims to estimate the future positions. Most existing methods formulate trajectory forecasting as a sequential prediction problem and focus on modeling the complex social interactions. For instance, Social Forces [helbing1995social]
introduces attractive and repulsive forces to model human interactions. With the success of deep learning, many methods design ingenious networks to model social interactions. For example, Social-LSTM [alahi2016social] devises a social pooling layer to aggregate the interaction information of neighborhoods. Some methods apply attention models [fernando2018soft+, vemula2018social, sadeghian2019sophie, zhang2019sr, kosaraju2019social] to explore the key interactions in the crowd. In addition, spatial-temporal graph models are applied to jointly model temporal clues and social interactions [huang2019stgat, ivanovic2019trajectron, mohamed2020social, Sun_2020_CVPR2, yu2020spatio, salzmann2020trajectron++]. Beyond social interactions, many methods incorporate physical environment interactions by introducing map images [sadeghian2019sophie, lee2017desire, kosaraju2019social, mangalam2021goals, dendorfer2021mg]. Recently, some methods analyze the effect of social interaction and show that it is biased [chen2021human, makansi2021you].
Stochastic Prediction Model: Due to the inherent indeterminacy of human behavior, many stochastic prediction methods have been proposed to model the multi-modality of future motions. Some methods [gupta2018social, Fang_2020_CVPR, kosaraju2019social, Sun_2020_CVPR, sadeghian2019sophie, zhao2019multi, dendorfer2021mg] employ GANs [goodfellow2014generative] to model the multi-modality with a noise variable, while another line of methods [salzmann2020trajectron++, ivanovic2019trajectron, lee2017desire, Chen_2021_ICCV, tang2019multiple, Liu_2021_ICCV] applies the CVAE [sohn2015learning] instead. Besides, some methods [liang2020simaug, liang2020garden, deo2020trajectory]
propose to learn grid-based location encoders for multi-modal probability prediction. Recently, the goals of pedestrians [mangalam2021goals, zhao2020tnt, mangalam2020not, zhao2021you] have been introduced into trajectory prediction systems as conditions to analyze the probability of multiple plausible endpoints. While remarkable progress has been made, these stochastic prediction methods have some inherent limitations, e.g., unstable training or unnatural trajectories. In this paper, we propose a new stochastic framework with motion indeterminacy diffusion, which formulates the trajectory prediction problem as a process from an ambiguous walkable region to the desired trajectory.
Denoising Diffusion Probabilistic Models: Denoising diffusion probabilistic models (DDPM) [ho2020denoising, sohl2015deep], also known as diffusion models for brevity, are a class of deep generative models inspired by non-equilibrium thermodynamics. They were first proposed by Sohl-Dickstein et al. [sohl2015deep] and have attracted much attention recently due to state-of-the-art performance in various generation tasks, including image generation [ho2020denoising, nichol2021improved, dhariwal2021diffusion, choi2021ilvr], 3D point cloud generation [zhou20213d, luo2021diffusion], and audio generation [kong2020diffwave, chen2020wavegrad, popov2021grad]. Diffusion models generally learn a parameterized Markov chain that gradually denoises from a common original distribution to a specific data distribution. In this paper, we introduce the diffusion model to simulate the variation of indeterminacy for trajectory prediction, and design a Transformer-based architecture for the temporal dependencies of trajectories.
3 Proposed Approach
In this section, we introduce our MID method, which models the stochastic trajectory prediction task by motion indeterminacy diffusion. We first explicitly formulate the indeterminacy variation as a reverse diffusion process. Then we describe how to train this diffusion model using variational inference. Finally, we present the detailed network architecture of our method, shown in Figure 2.
3.1 Problem Formulation
The goal of pedestrian trajectory prediction is to generate plausible future trajectories for pedestrians based on their prior movements. The input of the prediction system is the history trajectories in a scene, $\mathbf{x}^i = \{x^{i,t} \in \mathbb{R}^2 \mid t = 1, \dots, t_{ob}\}$, where $x^{i,t}$ is the 2D location of pedestrian $i$ at timestamp $t$, $t_{ob}$ denotes the length of the observed trajectory, and the current timestamp is $t_{ob}$. Similarly, the predicted future trajectories can be written as $\mathbf{y}^i = \{y^{i,t} \in \mathbb{R}^2 \mid t = t_{ob}+1, \dots, t_{ob}+t_{pred}\}$, where $t_{pred}$ is the prediction horizon. For clarity, we use $\mathbf{x}$ and $\mathbf{y}$ without the superscript $i$ for the history and future trajectory in the following subsections.
3.2 Motion Indeterminacy Diffusion
Due to the indeterminacy of human behavior, one person has multiple plausible paths in the future. Thus, we present a new framework that formulates stochastic trajectory prediction by motion indeterminacy diffusion. Unlike other stochastic prediction methods that add a latent variable to the trajectory feature to obtain indeterminacy, our MID generates the trajectory by gradually reducing the indeterminacy from all walkable areas to the determinate prediction with a parameterized Markov chain.
As shown in Figure 1, given the initial ambiguous region under the noise distribution and the desired trajectory under the data distribution, we define the diffusion process as $\mathbf{y}_0 \rightarrow \mathbf{y}_1 \rightarrow \cdots \rightarrow \mathbf{y}_K$, where $K$ is the maximum number of diffusion steps. This process gradually adds indeterminacy until the ground truth trajectory $\mathbf{y}_0$ is corrupted into a noisy walkable region $\mathbf{y}_K$. On the contrary, we learn the reverse process $\mathbf{y}_K \rightarrow \mathbf{y}_{K-1} \rightarrow \cdots \rightarrow \mathbf{y}_0$ to gradually reduce the indeterminacy from $\mathbf{y}_K$ and generate the trajectories. Both the diffusion and reverse diffusion processes are formulated as Markov chains with Gaussian transitions.
First, we formulate the posterior distribution of the diffusion process from $\mathbf{y}_0$ to $\mathbf{y}_K$ as:
$$q(\mathbf{y}_{1:K} \mid \mathbf{y}_0) = \prod_{k=1}^{K} q(\mathbf{y}_k \mid \mathbf{y}_{k-1}), \qquad q(\mathbf{y}_k \mid \mathbf{y}_{k-1}) = \mathcal{N}\big(\mathbf{y}_k; \sqrt{1-\beta_k}\,\mathbf{y}_{k-1},\, \beta_k \mathbf{I}\big),$$
where $\beta_1, \dots, \beta_K$ are fixed variance schedulers that control the scale of the injected noise. Due to the notable property of Gaussian transitions, we can calculate the diffusion process at any step $k$ in a closed form:
$$q(\mathbf{y}_k \mid \mathbf{y}_0) = \mathcal{N}\big(\mathbf{y}_k; \sqrt{\bar{\alpha}_k}\,\mathbf{y}_0,\, (1-\bar{\alpha}_k)\mathbf{I}\big),$$
where $\alpha_k = 1 - \beta_k$ and $\bar{\alpha}_k = \prod_{s=1}^{k} \alpha_s$. Therefore, when $K$ is large enough, we approximately obtain $q(\mathbf{y}_K \mid \mathbf{y}_0) \approx \mathcal{N}(\mathbf{0}, \mathbf{I})$. This indicates that the signal is corrupted into a Gaussian noise distribution by gradually adding noise, which conforms to the non-equilibrium thermodynamics phenomenon of the diffusion process.
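As a concrete illustration of the closed form (a minimal NumPy sketch, not the paper's implementation; the schedule values are placeholders), we can jump directly to any diffusion step $k$ instead of applying the Gaussian transitions one by one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy variance schedule (placeholder values, K = 100 steps).
K = 100
betas = np.linspace(1e-4, 0.05, K)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# A toy 12-step 2D future trajectory y_0 (shape: [12, 2]).
y0 = np.stack([np.linspace(0, 5, 12), np.linspace(0, 2, 12)], axis=1)

def diffuse_closed_form(y0, k, eps):
    """Sample y_k ~ q(y_k | y_0) in one shot: sqrt(a_bar)*y0 + sqrt(1-a_bar)*eps."""
    a_bar = alpha_bars[k - 1]
    return np.sqrt(a_bar) * y0 + np.sqrt(1.0 - a_bar) * eps

# At the final step, the signal is almost fully corrupted into noise.
eps = rng.standard_normal(y0.shape)
yK = diffuse_closed_form(y0, K, eps)
print(alpha_bars[-1])  # close to 0, so y_K is approximately N(0, I)
```

Note that $\bar{\alpha}_k$ decreases monotonically, so the closed form smoothly interpolates between the clean trajectory ($k$ small) and pure noise ($k = K$).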
Next, we formulate the trajectory generation process as a reverse diffusion process from the noise distribution. We model this reverse process by parameterized Gaussian transitions conditioned on the observed trajectories. Given a state feature $\mathbf{f} = F_\psi(\mathbf{x})$ learned by a temporal-social encoder $F_\psi$ parameterized by $\psi$ with the history trajectories $\mathbf{x}$ as input, we formulate the reverse diffusion process as:
$$p_\theta(\mathbf{y}_{0:K} \mid \mathbf{f}) = p(\mathbf{y}_K) \prod_{k=1}^{K} p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f}), \qquad p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f}) = \mathcal{N}\big(\mathbf{y}_{k-1}; \boldsymbol{\mu}_\theta(\mathbf{y}_k, k, \mathbf{f}),\, \sigma_k^2 \mathbf{I}\big),$$
where $p(\mathbf{y}_K) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the initial noise Gaussian distribution and $\theta$ denotes the parameters of the diffusion model. Both the diffusion model parameters $\theta$ and the encoder parameters $\psi$ are trained using the trajectory data. Note that we share the network parameters across all transitions. As shown in previous work [ho2020denoising], the variance term of the Gaussian transition can be set as $\sigma_k^2 = \beta_k$. This setting corresponds to the upper bound on the reverse process entropy for the data and shows good performance in practice [sohl2015deep].
3.3 Training Objective
Having formulated the diffusion and reverse diffusion processes, we describe how to train the diffusion model. To predict the real trajectory $\mathbf{y}_0$, the desired training should optimize the log-likelihood $\mathbb{E}[\log p_\theta(\mathbf{y}_0 \mid \mathbf{f})]$ in the reverse process. However, the exact log-likelihood is intractable; we thus maximize its variational lower bound for optimization:
$$\mathbb{E}\big[\log p_\theta(\mathbf{y}_0 \mid \mathbf{f})\big] \geq \mathbb{E}_q\Big[\log \frac{p_\theta(\mathbf{y}_{0:K} \mid \mathbf{f})}{q(\mathbf{y}_{1:K} \mid \mathbf{y}_0)}\Big].$$
We utilize the negative bound as the loss function and perform the training by minimizing:
$$L(\theta, \psi) = \mathbb{E}_q\Big[\sum_{k=2}^{K} D_{\mathrm{KL}}\big(q(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{y}_0) \,\|\, p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f})\big) - \log p_\theta(\mathbf{y}_0 \mid \mathbf{y}_1, \mathbf{f})\Big].$$
Here we describe how to calculate the first term, the KL divergence $D_{\mathrm{KL}}\big(q(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{y}_0) \,\|\, p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f})\big)$. The posterior $q(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{y}_0)$ is tractable and can be represented by a Gaussian distribution:
$$q(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{y}_0) = \mathcal{N}\big(\mathbf{y}_{k-1}; \tilde{\boldsymbol{\mu}}_k(\mathbf{y}_k, \mathbf{y}_0),\, \tilde{\beta}_k \mathbf{I}\big),$$
where the closed forms of $\tilde{\boldsymbol{\mu}}_k$ and $\tilde{\beta}_k$ are:
$$\tilde{\boldsymbol{\mu}}_k(\mathbf{y}_k, \mathbf{y}_0) = \frac{\sqrt{\bar{\alpha}_{k-1}}\,\beta_k}{1-\bar{\alpha}_k}\,\mathbf{y}_0 + \frac{\sqrt{\alpha_k}\,(1-\bar{\alpha}_{k-1})}{1-\bar{\alpha}_k}\,\mathbf{y}_k, \qquad \tilde{\beta}_k = \frac{1-\bar{\alpha}_{k-1}}{1-\bar{\alpha}_k}\,\beta_k,$$
where the coefficients of $\mathbf{y}_0$ and $\mathbf{y}_k$ have no effect on the gradient direction. Note that the second term $-\log p_\theta(\mathbf{y}_0 \mid \mathbf{y}_1, \mathbf{f})$ can also be formulated in the same form when $k = 1$. Finally, we apply the parameterization method shown in [ho2020denoising] to reparameterize:
$$\mathbf{y}_k = \sqrt{\bar{\alpha}_k}\,\mathbf{y}_0 + \sqrt{1-\bar{\alpha}_k}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$
and obtain a simplified loss function:
$$L(\theta, \psi) = \mathbb{E}_{\boldsymbol{\epsilon}, \mathbf{y}_0, k}\Big[\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{y}_k, k, \mathbf{f})\big\|^2\Big],$$
where $\boldsymbol{\epsilon}_\theta$ is the noise-prediction network, and the training is performed with the step $k$ uniformly sampled from $\{1, \dots, K\}$. (Please see the detailed derivations and algorithms in the supplementary material.)
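Under these definitions, a single training step reduces to noise-matching regression. A hedged NumPy sketch, with a stand-in zero predictor in place of the paper's Transformer network:

```python
import numpy as np

rng = np.random.default_rng(1)

K = 100
betas = np.linspace(1e-4, 0.05, K)   # placeholder schedule
alpha_bars = np.cumprod(1.0 - betas)

def training_loss(y0, f, eps_model):
    """One Monte-Carlo sample of the simplified loss E||eps - eps_theta(y_k, k, f)||^2."""
    k = rng.integers(1, K + 1)                    # uniform step k in {1..K}
    eps = rng.standard_normal(y0.shape)           # target noise
    a_bar = alpha_bars[k - 1]
    y_k = np.sqrt(a_bar) * y0 + np.sqrt(1.0 - a_bar) * eps  # reparameterized q(y_k|y_0)
    eps_pred = eps_model(y_k, k, f)               # network prediction
    return np.mean((eps - eps_pred) ** 2)

# Stand-in for eps_theta: a model that ignores its inputs and predicts zero noise.
zero_model = lambda y_k, k, f: np.zeros_like(y_k)

y0 = rng.standard_normal((12, 2))   # toy ground-truth future trajectory
f = rng.standard_normal(256)        # toy state embedding from the encoder
loss = training_loss(y0, f, zero_model)
```

For the zero predictor the loss is simply a Monte-Carlo estimate of $\mathbb{E}\|\boldsymbol{\epsilon}\|^2$; a trained network would drive this value toward zero.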
3.4 Inference
Once the reverse process is trained, we can generate plausible trajectories from a Gaussian noise $\mathbf{y}_K \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ through the reverse process $p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f})$. With the reparameterization above, we generate the trajectories from $k = K$ down to $k = 1$ as:
$$\mathbf{y}_{k-1} = \frac{1}{\sqrt{\alpha_k}}\Big(\mathbf{y}_k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\boldsymbol{\epsilon}_\theta(\mathbf{y}_k, k, \mathbf{f})\Big) + \sigma_k \mathbf{z},$$
where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is a random variable and $\boldsymbol{\epsilon}_\theta$ is the trained network whose inputs include the previous step's prediction $\mathbf{y}_k$, the state embedding $\mathbf{f}$, and the step $k$.
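The generation procedure above can be sketched as a plain denoising loop (NumPy; the noise-prediction network is a placeholder, whereas in the real model it would be the trained Transformer conditioned on the state embedding):

```python
import numpy as np

rng = np.random.default_rng(2)

K = 100
betas = np.linspace(1e-4, 0.05, K)   # placeholder schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def sample_trajectory(eps_model, f, shape=(12, 2)):
    """Reverse diffusion: start from y_K ~ N(0, I) and denoise down to y_0."""
    y = rng.standard_normal(shape)                        # y_K
    for k in range(K, 0, -1):
        a, a_bar, beta = alphas[k - 1], alpha_bars[k - 1], betas[k - 1]
        eps_pred = eps_model(y, k, f)
        mean = (y - beta / np.sqrt(1.0 - a_bar) * eps_pred) / np.sqrt(a)
        z = rng.standard_normal(shape) if k > 1 else 0.0  # no noise at the last step
        y = mean + np.sqrt(beta) * z                      # sigma_k^2 = beta_k
    return y

f = rng.standard_normal(256)                              # toy state embedding
dummy_model = lambda y, k, f: np.zeros_like(y)            # placeholder eps_theta
traj = sample_trajectory(dummy_model, f)
```

Running the loop for fewer than $K$ steps corresponds to stopping the chain early, which is how the framework trades determinacy for diversity.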
3.5 Network Architecture
Different from the widely used UNet [ronneberger2015u] in image-based diffusion models [ho2020denoising, nichol2021improved, dhariwal2021diffusion], we design a new Transformer-based network architecture for our MID. With the Transformer, the model can better exploit the temporal dependencies of paths for the trajectory prediction task. To be specific, MID consists of two key networks: an encoder network with parameters $\psi$, which learns the state embedding $\mathbf{f}$ from the observed history trajectories and their social interactions, and a Transformer-based decoder parameterized by $\theta$ for the reverse diffusion process. An overview of the whole architecture is depicted in Figure 2. We introduce each part in detail in the following.
The encoder network models the history behaviors and social interactions as the state embedding $\mathbf{f}$. This embedding is fed into the decoder network as the condition of the diffusion model. Note that designing the network to model social interactions is not the main focus of this work; MID is an encoder-agnostic framework which can be directly equipped with the encoders introduced in previous methods. In the experiments, we apply the encoder of Trajectron++ [salzmann2020trajectron++] for its superior representation ability.
For the decoder, we design a Transformer-based architecture to model the Gaussian transitions of the Markov chain. As shown in Figure 2, the inputs of the decoder include the ground truth trajectory $\mathbf{y}_0$, the noise variable $\boldsymbol{\epsilon}$, the condition feature $\mathbf{f}$ from the encoder, and a time embedding. In step $k$, we first add noise into the trajectory to obtain $\mathbf{y}_k$. Simultaneously, we calculate the time embedding of step $k$ and concatenate it with the feature of the observed trajectory. Then, we apply fully-connected layers to upsample both the trajectory $\mathbf{y}_k$ and the condition feature to the Transformer dimension, and sum the outputs as the fused feature. We also add a positional embedding in the form of sinusoidal functions to the summation to emphasize the positional relations at different trajectory timestamps. Finally, the fused feature with the positional embedding is fed into the Transformer network to learn the complex spatial-temporal clues. The Transformer-based decoder consists of three self-attention layers to sufficiently model the temporal dependencies in trajectories; it takes the high-dimensional sequence as input and outputs a sequence with the same dimension. With a fully-connected layer, we downsample the output sequence to the trajectory dimension. We finally compute the mean squared error (MSE) loss between the output and the sampled Gaussian noise for the current iteration to optimize the network. Please see the network details in the supplementary material.
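A schematic of this data flow (a NumPy sketch under assumed dimensions, with randomly initialized weights in place of trained layers, and a single toy self-attention layer standing in for the three-layer Transformer):

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 12, 512          # future length, Transformer dimension (512 per the paper)

def sinusoidal_embedding(positions, dim):
    """Standard sinusoidal positional embedding over trajectory timestamps."""
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    args = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

# Inputs: noisy trajectory y_k [T, 2] and a condition feature f
# (standing in for the concatenated state + time embedding, assumed 256-d).
y_k = rng.standard_normal((T, 2))
f = rng.standard_normal(256)

W_up_y = rng.standard_normal((2, D)) * 0.02     # upsample trajectory to D
W_up_f = rng.standard_normal((256, D)) * 0.02   # upsample condition to D
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
W_down = rng.standard_normal((D, 2)) * 0.02     # downsample back to 2D coordinates

fused = y_k @ W_up_y + f @ W_up_f               # sum of upsampled features (broadcast over T)
fused = fused + sinusoidal_embedding(np.arange(T, dtype=float), D)
hidden = self_attention(fused, Wq, Wk, Wv)      # stands in for 3 Transformer layers
eps_pred = hidden @ W_down                      # predicted noise, same shape as y_k
```

The output has the same shape as the noisy trajectory, matching the noise-prediction formulation of the training objective.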
4 Experiments
In this section, we first compared the proposed method with state-of-the-art approaches on two widely used pedestrian trajectory prediction benchmarks, then conducted ablation studies to analyze the effectiveness of the key components of our MID framework, and provided an analysis of the reverse diffusion process.
4.1 Experimental Setup
Datasets: We evaluated our method on two public pedestrian trajectory forecasting benchmarks: the Stanford Drone Dataset (SDD) [robicquet2016learning] and ETH/UCY [pellegrini2010improving, lerner2007crowds].
Stanford Drone Dataset: The Stanford Drone Dataset [robicquet2016learning] is a well-established benchmark for human trajectory prediction in bird's-eye view. The dataset consists of 20 scenes captured by a drone in top-down view around a university campus, containing several types of moving agents such as humans and vehicles.
ETH/UCY: The ETH [pellegrini2010improving] and UCY [lerner2007crowds] dataset group consists of five different scenes – ETH & HOTEL (from ETH) and UNIV, ZARA1, & ZARA2 (from UCY). All the scenes report the position of pedestrians in world-coordinates and hence the results we report are in metres. The scenes are captured in unconstrained environments with few objects blocking pedestrian paths.
| Method | Input | Samples | ADE | FDE |
| CGNS [li2019conditional] | T + I | 20 | 15.60 | 28.20 |
| SimAug [liang2020simaug] | T + I | 20 | 10.27 | 19.71 |
| Y-Net [mangalam2021goals] | T + I | 20 | 8.97 | 14.61 |
| Y-Net [mangalam2021goals] + TTST | T + I | 10000 | 7.85 | 11.85 |
| Method | Input | Samples | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG |
| SoPhie [sadeghian2019sophie] | T + I | 20 | 0.70/1.43 | 0.76/1.67 | 0.54/1.24 | 0.30/0.63 | 0.38/0.78 | 0.54/1.15 |
| CGNS [li2019conditional] | T + I | 20 | 0.62/1.40 | 0.70/0.93 | 0.48/1.22 | 0.32/0.59 | 0.35/0.71 | 0.49/0.97 |
| Social-BiGAT [kosaraju2019social] | T + I | 20 | 0.69/1.29 | 0.49/1.01 | 0.55/1.32 | 0.30/0.62 | 0.36/0.75 | 0.48/1.00 |
| MG-GAN [dendorfer2021mg] | T + I | 20 | 0.47/0.91 | 0.14/0.24 | 0.54/1.07 | 0.36/0.73 | 0.29/0.60 | 0.36/0.71 |
| Y-Net [mangalam2021goals] + TTST | T + I | 10000 | 0.28/0.33 | 0.10/0.14 | 0.24/0.41 | 0.17/0.27 | 0.13/0.22 | 0.18/0.27 |
We adopted the widely used evaluation metrics Average Displacement Error (ADE) and Final Displacement Error (FDE). ADE computes the average error between all the ground truth positions and the estimated positions in the trajectory, and FDE computes the displacement between the endpoints of the ground truth and predicted trajectories. The trajectories are sampled at 0.4-second intervals, where the first 3.2 seconds (8 frames) of a trajectory are used as observed data to predict the next 4.8 seconds (12 frames) of future trajectory. For the ETH/UCY dataset, we followed the leave-one-out cross-validation strategy, training our model on four scenes and testing on the remaining one [gupta2018social, kosaraju2019social, huang2019stgat, salzmann2020trajectron++]. Considering the stochastic property of our method, we used the Best-of-N strategy to compute the final ADE and FDE with N = 20.
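For reference, the two metrics and the Best-of-N protocol can be computed as follows (a minimal sketch; the shapes and toy data are illustrative):

```python
import numpy as np

def ade(pred, gt):
    """Average Displacement Error: mean L2 error over all predicted timestamps."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))

def fde(pred, gt):
    """Final Displacement Error: L2 error at the last predicted timestamp."""
    return np.linalg.norm(pred[-1] - gt[-1])

def best_of_n(preds, gt):
    """Best-of-N: keep the sample with the lowest ADE (and report its FDE)."""
    ades = [ade(p, gt) for p in preds]
    best = int(np.argmin(ades))
    return ades[best], fde(preds[best], gt)

# Toy example: ground truth is a straight line; two of N=3 samples drift away.
gt = np.stack([np.arange(12.0), np.zeros(12)], axis=1)
preds = [gt + 0.1, gt + 1.0, gt - 0.5]
best_ade, best_fde = best_of_n(preds, gt)
```

Here the first sample (offset 0.1 in both coordinates) is selected, so both metrics equal $0.1\sqrt{2}$.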
Implementation Details: We devised a three-layer Transformer as the core network of our MID, where the Transformer dimension is set to 512 and 4 attention heads are applied. We employed one fully-connected layer to upsample the input of the model from dimension 2 to the Transformer dimension, and another fully-connected layer to upsample the observed trajectory feature to the same dimension. We utilized three fully-connected layers to progressively downsample the Transformer output sequence to the predicted trajectory, i.e., 512d–256d–2d. The training was performed with the Adam optimizer, and all experiments were conducted on a single Tesla V100 GPU.
4.2 Comparison with state-of-the-art methods
We quantitatively compare our method with a wide range of current methods. As shown in Table 1, we provide the comparison between our method and existing methods on the Stanford Drone dataset. We categorize methods as Trajectory-Only (T) methods and Trajectory-and-Image (T+I) methods, as the additional image information may be crucial in certain circumstances yet increases computation cost. Besides, we also report the sampling number, since increasing the number of samples can effectively improve performance. We provide the results under the standard 20 samples for MID and other methods for a fair comparison. We observe that our method achieves the best average ADE/FDE in pixel coordinates among all current methods, regardless of whether image data is involved. Specifically, our MID outperforms the current state-of-the-art T+I method Y-Net+TTST on the ADE metric. Note that our method uses neither image data nor any post-processing such as the Test-Time Sampling Trick (TTST) [mangalam2021goals]. We provide the results with the sampling trick in the supplementary material.
We also conducted experiments on the ETH/UCY dataset and tabulated the results in Table 2. Our method achieves comparable performance with only trajectory input under 20 samples. We found that MID benefits more from larger datasets (e.g., the SDD dataset).
4.3 Ablation Studies
In this subsection, we conducted ablation studies to investigate the effectiveness of each key component, including the diffusion model and the Transformer architecture. Then, we provided a detailed analysis of the reverse diffusion process.
Diffusion Model: To examine the importance of our diffusion model, we degraded our MID into a CVAE-based framework, Trajectron++. We replaced its decoder, the commonly used LSTM, with our Transformer in this CVAE-based framework to verify whether the performance boost comes from the Transformer. Groups 2 and 4 in Table 3 show the performance comparison. We observe that with the same encoder and decoder but without our diffusion model, the results degrade significantly, demonstrating the effectiveness of the diffusion model. Besides, merely replacing the decoder with our Transformer architecture in the CVAE-based framework does not improve performance, as shown by group 4 in Table 3.
Transformer Architecture: We also conducted experiments on the decoder architecture of MID. According to groups 1 and 3 in Table 3, the Transformer outperforms the Linear and LSTM architectures by a large margin. This indicates that the Transformer architecture is effective for MID in modeling the temporal dependencies of trajectories. Besides, we evaluated Transformer architectures with different dimensions. As tabulated in groups 1 and 2 of Table 3, we observe that the Transformer with dimension 512 leads to the best performance, and further increasing the Transformer dimension or model parameters does not yield better results.
Analysis of the Reverse Diffusion Process: To further explore the reverse diffusion process, we generated trajectories at each reverse diffusion step and analyzed the gradual change of the distribution. We provide an analysis of the reverse diffusion step versus the corresponding diversity and ADE/FDE, as illustrated in Figure 3. The trajectory diversity is calculated as the average Euclidean distance between any two of the generated trajectories. When the reverse diffusion step is small, the trajectory distribution is more indeterminate and produces highly diverse trajectories. As the reverse diffusion step increases, we observe a decline in diversity and a rise in determinacy. With our MID framework, we can control the degree of indeterminacy by adjusting the number of steps, achieving a flexible trade-off between the diversity and determinacy of the generated trajectories.
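The diversity measure used here, the average pairwise Euclidean distance among the generated trajectories, can be sketched as follows (hedged: we assume each trajectory is a [T, 2] array and use the flattened L2 distance between trajectories):

```python
import numpy as np

def trajectory_diversity(samples):
    """Average Euclidean distance between every pair of generated trajectories."""
    n = len(samples)
    dists = [np.linalg.norm(samples[i] - samples[j])
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))

# Toy check: identical samples have zero diversity; spread samples do not.
base = np.zeros((12, 2))
assert trajectory_diversity([base, base.copy()]) == 0.0
spread = [base + k for k in range(4)]   # four increasingly shifted trajectories
div = trajectory_diversity(spread)
```

As expected, the more the samples spread over the walkable region, the larger this value becomes.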
In addition, we visualize the distributions of trajectories as contours in Figure 4, where the contour maps are sampled at ten-step intervals. We see that the contours are diverse at the early stage of the reverse process and gradually deform to be more concentrated and fit the ground truth trajectory.
4.4 Qualitative Evaluation
We further investigated the ability of our framework through qualitative results. Figure 5 illustrates the most-likely predictions of our MID and Trajectron++ [salzmann2020trajectron++] on all five scenes of the ETH/UCY dataset. The qualitative results show that both MID and Trajectron++ fit the ground truth paths well. We observe that Trajectron++ performs similarly to MID for short-term forecasting yet deviates slightly from the ground truth path for longer predictions. Besides, we visualize multiple predicted trajectories on SDD in Figure 6. We observe that all predictions are feasible conditioned on the observed trajectories. Though the reverse diffusion process reduces the ambiguity, the generated trajectories still retain rich diversity within the walkable region.
5 Conclusion & Discussion
In this paper, we introduced a new MID framework to formulate trajectory prediction with motion indeterminacy diffusion. In this framework, we learned a parameterized Markov chain conditioned on the observed trajectories to gradually discard the indeterminacy, evolving from ambiguous walkable areas to acceptable trajectories. By adjusting the length of the chain, we can achieve a trade-off between diversity and determinacy. Besides, we designed a Transformer-based architecture as the core network of our method to model the complex temporal dependencies in trajectories. Experimental results demonstrate the superiority of our method, which achieves state-of-the-art performance on the Stanford Drone and ETH/UCY benchmarks.
Broader Impact: MID could be applied to a wide range of applications involving human-robot interaction. With indeterminacy modeling, we can generate accurate and reasonable future trajectories, which aids decision making in autonomous driving. Besides, MID can adjust the degree of indeterminacy, which has the potential to be applied in dynamic and interactive environments.
Limitations: Despite the promising performance and the applicable trade-off, the time cost of the reverse diffusion process can be expensive due to the multiple denoising steps (100 in our experiments). When evaluated with 512 trajectories on the ZARA1 dataset, MID requires considerably more inference time than Trajectron++ under the 100-diffusion-step setting. Fortunately, many recent efforts have been made to significantly reduce the sampling cost while keeping high generation quality [nichol2021improved, song2020score, jolicoeurmartineau2021gotta, san2021noise, watson2021learning]. However, plugging these methods into our MID is not trivial; we leave building a more efficient system as future work.
This work was supported in part by the National Natural Science Foundation of China under Grant 62125603, and Grant U1813218, in part by a grant from the Beijing Academy of Artificial Intelligence (BAAI).
Appendix A Detailed Derivations
A.1 Derivations of the Loss Function
We give the derivation of our loss function. Expanding the negative variational lower bound over the Markov chain:
$$-\mathbb{E}_q\Big[\log \frac{p_\theta(\mathbf{y}_{0:K} \mid \mathbf{f})}{q(\mathbf{y}_{1:K} \mid \mathbf{y}_0)}\Big] = \mathbb{E}_q\Big[\sum_{k=2}^{K} D_{\mathrm{KL}}\big(q(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{y}_0) \,\|\, p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f})\big) - \log p_\theta(\mathbf{y}_0 \mid \mathbf{y}_1, \mathbf{f}) + D_{\mathrm{KL}}\big(q(\mathbf{y}_K \mid \mathbf{y}_0) \,\|\, p(\mathbf{y}_K)\big)\Big].$$
We ignore the last term $D_{\mathrm{KL}}\big(q(\mathbf{y}_K \mid \mathbf{y}_0) \,\|\, p(\mathbf{y}_K)\big)$ because it has no learnable parameters, and obtain the loss function:
$$L = \mathbb{E}_q\Big[\sum_{k=2}^{K} D_{\mathrm{KL}}\big(q(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{y}_0) \,\|\, p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f})\big) - \log p_\theta(\mathbf{y}_0 \mid \mathbf{y}_1, \mathbf{f})\Big].$$
A.2 Derivations of the Reparameterization
As shown in the loss function, we should match the reverse transition $p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f})$ to the ground-truth posterior $q(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{y}_0)$, both of which are Gaussian. We can thus convert the KL divergence of the two Gaussian distributions into a difference of their means. We calculate the mean of the posterior in a closed form:
$$\tilde{\boldsymbol{\mu}}_k(\mathbf{y}_k, \mathbf{y}_0) = \frac{\sqrt{\bar{\alpha}_{k-1}}\,\beta_k}{1-\bar{\alpha}_k}\,\mathbf{y}_0 + \frac{\sqrt{\alpha_k}\,(1-\bar{\alpha}_{k-1})}{1-\bar{\alpha}_k}\,\mathbf{y}_k,$$
where $\alpha_k = 1 - \beta_k$ and $\bar{\alpha}_k = \prod_{s=1}^{k} \alpha_s$. By the reparameterization, we formulate $\mathbf{y}_k$ as a function of $\mathbf{y}_0$ and $\boldsymbol{\epsilon}$:
$$\mathbf{y}_k = \sqrt{\bar{\alpha}_k}\,\mathbf{y}_0 + \sqrt{1-\bar{\alpha}_k}\,\boldsymbol{\epsilon},$$
where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is a random variable, and we have
$$\mathbf{y}_0 = \frac{1}{\sqrt{\bar{\alpha}_k}}\big(\mathbf{y}_k - \sqrt{1-\bar{\alpha}_k}\,\boldsymbol{\epsilon}\big).$$
Then we reformulate $\tilde{\boldsymbol{\mu}}_k$ by substituting this expression for $\mathbf{y}_0$:
$$\tilde{\boldsymbol{\mu}}_k = \frac{1}{\sqrt{\alpha_k}}\Big(\mathbf{y}_k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\boldsymbol{\epsilon}\Big).$$
Therefore, the KL term can be formulated as:
$$D_{\mathrm{KL}}\big(q(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{y}_0) \,\|\, p_\theta(\mathbf{y}_{k-1} \mid \mathbf{y}_k, \mathbf{f})\big) = \mathbb{E}_q\Big[\frac{1}{2\sigma_k^2}\big\|\tilde{\boldsymbol{\mu}}_k(\mathbf{y}_k, \mathbf{y}_0) - \boldsymbol{\mu}_\theta(\mathbf{y}_k, k, \mathbf{f})\big\|^2\Big].$$
Then we show why the last term $-\log p_\theta(\mathbf{y}_0 \mid \mathbf{y}_1, \mathbf{f})$ is tractable with the same formulation at $k = 1$. This term means that the output of the prediction model should follow the distribution of the real data. Since the reverse transition is Gaussian, we can also convert this loss into the difference between the mean of the Gaussian transition and the ground truth $\mathbf{y}_0$. Moreover, for $\tilde{\boldsymbol{\mu}}_k$ under $k = 1$, we have $\bar{\alpha}_0 = 1$ and $1 - \bar{\alpha}_1 = \beta_1$, so that
$$\tilde{\boldsymbol{\mu}}_1(\mathbf{y}_1, \mathbf{y}_0) = \frac{\sqrt{\bar{\alpha}_0}\,\beta_1}{1-\bar{\alpha}_1}\,\mathbf{y}_0 + \frac{\sqrt{\alpha_1}\,(1-\bar{\alpha}_0)}{1-\bar{\alpha}_1}\,\mathbf{y}_1 = \mathbf{y}_0,$$
which demonstrates that the losses at $k = 1$ and at $k > 1$ share the same form.
As shown above, the loss function expects the model to predict the noise $\boldsymbol{\epsilon}$ given the inputs $\mathbf{y}_k$ and $k$. Since $\mathbf{y}_k$ is an input, we only need a network to predict the noise as $\boldsymbol{\epsilon}_\theta(\mathbf{y}_k, k, \mathbf{f})$. Thus, the final loss function is formulated as:
$$L(\theta, \psi) = \mathbb{E}_{\boldsymbol{\epsilon}, \mathbf{y}_0, k}\Big[\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\mathbf{y}_k, k, F_\psi(\mathbf{x})\big)\big\|^2\Big],$$
where $F_\psi(\mathbf{x})$ denotes that we further consider the encoder network in the loss function. Once the network is trained, we can use it to obtain the mean of the Gaussian transition.
Furthermore, the trajectory at the next step is predicted as:
$$\mathbf{y}_{k-1} = \frac{1}{\sqrt{\alpha_k}}\Big(\mathbf{y}_k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\boldsymbol{\epsilon}_\theta(\mathbf{y}_k, k, \mathbf{f})\Big) + \sigma_k \mathbf{z}, \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$
Appendix B Implementation Details
In this section, we introduce the implementation details of our method, including the hyper-parameters for training, the network architecture, the algorithms of training and inference, and the attached code.
Diffusion Process and Hyper-parameters: We set the lower bound of the variance scheduler $\beta_k$ to 0.0001 and the upper bound to 0.05, with the intermediate values uniformly spaced between the bounds. For the main Transformer network in the diffusion model $\boldsymbol{\epsilon}_\theta$, we devise three Transformer encoder layers, each with a dimension of 512, a feedforward dimension of 1024, and 4 attention heads. For the encoder $F_\psi$, we utilize the default configuration provided by Trajectron++ [salzmann2020trajectron++].
Upsample-Downsample Layers: We employ an MLP-based sub-network to upsample the raw trajectory from 2d to 512d, and to downsample the output of the Transformer as 512d–256d–2d to form the final output of the network. Each sub-network, parameterized by trainable weights $W_1, W_2, W_3$ and biases $b_1, b_2, b_3$, contains three MLP layers, which we can formulate as:
$$f(\mathbf{h}, \mathbf{c}) = W_3\,\Phi\big(W_2\,\Phi\big(W_1\,[\mathbf{h}; \mathbf{c}] + b_1\big) + b_2\big) + b_3,$$
where $\mathbf{c}$ is the concatenation of the step-number embedding and the state embedding, $\mathbf{h}$ denotes the input trajectory feature of the sub-network, and $\Phi$ corresponds to a sigmoid function.
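A sketch of such a sub-network (NumPy, random weights; the condition dimension and hidden size of 256 are assumptions chosen to match the 512d–256d–2d description, and the sigmoid placement follows our reading of the formula above):

```python
import numpy as np

rng = np.random.default_rng(4)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def make_subnet(in_dim, cond_dim, hidden, out_dim):
    """Three-layer MLP: f(h, c) = W3*sig(W2*sig(W1*[h;c] + b1) + b2) + b3."""
    W1 = rng.standard_normal((in_dim + cond_dim, hidden)) * 0.02
    W2 = rng.standard_normal((hidden, hidden)) * 0.02
    W3 = rng.standard_normal((hidden, out_dim)) * 0.02
    b1, b2, b3 = np.zeros(hidden), np.zeros(hidden), np.zeros(out_dim)
    def forward(h, c):
        # Broadcast the condition vector c across all T timestamps, then concat.
        hc = np.concatenate([h, np.broadcast_to(c, (h.shape[0], c.shape[0]))], axis=1)
        return sigmoid(sigmoid(hc @ W1 + b1) @ W2 + b2) @ W3 + b3
    return forward

# Downsample the [T, 512] Transformer output to [T, 2] coordinates,
# conditioned on a (step embedding + state embedding) vector c (64-d assumed).
downsample = make_subnet(in_dim=512, cond_dim=64, hidden=256, out_dim=2)
seq = rng.standard_normal((12, 512))
c = rng.standard_normal(64)
coords = downsample(seq, c)
```

The final layer is linear so the output can take arbitrary coordinate values, while the hidden layers use the sigmoid $\Phi$ from the formula.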
Appendix C Additional Experiments
We also report the ADE (left) and FDE (right) curves of the min_3/min_5 metrics within reverse diffusion steps from 0 to 100 in Figure 7. We observe that reducing the diversity also leads to better predictions with fewer samples, which demonstrates that diversity and determinacy remain contradictory even with few samples.
Additionally, we found that the sampling trick is very effective in improving model performance. Sampling tricks usually increase the number of samples and apply post-processing (clustering in Y-Net [mangalam2021goals] and best-sample selection in Expert [zhao2021you]). As shown in Table 4, the performance improves significantly when we increase the number of samples as in Expert. However, we do not encourage using more samples, since more samples imply more computation cost.