Code for "Long-Term Pedestrian Trajectory Prediction Using Mutable Intention Filter and Warp LSTM", Huang et al., RA-L 2020
Trajectory prediction is one of the key capabilities for robots to safely navigate and interact with pedestrians. Critical insights from human intention and behavioral patterns need to be effectively integrated into long-term pedestrian behavior forecasting. We present a novel intention-aware motion prediction framework, which consists of a Residual Bidirectional LSTM (ReBiL) and a mutable intention filter. Instead of learning step-wise displacement, we propose learning an offset to warp a nominal intention-aware linear prediction, giving residual learning a physical intuition. Our intention filter is inspired by genetic algorithms and particle filtering, where particles mutate intention hypotheses throughout the pedestrian motion with ReBiL as the motion model. Through experiments on a publicly available dataset, we show that our method outperforms baseline approaches and demonstrate its robust performance under abnormal intention-changing scenarios.
As humans, we make effective predictions of pedestrian trajectories over long time horizons even when novel behavior is present, and continually estimate people’s underlying goals or intent from subtle motion. If robots are to similarly navigate and interact safely with people, reliable forecasts of human behavior are required. For long-term prediction of pedestrian trajectories, robots face two key challenges in emulating human performance. The first challenge is to build a generative model that incorporates intention information and captures social norms. The second challenge is to infer the correct intention from observations of pedestrian motion.
Pioneering work applied both model-based [16, 41, 49, 26] and data-driven techniques [1, 14, 50, 11] to the pedestrian behavior modeling challenge. The intention inference challenge is often posed as either a classification problem [43, 48, 12, 33] or a filtering problem [34, 29, 9]. While these previous studies have made significant progress, we see several potential limitations. First, the influence of human intention and pedestrian motion patterns on trajectory prediction is nontrivial to balance. Model-based methods often require numerous parameters to be tuned in order to comprehensively model pedestrian behavior. Alternatively, simple methods may have insufficient features, which could lead to a relatively trivial prediction, such as the orange path shown in Fig. 1. As for learning-based methods, the increasingly popular recurrent neural network (RNN) is often used to predict the displacement between pedestrian positions in neighboring frames. Accumulation of error in displacement prediction typically results in drift of the long-term pedestrian trajectory prediction, as illustrated by the blue path deviating from the ground truth in Fig. 1. The intention may be concatenated with other high-level features as the input to the RNN, but we find this approach struggles to guide the RNN to locate the end of the prediction within the correct goal region, and the drifting issue still exists.
Another common limitation is using only recent observations, which wastes the earlier observation history. Assuming a pedestrian can be tracked once it is detected, the available trajectory observations grow until the pedestrian reaches the goal. However, using a short sliding window of trajectory data as the input to the prediction model is a typical choice for learning-based methods. The prediction results from the last sliding window may not be used to assist the prediction on the current sliding window. Lastly, changes in human intention may produce previously unseen pedestrian behavior, which is often challenging to predict with data-driven approaches. When a pedestrian is walking along the red path shown in Fig. 1, the sudden turn could be detected as anomalous because the prediction deviates too much from the ground truth [11, 30]. Either a more conservative control policy may be switched on to help the robot cautiously avoid the pedestrian, or the robot may attempt to learn the novel motion pattern online.
To overcome these limitations, we propose an intention-aware trajectory prediction framework (see Fig. 2), which fuses a Residual Bidirectional LSTM (ReBiL) and a mutable intention filter. Inside the framework, ReBiL predicts the trajectory using the estimated intention from the mutable intention filter, while the mutable intention filter infers the intention from trajectory samples generated by ReBiL. Instead of learning sequential displacement, ReBiL learns the offset to warp a nominal prediction from an intention-aware linear model, such as the white path in Fig. 1. Thus, the intention information is captured as the nominal prediction input to ReBiL. With this formulation, we endow the residual with a physical insight. The offset between the nominal and the final prediction represents the warping effect of the map and human tendency on a goal-directed pedestrian trajectory. ReBiL requires goal position information from the correct intention to generate reliable predictions, so a mutable intention filter is built on the particle filter developed in our previous work, which continuously filters pedestrian intentions throughout pedestrian motion. An intention mutation mechanism is introduced from genetic algorithms to make the framework resilient under intention-changing conditions. Our contributions are fourfold:
A residual network structure is introduced to predict pedestrian trajectory by learning the offset from intention-aware nominal prediction.
We apply a bidirectional LSTM to propagate physical intention information back through the whole trajectory.
A genetic-inspired intention filter framework is proposed to robustly perform trajectory prediction even for unseen pedestrian behavior.
We demonstrate that our method surpasses baselines on a publicly available dataset.
This paper is organized as follows. Section II summarizes learning-based methods for pedestrian trajectory prediction and the methodology related to our work. Section III formulates the problem, and describes ReBiL and the mutable intention filter in detail. Section IV elaborates on two experiments for ReBiL and the mutable intention filter, and discusses results with trajectory visualization and quantitative evaluation. Our conclusions and future work are presented in Section V.
RNN for Pedestrian Trajectory Prediction. The power of RNNs in generating sequences is used to model pedestrian behavior from various perspectives. For short-term human-human interaction, many structures like social pooling layers, Generative Adversarial Network (GAN), Conditional Variational Autoencoder (CVAE), and spatio-temporal graphs are proposed to encode hidden states of neighbors through an RNN from observed trajectories. Many research efforts focus on integrating contextual cues into RNNs for long-term trajectory prediction, including intent and map information [35, 20]. Convolutional Neural Networks are used to extract map information from scene images or high-definition semantic maps. The distance from humans to static obstacles may be encoded to introduce the influence from the obstacles to the humans. As for intent, the probability distribution over possible goal regions may be used to select among RNNs trained for different intentions, or used as input to the downstream of the architecture.
Residual LSTM. A shortcut connection between neural network layers builds a gradient highway that helps an extremely deep CNN learn effectively through backpropagation. This gradient highway idea is also investigated for RNNs. LSTM itself provides an uninterrupted gradient flow between cell states in the temporal domain to alleviate vanishing or exploding gradient problems. Stacked residual LSTM is proposed for phrase generation tasks, where shortcut paths are added between LSTM layers in an attempt to achieve efficient training on deep LSTM networks. Similar structures are explored in various sequential tasks [23, 18, 44, 40, 45, 51, 22, 46]. The motivation for applying the residual idea to LSTM in these previous works is to make training more efficient by building a smooth gradient flow for deeper networks. To the best of our knowledge, our method is the first to explore the physical intuition of the residual in sequence tasks.
Intention Filter. In practice, trajectory data is recorded once the pedestrian is detected. Information from previously recorded data is essential to infer key properties of long-term pedestrian motion such as the intention [52, 25]. A Kalman filter based on interacting multiple models takes into account different pedestrian motion types and has been applied to pedestrian intention recognition. The particle filter is an alternative filtering approach that uses particles (i.e., samples) to model various distributions over pedestrian goals besides the Gaussian distribution [34, 29, 9].
In this work, we assume that a human’s motion is determined by an unknown intention, which denotes a desired goal region. The final position of the human’s trajectory is located in the goal region given by the intention. The intention belongs to a finite set, which is given as prior knowledge of the map. At each timestamp, the observation history of the human’s position is available. Our goal is to simultaneously infer the pedestrian’s intention from the observations and predict the human’s future positions over a lookahead time window. All positions are expressed in global coordinates.
We divide trajectory prediction into two problems corresponding to the challenges discussed in Section I. The first problem is how to effectively predict the human trajectory given the intention and the observation. The second problem is how to correctly estimate the intention online based on the recorded trajectory. We introduce the Residual Bidirectional LSTM (ReBiL) and the mutable intention filter to solve these problems simultaneously. Fig. 2 shows the architecture of the entire framework. The mutable intention filter requires a motion model to update the belief on possible intentions, while trajectory prediction utilizes the estimated intention from the filtering process. We show in Section IV that the more accurate motion modeling of ReBiL introduces less noise to intention inference, and that the intention robustly estimated by the mutable intention filter offers valuable information to trajectory prediction.
When applying LSTM to trajectory prediction, a common technique is to predict human motion using the displacement between neighboring time steps in place of global coordinates [13, 47]. This technique transforms trajectory data into a more standardized format that is easier for an LSTM to learn. We develop another standardization concept named offset. The offset is defined as the difference between the human’s trajectory and the nominal prediction at each time step. The offset is learned through residual learning, and is regarded as a physical residual that resolves the drift issue.
We assume the ground truth goal position is known. (This assumption does not hold in practice when ReBiL is integrated with the mutable intention filter.) The current position is connected with the goal position by an intention-aware linear model (iLM). The generated path is discretized based on a heuristic that uses the average magnitude of position displacement from the observation. In this situation, the lookahead time window becomes the number of remaining time steps for the pedestrian to reach the goal. The discretized positions are called the nominal prediction and reflect the fact that people attempt to reach their desired goals with minimum effort. However, the straight path is relatively trivial (see the orange path in Fig. 1), since long-term path deviation due to physical constraints and personal preferences on how to reach the goal is not considered [36, 37, 7].
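The nominal prediction described above can be illustrated with a minimal sketch. The function name `nominal_prediction`, the use of rounding, and the exact form of the heuristic (distance to goal divided by the average observed step length) are assumptions made for illustration; the text only states that the discretization uses the average displacement magnitude from the observation.

```python
import math


def nominal_prediction(observed, goal):
    """Straight-line path from the last observed position to the goal.

    observed: list of (x, y) positions with at least two points.
    goal: (x, y) goal position.
    Returns (path, remaining_steps). The path excludes the current
    position and ends exactly at the goal.
    """
    # Average step length over the observation (heuristic walking speed).
    steps = [math.dist(a, b) for a, b in zip(observed, observed[1:])]
    avg_step = sum(steps) / len(steps)

    x0, y0 = observed[-1]
    distance = math.dist((x0, y0), goal)

    # Assumed heuristic: remaining steps = distance / average step length.
    n = max(1, round(distance / avg_step)) if avg_step > 0 else 1

    # Evenly spaced points on the segment from the current position to the goal.
    path = [
        (x0 + (goal[0] - x0) * k / n, y0 + (goal[1] - y0) * k / n)
        for k in range(1, n + 1)
    ]
    return path, n


if __name__ == "__main__":
    path, n = nominal_prediction([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], (6.0, 0.0))
    print(n, path)
```

Because the last point of the path is the goal itself, any method that warps this path with an offset that vanishes at the final step ends exactly at the goal.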
We introduce a residual module as shown in Fig. 3 to take into account map information and pedestrian motion patterns in a data-driven manner. Ideally, we desire an underlying mapping from a not-very-impressive prediction (e.g., the nominal prediction) to the ground truth trajectory. Instead of learning this mapping directly, we train a residual mapping, which essentially learns the offset that warps the original prediction. The prediction from ReBiL is guaranteed to be no worse than the nominal prediction, which corresponds to an identity mapping.
In the first residual module, we concatenate the observation and the nominal prediction to form the global coordinate input. A linear layer embeds the input. A bidirectional LSTM is applied to encode the trajectory. Both the observation from the past and the goal information from the future are integrated into the hidden states using LSTM along both forward and backward directions. The hidden states are decoded to the offset output by a linear layer. (The offset output includes the offset prediction on the observation data. We found that a smooth offset is learned by treating observation and nominal prediction equally.)
A skip connection sums the input with the offset output to produce the module output, which can serve as the input to a second residual module. We can stack multiple residual modules in ReBiL, in which case the output of the last module becomes the final output, from which the portion covering the prediction period becomes the final prediction. The loss function of ReBiL is an L2 loss that measures the distance between the output and the ground truth trajectory.
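The skip connection and the loss can be sketched without any learning machinery. In the sketch below, `toy_offset` is a hypothetical stand-in for the learned network (embedding, bidirectional LSTM, and decoder), and the loss is taken as the mean Euclidean distance per time step, which is an assumption since the exact reduction is not stated here.

```python
import math


def warp(sequence, offset_fn):
    """Skip connection: add a predicted offset to every input position."""
    offsets = offset_fn(sequence)
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(sequence, offsets)]


def l2_loss(output, target):
    """Mean Euclidean distance between output and ground truth positions."""
    return sum(math.dist(p, q) for p, q in zip(output, target)) / len(target)


def zero_offset(sequence):
    # A zero offset reduces the module to the identity mapping,
    # so the output equals the nominal input.
    return [(0.0, 0.0) for _ in sequence]


def toy_offset(sequence):
    # Hypothetical stand-in for the learned network: a lateral bump
    # that vanishes at both ends of the sequence.
    n = len(sequence) - 1
    return [(0.0, math.sin(math.pi * k / n)) for k in range(n + 1)]


if __name__ == "__main__":
    nominal = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    print(warp(nominal, zero_offset))  # identical to the nominal input
    print(warp(nominal, toy_offset))   # warped in the middle, ends near the goal
```

Stacking modules corresponds to calling `warp` repeatedly, with each call taking the previous output as its input.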
As pedestrian trajectory prediction is a multimodal problem, we propose a mutable intention filter that applies particle filtering to generate multiple prediction samples with different hypotheses on pedestrian intentions. Moreover, the mutable intention filter can yield a probability distribution over potential intentions that converges to the correct intention, even if the pedestrian changes its intention during motion.
When a pedestrian is detected, particles are initialized with normalized uniform weights, and intention hypotheses are uniformly distributed among the particles. To inject randomness, each particle's goal position is randomly sampled from its goal region hypothesis, and uniform noise is added to its heuristic estimate of the remaining time steps to reach the goal. The motion model in the mutable intention filter is:
At the beginning of each filtering iteration, the available observation is divided into a part treated as the input and a part treated as the desired output, which serves as the ground truth. Each particle uses the input, its goal position, and its remaining time steps to create a corresponding prediction sample. The sample is truncated to the lookahead time window. We update the particle's weight based on the L2 distance between the truncated sample and the ground truth:
The weight update includes a hyperparameter that tunes exploration and exploitation among potential intention hypotheses. Lower deviation leads to a larger weight during the weight update step.
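As a rough illustration of a distance-based weight update, the sketch below scales each particle weight by an exponential of the negative L2 distance and then renormalizes. The exponential kernel, the mean-distance reduction, and the name `tau` are assumptions made for illustration and are not taken from the text; only the monotonic behavior (lower deviation gives larger weight) and the value 0.3 reported in the experimental setup are.

```python
import math


def update_weights(weights, samples, ground_truth, tau=0.3):
    """Reweight particles by how well their samples match the ground truth.

    weights: current particle weights.
    samples: one truncated prediction sample (list of (x, y)) per particle.
    ground_truth: observed positions over the same lookahead window.
    tau: hypothetical name for the exploration/exploitation hyperparameter.
    """
    updated = []
    for w, sample in zip(weights, samples):
        # Mean L2 distance between the sample and the ground truth.
        d = sum(math.dist(p, q) for p, q in zip(sample, ground_truth))
        d /= len(ground_truth)
        # Assumed kernel: smaller deviation gives a larger weight.
        updated.append(w * math.exp(-tau * d))
    total = sum(updated)
    return [w / total for w in updated]


if __name__ == "__main__":
    truth = [(1.0, 0.0), (2.0, 0.0)]
    good = [(1.0, 0.0), (2.0, 0.0)]
    bad = [(1.0, 2.0), (2.0, 4.0)]
    print(update_weights([0.5, 0.5], [good, bad], truth))
```

Under this assumed form, a larger `tau` concentrates weight on the best-matching hypotheses faster (exploitation), while a smaller `tau` keeps the weights closer to uniform (exploration).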
Sequential Importance Resampling (SIR) is implemented after the weight update to avoid sample degeneracy. Particles are resampled based on the updated weights. The intention hypotheses are inherited from the last generation, whereas the goal positions and remaining time steps are not. The weights of particles in the new generation are again uniform. The number of particles is fixed throughout the resampling process. New particles create prediction samples in the same way as in the intention inference process. However, with the complete observation as input, goal positions resampled from the new intention hypotheses, and reinitialized remaining time steps, the new prediction samples form the multimodal prediction of the intention filter at the current time. We also sum up the weights of particles with the same intention hypothesis to obtain the probability distribution over intentions at the current time.
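The resampling step and the per-intention probability can be sketched as follows. Multinomial resampling via `random.choices` is an assumption made for illustration, since the specific resampling scheme is not given here; the function names are hypothetical.

```python
import random
from collections import defaultdict


def intention_distribution(intentions, weights):
    """Sum normalized particle weights per intention hypothesis."""
    total = sum(weights)
    dist = defaultdict(float)
    for g, w in zip(intentions, weights):
        dist[g] += w / total
    return dict(dist)


def resample(intentions, weights, rng):
    """Draw a same-sized generation in proportion to the updated weights.

    Only the intention hypotheses are inherited; goal positions and
    remaining time steps would be re-drawn for the new particles.
    """
    n = len(intentions)
    new_intentions = rng.choices(intentions, weights=weights, k=n)
    uniform = [1.0 / n] * n  # weights are uniform again after resampling
    return new_intentions, uniform


if __name__ == "__main__":
    rng = random.Random(0)
    intentions = [0, 0, 1, 2]
    weights = [0.1, 0.2, 0.3, 0.4]
    print(intention_distribution(intentions, weights))
    print(resample(intentions, weights, rng))
```

The distribution is computed from the updated weights before they are reset, so it reflects the evidence gathered in the current iteration.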
In order to prevent premature convergence and to address intention-changing cases, a mutation mechanism inspired by genetic algorithms is introduced to the intention filter. After SIR, the inherited intention has a small probability of mutating to a different intention, which imitates the scenario in which a pedestrian changes its destination midway through the trajectory. We demonstrate in Section IV that the mutation mechanism enables the intention filter to adaptively predict pedestrian trajectories under intention-changing scenarios.
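A minimal sketch of the mutation step is given below. Choosing the new intention uniformly among the other hypotheses is an assumption, and the function name is hypothetical; the default probability of 0.01 matches the value reported in the experimental setup.

```python
import random


def mutate(intentions, intention_set, rng, p_mutation=0.01):
    """Switch each inherited intention to a different one with small probability."""
    mutated = []
    for g in intentions:
        if rng.random() < p_mutation:
            # Assumed rule: pick uniformly among the other intentions.
            alternatives = [h for h in intention_set if h != g]
            g = rng.choice(alternatives)
        mutated.append(g)
    return mutated


if __name__ == "__main__":
    rng = random.Random(0)
    # A filter that has fully converged to intention 0.
    converged = [0] * 600
    after = mutate(converged, [0, 1, 2], rng)
    print(sum(g != 0 for g in after), "particles carry alternative hypotheses")
```

Even after the filter has converged to a single intention, a few particles keep carrying alternative hypotheses, which gives the weight update something to amplify if the pedestrian changes destination.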
We present two experiments to evaluate our method. The first experiment tests the performance of ReBiL given a goal position and the remaining time. The second experiment focuses on the intention filter framework in a practical case where the goal position is sampled from the intention and the remaining time is estimated by heuristics. Both experiments use the preprocessed Edinburgh dataset, which contains 810 full-length pedestrian trajectories at a frame rate of 10 Hz, with the same start region at the bottom right and three different goal regions (Fig. 4). Pedestrian trajectories are multimodal due to different goal positions, map constraints, and personal preferences in pathways. The pedestrian trajectories are split into a training dataset (80%) and a test dataset (20%).
The embedding dimension for global coordinates is 64 in the residual module. The dimension of hidden states is 64 for each direction in the LSTM. The default number of residual modules is 1. The Adam optimizer with an initial learning rate of 0.001 is used to train ReBiL. The weight-update hyperparameter is set to 0.3 for the mutable intention filter, and the mutation probability is set to 0.01. The filtering process iterates at each time step. The default lookahead time window is 12, and the default number of particles is 600.
The prediction performance is quantified by five different metrics, where the first two serve Experiment 1, and the remaining three serve Experiment 2.
Average Offset Error (AOE): The average of the L2 distance between the predicted and ground truth trajectories at each time step of the prediction period.
Max. Offset Error (MOE): The maximum value of L2 distance between the predicted and ground truth trajectories across all time steps in the prediction period.
Max. Prob. AOE/MOE: The mean AOE and MOE of prediction samples with maximum probability intention hypothesis (MPI), which indicates how the motion prediction framework works in practice.
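The two offset errors follow directly from their definitions. The sketch below computes both for a single predicted trajectory; the function name is hypothetical.

```python
import math


def offset_errors(predicted, ground_truth):
    """Return (AOE, MOE) for one predicted trajectory.

    AOE is the mean and MOE the maximum of the per-step L2 distance
    between predicted and ground truth positions.
    """
    dists = [math.dist(p, q) for p, q in zip(predicted, ground_truth)]
    return sum(dists) / len(dists), max(dists)


if __name__ == "__main__":
    pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
    print(offset_errors(pred, truth))
```

For the Max. Prob. variants, the same function would be applied to each prediction sample carrying the maximum probability intention hypothesis and the results averaged.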
The baselines that our method will be compared against are as follows:
Linear Model (LM): A linear model that predicts displacement using the last observed displacement based on the constant velocity assumption.
LSTM: A vanilla LSTM trained to predict displacement.
Intention-aware Linear Model (iLM): A linear model that outputs the straight line connecting the last observed position and the goal position as the prediction. This also serves as the nominal method for ReBiL.
Intention-aware LSTM (iLSTM): An LSTM trained with an additional input of goal position to predict displacement.
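The simplest baseline, LM, can be written in a few lines. The sketch below repeats the last observed displacement over the prediction horizon; the function name and the `horizon` parameter are hypothetical.

```python
def constant_velocity(observed, horizon):
    """Extrapolate the last observed displacement for `horizon` steps."""
    (x1, y1), (x2, y2) = observed[-2], observed[-1]
    vx, vy = x2 - x1, y2 - y1
    return [(x2 + vx * k, y2 + vy * k) for k in range(1, horizon + 1)]


if __name__ == "__main__":
    print(constant_velocity([(0.0, 0.0), (1.0, 0.5)], 3))
```

Unlike iLM, this baseline has no knowledge of the goal, so its prediction depends entirely on the heading at the last observed step.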
The first experiment studies the properties and performance of ReBiL under deterministic conditions, given the goal position and the remaining time steps. Since trajectory prediction will be executed from when a pedestrian is detected to when it reaches the goal, we choose four representative percentages of trajectories (0%, 25%, 50%, and 75%) to split a full-length trajectory into observation and prediction in order to investigate prediction algorithms at different stages along the trajectory. (Here, 0% observation is equivalent to the position and the displacement at the first time step.) The prediction performance is presented in Table I. As the percentage of observation data increases, all prediction algorithms tend to gain more information and exhibit better performance. We observe that ReBiL surpasses all other baseline methods across all stages.
Fig. 5 visualizes predicted trajectory samples across three intentions and four stages. The prediction from ReBiL is among the closest to the ground truth trajectories across all scenarios, which demonstrates the effectiveness of learning the offset from the nominal iLM prediction with the residual module. In the first column, we see that LM and LSTM may produce results with large deviation from the ground truth, since the intention information is difficult to extract from 0% observation data, which includes only the displacement at the first time step. Thanks to the goal position input, iLSTM does not suffer from this problem like LSTM, and predicts the trajectory along the correct direction with 0% observation. However, naively feeding the goal position to iLSTM cannot mitigate error accumulation during sequential displacement prediction. In contrast to the baseline results, ReBiL overcomes the drifting limitation by warping the iLM prediction. As the final position of the nominal prediction is the ground truth goal position, we observe that ReBiL outputs a near-zero residual mapping at the last time step to keep the end of the prediction pinned to the goal position, and thus prevents drifting. Additionally, we find that the bidirectional structure works slightly better than the unidirectional counterpart with other configurations fixed. The configuration study also reveals that a deeper ReBiL does not degrade the performance by virtue of the residual structure, which is consistent with findings reported for residual networks.
The second experiment is conducted on the mutable intention filter. The ground truth remaining time steps and the goal position are unknown to the framework. We integrate ReBiL trained in Experiment 1 for each trajectory stage with the mutable intention filter. The appropriate model is selected among the models trained in different stages based on the heuristic estimate of the remaining portion of the trajectory.
Table II shows the filtering performance evaluated by three metrics when different methods are chosen for motion modeling. (Intention is required to apply heuristics to remaining time steps, so LSTM and LM are not listed in Table II.) The percentage range indicates the range of observed trajectories over which the computed filtering performance is averaged. ReBiL outperforms other baselines on almost all ranges using all metrics. This result indicates that enhanced motion modeling overall provides superior prediction capability under the mutable intention filter framework. More accurate modeling leads to lower interference in intention inference, as it relies on deviation caused by different intention hypotheses.
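Table II. Filtering performance of the mutable intention filter with different motion models. Columns: Method, Min. AOE/MOE (m), Max. Prob. AOE/MOE (m), NLL.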
In particular, Min. AOE/MOE reflects the lowest error among all prediction samples. The mutation mechanism guarantees that the correct intention hypothesis exists among the particles, so Min. AOE/MOE yields results close to those in Table I, which are obtained under ideal deterministic conditions. While Min. AOE/MOE suggests the upper limit of prediction quality, Max. Prob. AOE/MOE is more closely related to practical use, as it is common to take advantage of the filtered maximum probability intention hypothesis (MPI) to create prediction samples. Max. Prob. AOE/MOE gets closer to Min. AOE/MOE when more observations are available. This phenomenon may be a result of the mismatch between MPI and the true intention when the observations are insufficient. NLL is another metric that summarizes the extent of spread and deviation of prediction samples from the ground truth. We see that though iLSTM is the best during the 0-25% range (lower NLL is better), the NLL of iLSTM does not follow the trend of decreasing NLL as more trajectory history is recorded. Directly providing the goal position as input does not capture the physical goal position as effectively as iLM or ReBiL. Consequently, longer observation input may “distract” the iLSTM from intention information, causing the iLSTM to generate a less stable prediction with a randomly sampled goal position.
We investigate the mutable intention filter’s capability of adapting to abnormal trajectory scenarios. A small dataset of 23 intention-changing trajectories is extracted from the Edinburgh dataset, and is annotated at each time step with the perceived intention. The intention annotations are used to evaluate intention inference accuracy, and to test the responsiveness of the mutable intention filter to intention-changing scenarios. Fig. 6 illustrates the filtering process on an abnormal trajectory sample. Fig. 6b shows that the intention filter without the mutation mechanism quickly converges to one intention. As particles with alternate intention hypotheses have died out, the intention filter fails to recover from the convergence. The mismatch between the ground truth intention and MPI is maintained until the end, and the prediction yields large error during the entire mismatch period. For example, in Fig. 6a the intention filter generates prediction samples towards the bottom left, though we can see clearly from the observation history that the pedestrian is moving along the top right direction. In contrast, the mutation mechanism allows particles to mutate towards different intention hypotheses with a small probability. Mutation ensures intention diversity throughout the filtering process, which is crucial to capture intention change in various abnormal scenarios, as demonstrated in Fig. 6c and Fig. 6d. There is an inherent time delay between the annotated intention change and the inferred intention change, which is due to the inertia of particles in changing from one MPI to another. However, we see from Fig. 6d that the mutable intention filter reacts quickly after the intention change happens.
We conduct a parametric study on the influence of the number of particles and the lookahead time window on the mutable intention filter. The intention inference accuracy reported in Fig. 7 is the mean percentage of correct matches between MPI and the ground truth intention among all abnormal trajectories. We see that the framework equipped with the mutation mechanism improves the intention inference accuracy by 38%. A large number of particles is also beneficial. The performance stability of the framework with a larger number of particles is less affected by mutation. Jumps between different MPIs are usually observed with 20 particles during the beginning period (0-25%), while similar phenomena rarely happen with 200 particles. Moreover, when the mutation mechanism is not applied, a filter with a larger number of particles is less prone to premature convergence thanks to the resampling step.
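The accuracy metric described above can be sketched as follows. The data layout (one list of per-step intention distributions and one list of per-step annotations for each trajectory) and the function names are assumptions made for illustration.

```python
def most_probable_intention(distribution):
    """Return the intention with the largest probability (the MPI)."""
    return max(distribution, key=distribution.get)


def inference_accuracy(runs):
    """Mean percentage of time steps where the MPI matches the annotation.

    runs: list of (distributions, annotations) pairs, one per trajectory,
    where distributions[t] maps intention -> probability at step t.
    """
    per_trajectory = []
    for distributions, annotations in runs:
        hits = sum(
            most_probable_intention(d) == a
            for d, a in zip(distributions, annotations)
        )
        per_trajectory.append(100.0 * hits / len(annotations))
    return sum(per_trajectory) / len(per_trajectory)


if __name__ == "__main__":
    runs = [
        ([{0: 0.7, 1: 0.3}, {0: 0.4, 1: 0.6}], [0, 0]),
        ([{0: 0.2, 1: 0.8}], [1]),
    ]
    print(inference_accuracy(runs))
```

Averaging per trajectory first, as done here, weights short and long trajectories equally; averaging over all time steps pooled together would be an alternative reading of the metric.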
A longer lookahead time window works better with the mutation mechanism. The longer time horizon more effectively captures the deviation attributed to wrong intention hypotheses. Thus, faster reaction and lower inertia can be achieved. If mutation is not applied, the high inertia owing to the short lookahead time window may hinder the intention filter from reaching complete convergence, and the filter is likely to recover from a wrong MPI. Nevertheless, a greater inertia will significantly slow down the recovery process. In summary, a larger number of particles, a longer lookahead time window, and a mutation mechanism together produce the most robust predictions in the case of abnormal pedestrian behavior.
In this work, we present a Residual Bidirectional LSTM to model long-term pedestrian behavior. Inspired by residual learning, our model captures the physical intention information and human motion patterns by learning the offset to warp a nominal prediction. In addition, we propose a mutable intention filter integrated with the Residual Bidirectional LSTM to perform pedestrian intention inference. A mutation mechanism is introduced to improve the robustness of the framework in abnormal trajectory scenarios. We demonstrate that the proposed model and framework outperform baseline methods on a publicly available dataset.
While we have shown promising results in modeling and filtering experiments, several directions remain open for future investigation. First, in the present work, we only consider the case of a fixed number of intentions. To enable greater applicability of our method, an extension to an arbitrarily sized set of intentions will be studied. Second, human-human interaction is not taken into account in our current framework. In the future, we would like to explore long-term trajectory prediction in multi-pedestrian scenarios, and integrate global-scale goal-directed motion with local-scale human-human interaction within a unified framework.
Social attention: modeling attention in human crowds. In Proc. IEEE Int. Conf. Robot. Autom., pp. 1–7.