The applications of pedestrian trajectory prediction cover a broad range from autonomous driving, robot navigation, smart video surveillance to object tracking. Traditionally, the task of object motion prediction is done by using a Bayesian formulation in approaches such as the Kalman filter, or nonparametric methods, such as particle filters 
. Driven by the success of recurrent neural networks (RNNs) in modeling temporal dependencies in a variety of sequence processing tasks, such as speech recognition[13, 10] and caption generation [12, 27], RNNs are increasingly utilized for object motion prediction [2, 3, 15, 16, 7]. When relying on traditional approaches, the challenge of varying dynamics over time is commonly addressed with the Interacting Multiple Model (IMM) filter 
. The IMM filter is a well established approach to elegantly combine a set of candidate models into a single context by weighting each individual model. Each model corresponds to a specific motion pattern and contributes to the final state estimation depending on its current weight. According to the IMM filter solution, in this paper an RNN-based IMM filter surrogate is presented. On the one hand, the presented RNN-based model is able to also provide a confidence value for the performed dynamic and on the other hand can overcome some limitations of the classic IMM filter. The suggested RNN-encoder-decoder model generates the probability distribution over future pedestrian paths conditioned on a dynamic class. The model is based on the work of Deo and Trivedi. For the case study of freeway traffic, they used an RNN-encoder-decoder network for vehicle maneuver and trajectory prediction. In the context of vehicle motion prediction, maneuver or rather dynamic classes can be better defined than for pedestrians. Due to the dynamic behavior of pedestrians, the maneuver classes are here defined based on the deviation of a straight walking pedestrian. The presented network extends their maneuver network with insights from the work of Becker et al. 
and furthermore infers the filtered current position. Moreover, this paper aims to highlight some relations between traditional multiple model approaches such as the IMM filter and the suggested RNN-based IMM filter surrogate. By combining the different views on maneuver predictions, this works contributes to an exploration of the connections between both problem formulation. The decoder uses the de-noised position estimate and a context vector, encoding the dynamic classes, to predict future positions. The analysis is done on synthetic data reflecting prototypical scenarios capturing pedestrians maneuvers.
In the following, a brief formalization of the problem and a description of the RNN-based model are provided. The achieved results are presented in section 3. Finally, a conclusion is given in section 4.
2 RNN-based IMM Filter Surrogate
The goal is to devise a model that can successfully predict future paths of pedestrians and represent alternating pedestrian dynamics, e.g. dynamics that can transition from a straight walking to a turning maneuver or stopping. Here, trajectory prediction is formally stated as the problem of predicting the future trajectories of a pedestrian, conditioned on its track history. Given an input sequence of consecutive observed pedestrian positions at time along a trajectory, the task is to generate a multi-modal prediction for the next positions and to filter the current position . One insight from the work Becker et al.  is that motion continuity is easier to express in offsets or velocities, because it takes considerably more modeling effort to represent all possible conditioning positions. In order to exploit scene-specific knowledge for trajectory prediction, additional use of the position information is required. When sufficient training samples from a particular scene are available, Hug et al. showed that RNN-based trajectory prediction models are able to capture spatially dependent behavior changes only from motion data. However, here the offsets are additionally used for conditioning the network . Apart from the smaller modeling effort to represent conditioned offsets, the shift to offsets helps to prevent undefined states due to a limited data range  and it is easier to make better generalizations across datasets. Since, we analyze the model capabilities on synthetic data reflecting prototypical pedestrian maneuvers for a fixed scenario, the amount of training samples is not restricted. Thus, in order to localize in the reference system position information is used to estimate the true position. The future trajectory is denoted with and the filtered position with . The model estimates the conditional distribution . In order to identify specific dynamics under desired maneuver classes (e.g. turning maneuvers, stopping and straight walking), this term can be given by:
Here, are the parameters of a
component Gaussian mixture model. By adding the maneuver context in form of the posterior mode probability,
the analogy to the classic IMM filter becomes apparent. For an IMM filter, the mode probability is used to calculate the mixing probabilities to combine the set of chosen candidate models into a merged estimate. In case of using an IMM filter, the time behavior of the basic filter set is modeled as a homogeneous (time invariant) Markov chain with a fixed transition probability matrix (TPM). Under the assumption that models describe the variation of the dynamics, the posterior density of the IMM filter can be written as follows:
is in the context of an IMM filter a Gaussian distribution and
is the posterior mode probability for the IMM filter. As mention above, the transition between different dynamics is modeled as first order Markov chain for an IMM filter. The law of total probability allows to compute new mode probabilities based on the transition probabilities. Given the current mode probabilities and transition probabilities, the weighting probabilitiesfor the mixing step of the IMM filter can be calculated. For each model and , they are calculated as with a normalization factor . Then, in the prediction stage, each filter is applied independently using the calculated mixed initial condition. Subsequently, the model probabilities are adapted according to the likelihood of each filter.
Whereas an explicit modeling of the switching behavior and the object dynamics of the IMM filter stands in contrast to an implicit dynamic encoding of an RNN-based approach. In order to provide an IMM filter surrogate, the proposed model also estimates mode probabilities and filters or rather de-noise the current position based on noisy observations . By writing the conditional distribution of the RNN-based approach in form of equation 1, the desired estimates can be inferred from the hidden states of the RNN . This formulations does not require to set the parameter of the TPM matrix manually, which is commonly done based on the mean sojourn time (the mean time an object stays in a motion type [24, 5]) or as stated in the work of Bar-Shalom , an ad-hoc approach to fill the diagonals with values close to one. For the proposed RNN-based IMM filter surrogate (RNN-IMM), the basic architecture is a recurrent encoder-decoder model. The encoder takes the frame by frame input sequence . The hidden state vector of the encoder is updated at each time step based on the previous hidden state and the current observation. The generated internal representation is used to predict mode probabilities at the current time step and . With embedding of the current observations, the encoder can be defined as follows:
Here, is the recurrent network, the hidden state of the RNN,
the multilayer perceptron, andan embedding layer.
represents the weights and biases of the MLP, EMB or respectively RNN. The final state of the encoder can be expected to encode information about the track histories. For generating a trajectory distribution over dynamic modes, the encoder hidden state is appended with a one-hot encoded vector corresponding to specific maneuvers and the filtered current position. Instead of only filtering the position, the encoder could also be used to parametrize a mixture density output layer (MDL). The decoder of the model can be defined as follows:
The decoder is used to parametrize a mixture density output layer (MDL) or rather
directly for several positions in the future. Nevertheless, the overall RNN-IMM uses the trajectory prediction and dynamic classification jointly, the loss function for training is split into three parts. Dynamic classification is trained to mimimize the sum of cross-entropy losses of the differentmotion model classes:
Additionally, the encoder is trained by minimizing the filtering loss in form of the mean squared error to the ground truth current pedestrian locations. In case the encoder should generate the parameter of a mixture of Gaussian or single Gaussian distribution, the negative log likelihood for the ground truth pedestrian locations can be minimized. Finally, the complete encoder-decoder is trained by minimizing the negative log likelihood for the ground truth future pedestrian locations conditioned under the performed maneuver class. The context vector is appended with the ground truth values of the dynamic model or maneuver classes for each training trajectory. This results in the following loss function:
The overall architecture is visualized in figure 1. The context vector combines the encoding of the track history with the encoding of the alternating dynamic classes. Together with the filtered position, it is used as input for the decoder.
3 Data Generation and Evaluation
This section consists of a brief evaluation of the proposed RNN-IMM. The evaluation is concerned with verifying the overall viability of the approach in maneuver situations. For initial results, a synthetic test condition is used in order to gain insight into the model behavior in different typical pedestrian motion types. A prototypical maneuver performed by a pedestrian, which has important implications for the field of intelligent vehicles and video surveillance is a stopping or deceleration maneuver. For the first mentioned context of intelligent vehicles, Schneider et al.  performed a comparative study on recursive Bayesian filters for pedestrian path prediction at short time horizons (below seconds). They applied different filters on typical pedestrian motion types. Although, the comparison was done on the Daimler path prediction dataset, we evaluate on synthetic data but make use of the provided real data to capture a similar condition. Firstly, the Daimler path prediction dataset provides only a maximum amount of sequences for single motion types. As mentioned before, in order to avoid problems such as a limited number of training samples and to gain some insights into a controlled setup, synthetic data is used. Secondly, the location information is biased in the dataset. Since recursive Bayesian filters make in their standard formulation no use of the spatial context of a scene, this does not harm their mutual comparison. However, RNN-based prediction networks are able to capture spatially dependent behavior changes , thus a fair comparison is difficult to achieve. The evaluation on the Daimler dataset is done in an ego-motion compensated reference system. The frame rate of the camera system inside the recording vehicle is fps and it is taken over accordingly for our experiments. The pedestrians change their behavior abruptly. Therefore, the sensible time horizons are short. Here, 8 (0.5 seconds) consecutive positions are observed, before predicting the next ( seconds), ( seconds), ( second). For generating synthetic trajectories of a basic maneuvering pedestrian, random agents are sampled from a Gaussian distribution according to a preferred pedestrian walking speed  () from the distribution of starting positions of the corresponding Daimler dataset sequences. During a single trajectory simulation the agents can perform a stopping maneuver or cross the street. Figure 3 illustrates such maneuvers with example images from the Daimler dataset . For mapping the pedestrian detections to a vehicle-motion compensated ground plane, Schneider et al. used on-board sensors for velocity and yaw rate and a stereo camera system to compute the median disparity. Due to the non-linear observation model based on a perceptive camera model, an inevitable linearized extension for the Kalman and IMM filter observation models are required. Here, the observation uncertainty of the position sensor is assumed to be Gaussian distributed in the compensated reference system. Thus, the standard formulation of the Bayesian filters are well suited for this task. For the stopping maneuver or rather the event of deceleration till standing, a mean sojourn time of second with a standard variation of 0.1 seconds is used. As long as a person moves in a straight line at a reasonably constant speed, their dynamics can be captured with a Kalman filter using a constant velocity model. During the maneuver, the relation to one fixed process model describing the dynamics fails due to an additional deceleration. Similar to Schneider et al.  or Kooij et al. , the reference IMM filter is set up by combining two basic models, in particular, the constant velocity (CV) and the constant acceleration (CA) model. For avoiding side effects due to independent motions in different directions, see for example 
, only the crossing direction, from the vehicle perspective the lateral motion is considered. Following the aforementioned explanations, the IMM-RNN is compared to an IMM filter with two motion models (CV, CA), a Kalman filter with a single CV model, a Kalman filter with a single CA model, and as baseline to a linear interpolation.
Also correspondingly to Schneider et al., the process noise is determined by , where
are spectral densities (continuous time variances) of the process noise, describing the changes in velocity or respectively in acceleration over a sampling period(CV: ; CA: , see for example ). Based on this process noise model, the optimal process noise parameters for the different chosen filters (IMM filter (CV, CA), Kalman filter CV, CA) on the Daimler dataset are for the two IMM filter models and for the single Kalman filters and . These parameters are consistent with the suggested practical setting in Bar-Shalom  and the chosen sojourn time for the simulation.
As mentioned above, a definition of maneuver classes for pedestrians is harder to establish than for vehicles. Hence, the main interest is here to detect the deviation from a standard behavior, and whether the pedestrian is in a normal mode. A set of deviation in velocity, deceleration, along with the tangential ground truth trajectory is used to assign a maneuver label to a time step of a single trajectory. Thus, the RNN-IMM and IMM filter have a similar basic dynamic model set description. As the distribution over the trajectories for the RNN-IMM is captured with a Gaussian mixture model, the maneuver description for a single model can still be multi-modal. Since the IMM filter predicts a multi-modal distribution in form of a combination of the uni-modal model specific prediction, in the presented results the RNN-IMM is set to also only predict conditioned on a single maneuver class a uni-modal Gaussian distribution.
The model has been implemented using Tensorflow  and is trained for epochs using ADAM optimizer  with a decreasing learning rate, starting from a 0.01 with a learning rate decay of 0.95 and a decay factor of
. For the experiments, the RNN variant Long Short-Term Memory (LSTM) is used.
In figure 3, predictions for two different preformed motion types are depicted for future positions weighted by the predicted maneuver probability. In the shown images the positions are normalized to start at the origin. The resulting multi-modal prediction is visualized as a heatmap. On the right, it can be seen that for a crossing sequence with straight walking the RNN-IMM mainly uses the corresponding straight walking model. On the left, where the deceleration started, the straight walking probability is visibly lower and the predicted distribution maximum is very close to the last observation. For the quantitative evaluation, noisy trajectories have been synthetically generated, where percent are used for training and percent for the comparison to the recursive Bayesian filters. The results are summarized in table 1.
|Approach||FDE [m]||[m]||FDE [m]||[m]||FDE [m]||[m]|
|IMM filter (CV, CA)||0.0674||0.0602||0.1188||0.1255||0.1862||0.1915|
|Kalman filter (CA)||0.0796||0.0638||0.1575||0.1137||0.2386||0.1696|
|Kalman filter (CV)||0.1578||0.1601||0.2890||0.2965||0.4701||0.4700|
The performance is compared with the final displacement error (FDE) (see for example ) of the lateral motion (from the vehicle perspective) for three different time horizons, in particular steps (), steps () and steps ( second). These results show that the presented RNN-IMM is able to faster capture the changing varying dynamic for the synthetic generated data. In terms of the single motion models (CV vs. CA), one can observe the benefits for the CA in capturing the deceleration. The IMM filter combines both and shows an improvement. Hence, the aim of this paper is more on highlighting the relation between traditional multiple model approaches and the suggested RNN-based IMM filter surrogate, it should be mentioned that RNN-based approaches are designed to receive input data for every time step, whereas Bayesian filters are well suited for handling missing observations. Especially with such a short initialization time, this can be crucial. One argument towards a learning based RNN-IMM is that we only choose the maneuver definition based on deviation of standard straight walking. The engineering task of finding the best model set up for IMM filters and their extensions can lead to an improved behavior (see for example Keller et al. ) in specific maneuver situations, but is also very tedious to find a good setting. It should also be mentioned that recent work like the approaches of Kooij et al.  show options how to further improve the prediction performance by including scene context and using more cues than pedestrian point kinematics (e.g. head orientation, gaze, body tilt, articulated body information.
However, the presented RNN-IMM is able to also provide a confidence value for the performed dynamic, but to avoid modeling the dynamic transitions with a fixed transition probability matrix . Similar to the provided mode probabilities of IMM filters, this can be used for further processing steps or rather applications (see for example [25, 8]). Further, instead of choosing the basic filter set, the prediction model is learned. In case there exists some well known model for describing the standard dynamic of the desired target, only deviations from the known dynamic can be used to define additional maneuver classes. This study on synthetically generated data shows, that by exploiting the connections between different views on maneuver prediction some perspectives on overcoming respective limitations can be gained.
In this paper, an RNN-encoder-decoder model, which can be interpreted as an IMM filter surrogate, has been presented. The RNN-IMM is able to jointly predict specific dynamic probabilities and corresponding distributions of future pedestrian trajectory. The model capabilities were shown on synthetic data that where reflecting typical pedestrian maneuvers. By conditioning on specific dynamic models or rather deviation of standard behavior, the model makes it possible to generate additional information in terms of an assigned maneuver probability similar to an IMM filter, but reduces the amount of explicit modeling of filter parameters (e.g. the dynamic transitions matrix). Thus, the presented RNN-IMM helps to reduce the amount of hard-coded engineering of traditional multiple model filter such as the IMM filter.
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015),https://www.tensorflow.org/, software available from tensorflow.org
Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social LSTM: Human trajectory prediction in crowded spaces. In: Conference on Computer Vision and Pattern Recognition (CVPR). pp. 961–971. IEEE (2016)
-  Alahi, A., Ramanathan, V., Goel, K., Robicquet, A., Sadeghian, A., Fei-Fei, L., Savarese, S.: Learning to predict human behaviour in crowded scenes. In: Group and Crowd Behavior for Computer Vision. Elsevier (2017)
-  Arulampalam, M., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. Transactions on Signal Processing 50(2), 174–188 (2002). https://doi.org/10.1109/78.978374
-  Bar-Shalom, Y., Kirubarajan, T., Li, X.R.: Estimation with Applications to Tracking and Navigation. John Wiley & Sons, Inc., New York, NY, USA (2002)
-  Becker, S., Hübner, W., Arens, M.: State estimation for tracking in image space with a de- and re-coupled imm filter. Multimedia Tools and Applications 77(15), 20207–20226 (2018). https://doi.org/10.1007/s11042-017-5324-3
-  Becker, S., Hug, R., Hübner, W., Arens, M.: Red: A simple but effective baseline predictor for the trajnet benchmark. In: European Conference on Computer Vision (ECCV) Workshops. Springer International Publishing (2018)
-  Becker, S., Münch, D., Kieritz, H., Hübner, W., Arens, M.: Detecting abandoned objects using interacting multiple models. In: Proc. SPIE Volume 9652 Optics and Photonics for Counterterrorism, Crime Fighting, and Defence (2015)
-  Blom, H., Bar-Shalom, Y.: The interacting multiple model algorithm for systems with markovian switching coefficients. Transactions on Automatic Control 33(8), 780–783 (1988). https://doi.org/10.1109/9.1299
-  Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A., Bengio, Y.: A recurrent latent variable model for sequential data. In: Advances in Neural Information Processing Systems (NIPS) (2015)
-  Deo, N., Trivedi, M.M.: Multi-modal trajectory prediction of surrounding vehicles with maneuver based lstms. In: Intelligent Vehicles Symposium (IV). pp. 1179–1184. IEEE (2018). https://doi.org/10.1109/IVS.2018.8500493
-  Donahue, J., Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
-  Graves, A., Mohamed, A., Hinton, G.: Speech recognition with deep recurrent neural networks. In: International Conference on Acoustics, Speech and Signal Processing. pp. 6645–6649 (2013). https://doi.org/10.1109/ICASSP.2013.6638947
-  Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.19188.8.131.525
-  Hug, R., Becker, S., Hübner, W., Arens, M.: On the reliability of lstm-mdl models for predicting pedestrian trajectories. In: International Workshop on Representations, Analysis and Recognition of Shape and Motion from Imaging Data (RFMI). Savoie, France (2017)
-  Hug, R., Becker, S., Hübner, W., Arens, M.: Particle-based pedestrian path prediction using lstm-mdl models. In: 21st International Conference on Intelligent Transportation Systems (ITSC). pp. 2684–2691 (2018). https://doi.org/10.1109/ITSC.2018.8569478
-  Kalman, R.E.: A new approach to linear filtering and prediction problems. ASME Journal of Basic Engineering 82 (1960)
-  Keller, C.G., Hermes, C., Gavrila, D.M.: Will the pedestrian cross? probabilistic path prediction based on learned motion features. In: Mester, R., Felsberg, M. (eds.) Pattern Recognition. pp. 386–395. Springer Berlin Heidelberg (2011)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. International Conference for Learning Representations (ICLR) (2015)
-  Kooij, J.F.P., Flohr, F., Pool, E.A.I., Gavrila, D.M.: Context-based path prediction for targets with switching dynamics. International Journal of Computer Vision (Jul 2018). https://doi.org/10.1007/s11263-018-1104-4
-  Kooij, J.F.P., Schneider, N., Flohr, F., Gavrila, D.M.: Context-based pedestrian path prediction. In: European Conference on Computer Vision (ECCV). pp. 618–633. Springer International Publishing (2014). https://doi.org/10.1007/978-3-319-10599-4_40
-  Pellegrini, S., Ess, A., Schindler, K., van Gool, L.: You’ll never walk alone: Modeling social behavior for multi-target tracking. In: International Conference on Computer Vision (ICCV). pp. 261–268. IEEE (2009). https://doi.org/10.1109/ICCV.2009.5459260
-  Särkkä, S.: Bayesian Filtering and Smoothing. Institute of Mathematical Statistics Textbooks, Cambridge University Press (2013). https://doi.org/10.1017/CBO9781139344203
-  Schneider, N., Gavrila, D.M.: Pedestrian path prediction with recursive bayesian filters: A comparative study. In: German Conference on Pattern Recognition (GCPR). pp. 174–183. Springer Berlin Heidelberg (2013), https://doi.org/10.1007/978-3-642-40602-7_18
-  Stierlin, S., Dietmayer, K.: Scale change and ttc filter for longitudinal vehicle control based on monocular video. In: Intelligent Transportation Systems (ITSC), 2012 15th International IEEE Conference on. pp. 528–533 (2012). https://doi.org/10.1109/ITSC.2012.6338681
-  Teknom, K.: Microscopic Pedestrian Flow Characteristics: Development of an Image Processing Data Collection and Simulation Model. Ph.D. thesis, Tohoku University (2002)
-  Xu, K., Ba, J., Kiros, R., K.Cho, Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning (ICML). vol. 37, pp. 2048–2057. PMLR, Lille, France (2015)