Deep Learning and Statistical Models for Time-Critical Pedestrian Behaviour Prediction

02/26/2020 ∙ by Joel Janek Dabrowski, et al. ∙ CSIRO ∙ University of Pretoria

The time it takes for a classifier to make an accurate prediction can be crucial in many behaviour recognition problems. For example, an autonomous vehicle should detect hazardous pedestrian behaviour early enough for it to take appropriate measures. In this context, we compare the switching linear dynamical system (SLDS) and a three-layered bi-directional long short-term memory (LSTM) neural network, which are applied to infer pedestrian behaviour from motion tracks. We show that, though the neural network model achieves an accuracy of 80%, it requires long sequences to achieve this (100 samples or more). The SLDS has a lower accuracy of 74%, but it achieves this result with short sequences (10 samples). To our knowledge, such a comparison on sequence length has not been considered in the literature before. The results provide key insight into the suitability of the models in time-critical problems.




1 Introduction

Many practical applications can be represented as a sequence of various behaviours. These include detecting bull and bear financial markets [24], gesture recognition [23], animal behaviour recognition [20], and aircraft manoeuvres in military applications [19]. Detection time in such applications is often crucial. When data arrives sequentially, detection time becomes a problem of how many sequential samples the model requires to make an accurate prediction.

In this study, we consider the problem of pedestrian behaviour prediction or intent estimation. A review on the prediction of pedestrian behaviour in urban scenarios is presented in [28]. A large portion of the literature has been devoted to tracking and path prediction. Many studies use the SLDS as a framework [31, 17, 18, 16]. Recently, the recurrent neural network (RNN) has been shown to be a promising approach [14, 29, 30, 4]. Owing to the significant advancement of the state-of-the-art in pedestrian detection [21, 10, 12, 3], we assume that the trajectories of the pedestrians are known in this study. Given the pedestrian trajectories, we predict a particular behavioural class.

Several studies have considered the problem of pedestrian behaviour prediction. Probabilistic models such as the latent dynamic conditional random field [32] and balanced Gaussian process dynamical models [25] have been applied. Various forms of the RNN have also been considered. Hoy et al. [13] propose a variational RNN which performs both tracking and behaviour prediction. Völz et al. [33] compare neural networks, a support vector machine, and the LSTM. Though both statistical and machine learning models have been applied to the problem and some studies consider time-to-event analyses, to our knowledge, no specific comparison between these model types in terms of time-to-detection has been considered in the literature.

Our contribution is a comparison between an SLDS and a multi-layered bi-directional LSTM neural network in the context of time-to-detection. This is performed by classifying various pedestrian behaviours from the raw motion tracks under varying sequence lengths. Through the comparison, we gain novel insight into a key difference between the models: though the neural network is more accurate than the SLDS overall, it requires 10 times as many sequential samples to achieve this accuracy. The SLDS is able to provide its most accurate classification within the first few samples of the sequence. This result is important in situations where early detection is imperative.

2 Switching Linear Dynamical System

The SLDS models a system that switches between various dynamical models. Each dynamical model is represented as a Linear Dynamic System (LDS), which is widely associated with the Kalman filter. The SLDS has been extended in various ways, such as introducing variables representing behavioural context information. Such models have been applied to maritime piracy applications [9, 8] and abalone poaching applications [6]. Linderman et al. [22] extend the SLDS by allowing the switching state to depend on the latent state and exogenous inputs through a logistic regression. The SLDS has also been extended to include multiple LDSs over several sequences under a single switching state [7, 5].

The graphical model representation of the SLDS is illustrated in Figure 1. The model comprises a switching state variable $s_t$, a hidden or latent variable $h_t$, and a visible or observable variable $v_t$ at time $t$. The latent variable $h_{1:T}$ and observable variable $v_{1:T}$ form an LDS, where the subscript $1:T$ denotes the joint random variable over all discrete time instances 1 to $T$. The switching state variable provides the means to switch between various dynamical models. The continuous dynamics of the system are represented by a linear-Gaussian state space model. The following equations describe the system [2, 27]:

$$h_t = A(s_t)\, h_{t-1} + \eta^h_t(s_t) \tag{1}$$

$$v_t = B(s_t)\, h_t + \eta^v_t(s_t) \tag{2}$$

Equation (1) describes the transition model and (2) describes the emission model. The matrix $A(s_t)$ is the state matrix and $B(s_t)$ is the measurement matrix. With the Gaussian assumption, the noise components are modelled as white noise such that $\eta^h_t(s_t) \sim \mathcal{N}(0, \Sigma^h(s_t))$ and $\eta^v_t(s_t) \sim \mathcal{N}(0, \Sigma^v(s_t))$. All the LDS model parameters are conditionally dependent on $s_t$ at time $t$. This provides the means to define different dynamic models for each switching state.

Figure 1: The graphical model of the switching linear dynamical system (SLDS).
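To make the generative structure concrete, the following is a minimal sketch that samples from a switching linear-Gaussian state space model of the kind described above. It is not the authors' implementation. The function name `simulate_slds`, the array names (`A`, `B`, `Q`, `R`, `P`, `pi0`), the zero initial latent state, and all numerical values are assumptions made for illustration only.

```python
import numpy as np

def simulate_slds(A, B, Q, R, P, pi0, T, rng):
    """Sample switching states s, latent states h and observations v.

    A[k], B[k]: state and measurement matrices of regime k (hypothetical names).
    Q[k], R[k]: transition and emission noise covariances of regime k.
    P[i, j]:    probability of switching from regime i to regime j.
    """
    K, dh = A.shape[0], A.shape[1]
    dv = B.shape[1]
    s = np.zeros(T, dtype=int)
    h = np.zeros((T, dh))
    v = np.zeros((T, dv))
    for t in range(T):
        # Discrete switching state: prior at t = 0, Markov transition afterwards.
        s[t] = rng.choice(K, p=pi0 if t == 0 else P[s[t - 1]])
        k = s[t]
        # Transition model: linear dynamics plus zero-mean Gaussian noise.
        # Assumption: the latent state before the first sample is zero.
        h_prev = h[t - 1] if t > 0 else np.zeros(dh)
        h[t] = A[k] @ h_prev + rng.multivariate_normal(np.zeros(dh), Q[k])
        # Emission model: linear measurement plus zero-mean Gaussian noise.
        v[t] = B[k] @ h[t] + rng.multivariate_normal(np.zeros(dv), R[k])
    return s, h, v

rng = np.random.default_rng(0)
# Two hypothetical regimes of a one-dimensional latent state.
A = np.array([[[0.99]], [[0.50]]])
B = np.array([[[1.0]], [[1.0]]])
Q = np.array([[[0.01]], [[0.01]]])
R = np.array([[[0.10]], [[0.10]]])
P = np.array([[0.95, 0.05], [0.05, 0.95]])
pi0 = np.array([0.5, 0.5])
s, h, v = simulate_slds(A, B, Q, R, P, pi0, T=50, rng=rng)
```

Each regime carries its own matrices and noise covariances, which is what allows a different dynamic model to be defined for each switching state.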

The joint distribution describing the SLDS is given by:

$$p(s_{1:T}, h_{1:T}, v_{1:T}) = \prod_{t=1}^{T} p(v_t \mid h_t, s_t)\, p(h_t \mid h_{t-1}, s_t)\, p(s_t \mid s_{t-1})$$

where, at $t = 1$, the factors $p(h_1 \mid h_0, s_1)$ and $p(s_1 \mid s_0)$ denote the initial distributions $p(h_1 \mid s_1)$ and $p(s_1)$. The switching state transition probability $p(s_t \mid s_{t-1})$ is a discrete distribution. It describes how the model switches between various states. The state transition distribution $p(h_t \mid h_{t-1}, s_t)$ and emission distribution $p(v_t \mid h_t, s_t)$ are assumed to be Gaussian. These describe the dynamics of the system through the linear state space equations.

Inference in the SLDS involves inferring the latent variables $s_t$ and $h_t$ given the observations $v_{1:t}$. This is typically performed using filtering and smoothing methods. The filtering operation computes the filtered posterior $p(s_t, h_t \mid v_{1:t})$. The smoothing operation computes the smoothed posterior $p(s_t, h_t \mid v_{1:T})$. Exact inference in the SLDS is intractable [2, 27]. Approximate inference algorithms such as the Generalised Pseudo Bayesian (GPB) algorithm [26] and the Gaussian Sum Smoothing (GSS) algorithm [1] have been developed for the SLDS. In this study, the GPB algorithm is used.

Parameter learning in the SLDS can be performed using the Expectation Maximisation (EM) algorithm [26]. In the expectation step, a smoothing algorithm such as GPB can be used. In the maximisation step, the parameters are estimated using maximum likelihood.
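As an illustration of approximate filtering in the SLDS, the sketch below implements a first-order GPB filter (often written GPB1): one Kalman update is run per switching state, the switching-state posterior is reweighted by the observation likelihood, and the resulting Gaussian mixture is collapsed to a single Gaussian by moment matching. The text does not state which GPB order or which initial conditions are used, so the order, the function name `gpb1_filter`, the initial mean `m0` and covariance `C0`, and the toy parameters are all assumptions.

```python
import numpy as np

def gpb1_filter(v, A, B, Q, R, P, pi0, m0, C0):
    """Return filtered switching-state probabilities, shape (T, K)."""
    K = len(pi0)
    m, C, w = m0.copy(), C0.copy(), pi0.copy()
    out = np.zeros((len(v), K))
    for t, vt in enumerate(v):
        # Predicted switching-state probabilities (prior at the first step).
        prior = w if t == 0 else w @ P
        means, covs, loglik = [], [], np.zeros(K)
        for k in range(K):
            # Kalman prediction under regime k from the collapsed Gaussian.
            mp = A[k] @ m
            Cp = A[k] @ C @ A[k].T + Q[k]
            # Kalman update with the new observation.
            e = vt - B[k] @ mp
            S = B[k] @ Cp @ B[k].T + R[k]
            G = Cp @ B[k].T @ np.linalg.inv(S)
            means.append(mp + G @ e)
            covs.append(Cp - G @ B[k] @ Cp)
            # Gaussian log-likelihood of the innovation e.
            _, logdet = np.linalg.slogdet(S)
            quad = e @ np.linalg.solve(S, e)
            loglik[k] = -0.5 * (quad + logdet + len(e) * np.log(2 * np.pi))
        # Posterior over regimes, normalised in the log domain for stability.
        logw = np.log(prior + 1e-300) + loglik
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Collapse the K-component mixture to one Gaussian (moment matching).
        m = sum(w[k] * means[k] for k in range(K))
        C = sum(w[k] * (covs[k] + np.outer(means[k] - m, means[k] - m))
                for k in range(K))
        out[t] = w
    return out

# Toy check with hypothetical parameters: a constant signal matches regime 0
# (A = 1) much better than regime 1 (A = 0.5).
A = np.array([[[1.0]], [[0.5]]])
B = np.array([[[1.0]], [[1.0]]])
Q = np.array([[[0.01]], [[0.01]]])
R = np.array([[[0.01]], [[0.01]]])
P = np.array([[0.95, 0.05], [0.05, 0.95]])
pi0 = np.array([0.5, 0.5])
obs = np.ones((20, 1))
post = gpb1_filter(obs, A, B, Q, R, P, pi0, m0=np.ones(1), C0=0.01 * np.eye(1))
```

For classification, the behaviour class at each time step can be read off as the switching state with the largest filtered probability, for example `post.argmax(axis=1)`.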

3 Multi-Layered Bidirectional LSTM

A three-layered bi-directional LSTM [11] RNN is constructed for comparison with the SLDS. The model is illustrated in Figure 2. Each LSTM layer comprises two sequences of LSTM cells: one propagating in the positive time direction and one in the negative time direction. Together, the forward and backward sequences form a bi-directional LSTM (BiLSTM). The bi-directional structure provides a means to make a prediction at time $t$ according to the full sequence $x_{1:T}$. This is in comparison to a uni-directional structure, which makes a prediction at time $t$ according to the sequence $x_{1:t}$. Three BiLSTMs are stacked to form three distinct layers. Multiple layers provide a deep structure which promotes higher level feature extraction. Input data is provided to the inputs of the first BiLSTM layer. For each sequence step, the outputs of the third BiLSTM layer are passed through a softmax layer. The softmax outputs the predicted class associated with the current input sample. For notational simplicity, this model is referred to as the RNN in the remainder of the discussion.



















Figure 2: Three-layered bi-directional LSTM architecture (Layer 1 to Layer 3). Each rectangular node denotes an LSTM cell. Round nodes denote softmax layers. The edges denote connectivity between the LSTM cells and output layer. At time $t$, pedestrian tracks are denoted by $x_t$ and the behaviour class is denoted by $y_t$.
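The architecture can be sketched as a plain NumPy forward pass: each layer runs one LSTM forwards and one backwards over the sequence and concatenates their hidden states, three such layers are stacked, and a softmax is applied at every sequence step. This is an untrained illustration under stated assumptions, not the authors' code. The function names, the stacked gate layout, the two-dimensional input, the random weights, and the random placeholder track are all hypothetical; the 32 hidden units and 4 classes follow the configuration described in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(x, W, U, b, reverse=False):
    """Run one LSTM over x of shape (T, d_in); return hidden states (T, n)."""
    T, n = x.shape[0], U.shape[0]
    h, c = np.zeros(n), np.zeros(n)
    out = np.zeros((T, n))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        z = x[t] @ W + h @ U + b          # stacked gate pre-activations (4n,)
        i = sigmoid(z[:n])                # input gate
        f = sigmoid(z[n:2 * n])           # forget gate
        o = sigmoid(z[2 * n:3 * n])       # output gate
        g = np.tanh(z[3 * n:])            # candidate cell state
        c = f * c + i * g
        h = o * np.tanh(c)
        out[t] = h
    return out

def init_lstm(d_in, n, rng):
    """Random (untrained) LSTM parameters W, U, b."""
    return (rng.normal(0, 0.1, (d_in, 4 * n)),
            rng.normal(0, 0.1, (n, 4 * n)),
            np.zeros(4 * n))

def bilstm_stack(x, layers):
    """layers: list of (forward_params, backward_params), one pair per layer."""
    for fwd, bwd in layers:
        x = np.concatenate(
            [lstm_pass(x, *fwd), lstm_pass(x, *bwd, reverse=True)], axis=1)
    return x

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_hidden, n_classes, d_in = 32, 4, 2      # d_in = 2 is an assumption
layers, d = [], d_in
for _ in range(3):
    layers.append((init_lstm(d, n_hidden, rng), init_lstm(d, n_hidden, rng)))
    d = 2 * n_hidden                      # next layer sees both directions
W_out = rng.normal(0, 0.1, (2 * n_hidden, n_classes))
track = rng.normal(size=(50, d_in))       # placeholder track, not real data
probs = softmax(bilstm_stack(track, layers) @ W_out)
```

Because the backward pass starts at the end of the sequence, the class probabilities at any step depend on the whole input, which is the property that distinguishes the bi-directional structure from a uni-directional one.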

4 Dataset

The well-known Daimler Pedestrian Path Prediction Benchmark Dataset (GCPR’13) [31] is used in this study. The dataset comprises a collection of 68 pedestrian sequences with 4 different pedestrian behaviour types: crossing, stopping, starting to walk, and bending-in. Though the dataset seems relatively small, the LSTM has been shown to perform well in the path prediction application [29].

The dataset was acquired using a stereo camera, which provides the means to produce three-dimensional Cartesian coordinates for tracking purposes. The ground truth of the dataset provides bounding boxes, disparity, and the coordinates of each target.

The dataset was developed to test recursive Bayesian filters for pedestrian path prediction for different behaviours [31]. In this study, this problem is inverted. The behaviour of a pedestrian is inferred from the tracked path.

5 Methodology

The dataset is provided with a predefined training set and a test set. The training and test sets comprise 36 and 32 sequences respectively. The model parameters are estimated using the training dataset. The trained models are then applied to predict the behaviour class from the tracks provided in the test dataset. To measure the performance of the models, accuracy, precision and recall are used.
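The three performance measures can be computed from a confusion matrix, as in the minimal sketch below. The helper names and the toy labels are hypothetical, and precision and recall are reported per class; how the values are aggregated across classes is not specified in the text, so no averaging is applied here.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def metrics(cm):
    """Overall accuracy plus per-class precision and recall."""
    tp = np.diag(cm).astype(float)
    accuracy = tp.sum() / cm.sum()
    # A class that is never predicted (or never occurs) yields NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = tp / cm.sum(axis=0)   # normalise over predicted class
        recall = tp / cm.sum(axis=1)      # normalise over true class
    return accuracy, precision, recall

# Toy labels for four classes (made-up values, not results from the study).
y_true = [0, 0, 1, 1, 2, 3]
y_pred = [0, 1, 1, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=4)
accuracy, precision, recall = metrics(cm)
```

Dividing each row of `cm` by its row sum gives a row-normalised confusion matrix whose diagonal equals the per-class recall.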

The models are tested on sequences of varying length. This is achieved by truncating the sequences in increments of 10 samples. That is, the models are tested on the first $n$ samples of each sequence in the test set, for $n = 10, 20, 30, \ldots$. The classification results are stored for each sequence length. Limiting the number of timesteps provides an indication of how well the method is able to predict a behaviour class in a short period of time. Furthermore, it provides some form of consistency over the varying sequence lengths in the dataset.
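The truncation protocol can be sketched as a simple loop over sequence lengths. The function `truncated_accuracy` and the toy classifier are hypothetical. The sketch also assumes that a sequence shorter than the current truncation length is used in full, which the text does not specify.

```python
def truncated_accuracy(sequences, labels, classify, step=10):
    """Accuracy of `classify` on the first n samples, for n = step, 2*step, ..."""
    max_len = max(len(seq) for seq in sequences)
    results = {}
    for n in range(step, max_len + step, step):
        # Assumption: sequences shorter than n are passed in full.
        preds = [classify(seq[:n]) for seq in sequences]
        correct = sum(p == y for p, y in zip(preds, labels))
        results[n] = correct / len(labels)
    return results

# Toy data: a made-up classifier that labels a sequence "B" if its mean > 0.5.
sequences = [[0] * 5 + [1] * 20, [1] * 15]
labels = ["B", "B"]
classify = lambda seq: "B" if sum(seq) / len(seq) > 0.5 else "A"
results = truncated_accuracy(sequences, labels, classify)
```

Plotting `results` against the truncation length gives an accuracy curve of the kind used to compare the two models.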

The SLDS motion model is configured as a constant acceleration model. The tracked coordinates are provided as observations to the SLDS. The model parameters are learned using the EM algorithm. The switching state is defined to comprise the states BendingIn, Crossing, Starting, and Stopping. In the dataset, the pedestrians do not switch between behaviour classes over the sequences. The switching state transition distribution is set with a fixed probability of remaining in the current switching state and a fixed probability of transitioning to one of the other three switching states. The prior switching state probability distribution is set to the uniform distribution.
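A constant acceleration model and a transition matrix of this form can be sketched as follows. The sampling period `dt`, the choice of two tracked coordinates, the value of `p_stay`, and the even split of the remaining probability over the other states are assumptions for illustration; they are not values taken from the study.

```python
import numpy as np

def constant_acceleration_model(dt):
    """State [position, velocity, acceleration] per axis, for two axes."""
    A1 = np.array([[1.0, dt, 0.5 * dt ** 2],
                   [0.0, 1.0, dt],
                   [0.0, 0.0, 1.0]])
    B1 = np.array([[1.0, 0.0, 0.0]])      # only position is observed
    A = np.kron(np.eye(2), A1)            # block-diagonal: independent axes
    B = np.kron(np.eye(2), B1)
    return A, B

def sticky_transition_matrix(n_states, p_stay):
    """Stay with probability p_stay; split the rest evenly over other states."""
    off = (1.0 - p_stay) / (n_states - 1)
    P = np.full((n_states, n_states), off)
    np.fill_diagonal(P, p_stay)
    return P

A, B = constant_acceleration_model(dt=1.0)           # dt is hypothetical
P = sticky_transition_matrix(n_states=4, p_stay=0.97)  # p_stay is hypothetical
prior = np.full(4, 0.25)                              # uniform prior over classes
```

In an SLDS each switching state would normally carry its own learned parameters, so `A` and `B` above describe only the shared structure of the motion model.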

The RNN is configured with 32 hidden units in each LSTM cell. The ADAM algorithm [15] is used to minimise the cross entropy of the softmax outputs. The model is trained over 110 epochs with a learning rate of 0.0001 and a batch size of 1. The remaining ADAM parameters are set as recommended in [15]. The RNN is trained over the complete length of each sequence in the training set.
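The optimisation step can be illustrated with a small NumPy sketch of the ADAM update applied to a softmax cross-entropy loss. The learning rate of 0.0001 follows the text and the remaining defaults are those recommended in [15]. The toy setup, in which the logits themselves are the parameters being optimised, and the function names are assumptions and do not reflect the actual network training.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def softmax_cross_entropy(logits, label):
    """Loss and gradient with respect to the logits for one sample."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    grad = p.copy()
    grad[label] -= 1.0
    return -np.log(p[label]), grad

theta = np.zeros(4)                 # logits for four classes (toy parameters)
m, v = np.zeros(4), np.zeros(4)
loss_start, _ = softmax_cross_entropy(theta, label=2)
for t in range(1, 201):
    _, grad = softmax_cross_entropy(theta, label=2)
    theta, m, v = adam_step(theta, grad, m, v, t)
loss_end, _ = softmax_cross_entropy(theta, label=2)
```

With a batch size of 1, one such update would be applied per training sequence.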

6 Results

Figure 3: Accuracy over the set of truncated sequences.

The accuracy over the set of truncated sequences is presented in Figure 3. The striking feature in the plot is that the RNN increases in accuracy with increasing sequence length, whereas the SLDS decreases in accuracy with increasing sequence length. The SLDS has the highest accuracy with a sequence of 10 samples. This implies that within the first 10 samples, the SLDS is able to classify the sequence. The RNN’s accuracy curve saturates around the 100 sample length mark. This indicates that the RNN requires a sequence of at least 100 samples to achieve its high accuracy.

These results are consistent with the theoretical design of the models. The SLDS assumes a first order Markov model in both the dynamics and the switching state. A first order Markov model assumes that the current state is conditionally dependent only on the previous state. The result is that the SLDS is not designed to model long-term dependencies in the data. The SLDS thus performs better when provided with the shorter sequences. Furthermore, the SLDS performance decreases with sequence length as it is designed to switch between dynamics. It is more likely to switch behaviour class in a longer sequence. The LSTM cell in the RNN has been specifically designed to model both long and short-term dependencies in the data [11]. The result is that the RNN requires a longer sequence to achieve a higher accuracy. Another relevant difference between the models is that the SLDS is a structured model where the dynamics have been predefined. In the RNN, the dynamics are learned in a black-box approach, which often requires more data.
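The effect of sequence length on switching can be illustrated with a short calculation. Under a first order Markov switching state, the prior probability of never leaving the initial state over $n$ samples is the stay probability raised to the power $n - 1$. The stay probability of 0.99 below is a hypothetical value chosen only to show the trend, and the calculation covers the prior alone; the filtered posterior also depends on the observations.

```python
def prob_no_switch(p_stay, n_samples):
    """Prior probability that a first order Markov chain never leaves its
    initial state over n_samples samples (n_samples - 1 transitions)."""
    return p_stay ** (n_samples - 1)

p_stay = 0.99  # hypothetical value, not taken from the study
for n in (10, 100):
    print(n, round(prob_no_switch(p_stay, n), 3))
```

With this hypothetical value the probability of no switch falls from roughly 0.91 at 10 samples to roughly 0.37 at 100 samples, which is consistent with the model being more likely to change behaviour class over a longer sequence.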

The precision and recall over the set of truncated sequences are presented in Figure 4 and Figure 5 respectively. Confusion matrices for the 10-sample-length and complete sequences are presented in Table 1. Precision is often viewed as a measure of the quality of the model. Recall describes the probability of correctly classifying the pedestrian behaviour.

As with the RNN accuracy, the RNN precision and recall values are only high for sequences with 100 samples or more. The precision and recall for the SLDS are highest for sequences of 10 samples.

The RNN generally has a higher precision and recall than the SLDS. The RNN, however, struggles to correctly predict the Starting behaviour class. Consider the confusion matrix for the complete sequence classification presented in Table 1. The majority of Starting samples are incorrectly associated with the BendingIn class. A possible reason for this is that the sequences for the Starting class are generally short in length. The poor results for the Starting class lower the overall accuracy of the RNN.

Figure 4: Precision over the set of truncated sequences.
Figure 5: Recall over the set of truncated sequences.
Table 1: Confusion matrices for the SLDS and RNN for the 10-sample-length and complete sequence predictions. The matrices are normalised over the rows to indicate a form of recall. Rows and columns follow the class order of BendingIn, Crossing, Starting, and Stopping.

The lowest recall for the SLDS model is for the BendingIn class. According to the confusion matrix, a share of the BendingIn samples were misclassified as Starting behaviour. The model performs well on the Crossing and Starting classes. For longer sequences, the precision and recall for the Stopping class decrease significantly. As also indicated in Figure 5, the recall for the Crossing and Starting classes remains fairly constant.

For the RNN with 10-sample sequences, a portion of the BendingIn samples were incorrectly associated with the Crossing class, as indicated in Table 1. When provided with the complete sequence, this portion is reduced. Similarly, most of the Starting samples are incorrectly associated with the Crossing class with short sequences. When provided with the complete sequence, the incorrect classifications shift to the BendingIn class.

A plot of the track and the class predictions for test sequence 0 is presented in Figure 6. The track of the pedestrian performing the BendingIn activity is presented in Figure 6(a). The horizontal axis represents the depth dimension with respect to the camera. The vertical axis represents the camera’s horizontal axis. Note that the time aspect of the track is not represented in this plot. The plot of the predicted switching state over time is presented in Figure 6(b). Dark grey indicates a high probability that the pedestrian behaviour belongs to a particular class. Light grey indicates a low probability that the pedestrian behaviour belongs to the particular class. Both the SLDS and the RNN associate the behaviour with the BendingIn class for the first 160 time steps. The predictions subsequently transition to the Starting class. This may be explained by the fact that the pedestrian seems to back-track, as illustrated in Figure 6(a).

(a) Pedestrian track.
(b) Behaviour prediction. (Horizontal axis: sequence samples).
Figure 6: Pedestrian track and behaviour prediction for test sequence 0. The true behaviour class is BendingIn.

Figure 7 illustrates an example of the Starting behaviour class. The SLDS successfully predicts the correct class for the entire sequence. The RNN incorrectly predicts the BendingIn class, but does associate some probability with the Starting class. This result corresponds to the complete-sequence confusion matrix presented in Table 1.

Figure 7: Behaviour prediction for test sequence 11. The true behaviour class is Starting.

Figure 8 illustrates results for the Crossing behaviour class. The track in Figure 8(a) is approximately linear over the space. With such behaviour, both models generally perform well in this class.

(a) Pedestrian track.
(b) Behaviour prediction. (Horizontal axis: sequence samples)
Figure 8: Pedestrian track and behaviour prediction for test sequence 12. The true behaviour class is Crossing.

An example of the Stopping behaviour class is presented in Figure 9. The SLDS begins by correctly classifying the Stopping behaviour class and then transitions to the Crossing class. This result corresponds to the complete sequence confusion matrix presented in Table 1. The RNN correctly classifies the Stopping class for the entire sequence. This corresponds to the high recall for this class as illustrated in Figure 5.

Figure 9: Behaviour prediction for test sequence 11. The true behaviour class is Stopping.

7 Summary and Conclusion

In this study, an SLDS and a three-layered bi-directional LSTM RNN are applied to predict pedestrian behaviour from motion tracks from the Daimler Pedestrian Path Prediction Benchmark Dataset (GCPR’13). The key result is that the RNN model’s accuracy increases with increasing sequence length, whereas the SLDS’s accuracy decreases with increasing sequence length. The best results for the SLDS are obtained when the first 10 samples of the sequence are provided to the model. This is possibly due to the SLDS being designed to model short-term behaviour as well as having a predefined model of the dynamics. The RNN is designed to model both short and long-term dynamics with a black-box approach. The result is that the RNN is more accurate, but over longer sequences (100 samples or more). This suggests that in situations where a decision is required to be made quickly, the SLDS may be the preferred model.

There is potential for improvement of the results for both models. One approach would be to include contextual information. This can be achieved in the SLDS using methods such as those described in [9, 8, 6]. Contextual information could include road signs, proximity to crossing areas, and traffic congestion levels. Additional information relating to the urban environment could also be influential. For example, a street may be residential or commercial.


  • [1] D. Barber (2006) Expectation correction for smoothed inference in switching linear dynamical systems. The Journal of Machine Learning Research 7, pp. 2515–2540. Cited by: §2.
  • [2] D. Barber (2012) Bayesian reasoning and machine learning. Cambridge University Press. External Links: ISBN 9780521518147, LCCN 2011035553 Cited by: §2, §2.
  • [3] Z. Cai, M. Saberian, and N. Vasconcelos (2015-12) Learning complexity-aware cascades for deep pedestrian detection. Cited by: §1.
  • [4] B. Cheng, X. Xu, Y. Zeng, J. Ren, and S. Jung (2018) Pedestrian trajectory prediction via the social-grid lstm model. The Journal of Engineering 2018 (16), pp. 1468–1474. External Links: Document, ISSN 2051-3305 Cited by: §1.
  • [5] J. J. Dabrowski, C. Beyers, and J. P. de Villiers (2016) Systemic banking crisis early warning systems using dynamic bayesian networks. Expert Systems with Applications 62, pp. 225 – 242. External Links: ISSN 0957-4174, Document Cited by: §2.
  • [6] J. J. Dabrowski, J. P. de Villiers, and C. Beyers (2017) Context-based behaviour modelling and classification of marine vessels in an abalone poaching situation. Engineering Applications of Artificial Intelligence 64, pp. 95 – 111. External Links: ISSN 0952-1976, Document Cited by: §2, §7.
  • [7] J. J. Dabrowski, J. P. de Villiers, and C. Beyers (2018) Naive bayes switching linear dynamical system: a model for dynamic system modelling, classification, and information fusion. Information Fusion 42, pp. 75 – 101. External Links: ISSN 1566-2535, Document Cited by: §2.
  • [8] J. J. Dabrowski and J. P. de Villiers (2015) A unified model for context-based behavioural modelling and classification. Expert Systems with Applications 42 (19), pp. 6738 – 6757. External Links: ISSN 0957-4174, Document Cited by: §2, §7.
  • [9] J. J. Dabrowski and J. P. de Villiers (2015) Maritime piracy situation modelling with dynamic bayesian networks. Information Fusion 23, pp. 116 – 130. External Links: ISSN 1566-2535, Document Cited by: §2, §7.
  • [10] X. Du, M. El-Khamy, J. Lee, and L. Davis (2017-03) Fused dnn: a deep neural network fusion approach to fast and robust pedestrian detection. pp. 953–961. External Links: Document Cited by: §1.
  • [11] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: Document, Link, Cited by: §3, §6.
  • [12] J. Hosang, M. Omran, R. Benenson, and B. Schiele (2015-06) Taking a deeper look at pedestrians. Cited by: §1.
  • [13] M. Hoy, Z. Tu, K. Dang, and J. Dauwels (2018-11) Learning to predict pedestrian intention via variational tracking networks. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3132–3137. External Links: Document, ISSN 2153-0017 Cited by: §1.
  • [14] R. Hug, S. Becker, W. Hübner, and M. Arens (2018-11) Particle-based pedestrian path prediction using lstm-mdl models. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2684–2691. External Links: Document, ISSN 2153-0017 Cited by: §1.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
  • [16] J. F. P. Kooij, G. Englebienne, and D. M. Gavrila (2016-02) Mixture of switching linear dynamics to discover behavior patterns in object tracks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 322–334. External Links: Document, ISSN 0162-8828 Cited by: §1.
  • [17] J. F. P. Kooij, N. Schneider, and D. M. Gavrila (2014-06) Analysis of pedestrian dynamics from a vehicle perspective. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 1445–1450. External Links: Document, ISSN 1931-0587 Cited by: §1.
  • [18] J. F. P. Kooij, N. Schneider, F. Flohr, and D. M. Gavrila (2014) Context-based pedestrian path prediction. In European Conference on Computer Vision, pp. 618–633. Cited by: §1.
  • [19] H. Lee, B. J. Choi, C. O. Kim, J. S. Kim, and J. E. Kim (2017) Threat evaluation of enemy air fighters via neural network-based markov chain modeling. Knowledge-Based Systems 116, pp. 49–57. Cited by: §1.
  • [20] V. Leos-Barajas, T. Photopoulou, R. Langrock, T. A. Patterson, Y. Y. Watanabe, M. Murgatroyd, and Y. P. Papastamatiou (2017) Analysis of animal accelerometer data using hidden markov models. Methods in Ecology and Evolution 8 (2), pp. 161–173. Cited by: §1.
  • [21] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan (2018-04) Scale-aware fast r-cnn for pedestrian detection. IEEE Transactions on Multimedia 20 (4), pp. 985–996. External Links: Document, ISSN 1520-9210 Cited by: §1.
  • [22] S. Linderman, M. Johnson, A. Miller, R. Adams, D. Blei, and L. Paninski (2017-20–22 Apr) Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, A. Singh and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 914–922. Cited by: §2.
  • [23] K. Liu, C. Chen, R. Jafari, and N. Kehtarnavaz (2014) Fusion of inertial and depth sensor data for robust hand gesture recognition. IEEE Sensors Journal 14 (6), pp. 1898–1903. Cited by: §1.
  • [24] J. M. Maheu, T. H. McCurdy, Y. Song, et al. (2009) Extracting bull and bear markets from stock returns. University of Toronto and CIRANO working paper. [Online] Available at https://www.economics.utoronto.ca/public/workingPapers/tecipa-369.pdf. Cited by: §1.
  • [25] R. Q. Minguez, I. P. Alonso, D. Fernandez-Llorca, and M. A. Sotelo (2018) Pedestrian path, pose, and intention prediction through gaussian process dynamical models and pedestrian activity recognition. IEEE Transactions on Intelligent Transportation Systems, pp. 1–12. External Links: Document, ISSN 1524-9050 Cited by: §1.
  • [26] K. P. Murphy (1998) Switching kalman filters. Technical report Department of Computer Science, UC Berkeley. Cited by: §2, §2.
  • [27] K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §2, §2.
  • [28] D. Ridel, E. Rehder, M. Lauer, C. Stiller, and D. Wolf (2018-11) A literature review on the prediction of pedestrian behavior in urban scenarios. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3105–3112. External Links: Document, ISSN 2153-0017 Cited by: §1.
  • [29] K. Saleh, M. Hossny, and S. Nahavandi (2018-12) Intent prediction of pedestrians via motion trajectories using stacked recurrent neural networks. IEEE Transactions on Intelligent Vehicles 3 (4), pp. 414–424. External Links: Document, ISSN 2379-8904 Cited by: §1, §4.
  • [30] K. Saleh, M. Hossny, and S. Nahavandi (2018-12) Long-term recurrent predictive model for intent prediction of pedestrians via inverse reinforcement learning. In 2018 Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8. External Links: Document Cited by: §1.
  • [31] N. Schneider and D. M. Gavrila (2013) Pedestrian path prediction with recursive bayesian filters: a comparative study. In Pattern Recognition, J. Weickert, M. Hein, and B. Schiele (Eds.), Berlin, Heidelberg, pp. 174–183. External Links: ISBN 978-3-642-40602-7 Cited by: §1, §4, §4.
  • [32] A. T. Schulz and R. Stiefelhagen (2015-06) Pedestrian intention recognition using latent-dynamic conditional random fields. In 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 622–627. External Links: Document, ISSN 1931-0587 Cited by: §1.
  • [33] B. Völz, K. Behrendt, H. Mielenz, I. Gilitschenski, R. Siegwart, and J. Nieto (2016-11) A data-driven approach for pedestrian intention estimation. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pp. 2607–2612. Cited by: §1.