1 Introduction
Many practical applications can be represented as a sequence of various behaviours. These include detecting bull and bear financial markets [24], gesture recognition [23], animal behaviour recognition [20], and aircraft manoeuvres in military applications [19]. Detection time in such applications is often crucial. When data arrives sequentially, detection time becomes a problem of how many sequential samples the model requires to make an accurate prediction.
In this study, we consider the problem of pedestrian behaviour prediction or intent estimation. A review on the prediction of pedestrian behaviour in urban scenarios is presented in [28]. A large portion of the literature has been devoted to tracking and path prediction. Many studies use the switching linear dynamical system (SLDS) as a framework [31, 17, 18, 16]. Recently, the recurrent neural network (RNN) has been shown to be a promising approach [14, 29, 30, 4]. Owing to the significant advancement of the state-of-the-art in pedestrian detection [21, 10, 12, 3], we assume that the trajectories of the pedestrians are known in this study. Given the pedestrian trajectories, we predict a particular behavioural class.
Several studies have considered the problem of pedestrian behaviour prediction. Probabilistic models such as the latent-dynamic conditional random field [32] and balanced Gaussian process dynamical models [25] have been applied. Various forms of the RNN have also been considered. Hoy et al. [13] propose a variational RNN which performs both tracking and behaviour prediction. Völz et al. [33] compare neural networks, a support vector machine, and the long short-term memory (LSTM) network. Though both statistical and machine learning models have been applied to the problem and some studies consider time-to-event analyses, to our knowledge, no specific comparison between these model types in terms of time-to-detection has been reported in the literature.
Our contribution is a comparison between a SLDS and a multi-layered bidirectional LSTM neural network in the context of time-to-detection. This is performed by classifying various pedestrian behaviours from the raw motion tracks under varying sequence lengths. Through the comparison, we gain novel insight into a key difference between the models: though the neural network is more accurate than the SLDS overall, it requires 10 times as many sequential samples to achieve this accuracy. The SLDS is able to provide its most accurate classification within the first few samples of the sequence. This result is important in situations where early detection is imperative.
2 Switching Linear Dynamical System
The SLDS models a system that switches between various dynamical models. Each dynamical model is represented as a Linear Dynamic System (LDS), which is widely associated with the Kalman filter. The SLDS has been extended in various ways, such as introducing variables representing behavioural context information. Such models have been applied to maritime piracy applications [9, 8] and abalone poaching applications [6]. Linderman et al. [22] extend the SLDS by allowing the switching state to depend on the latent state and exogenous inputs through a logistic regression. The SLDS has also been extended to include multiple LDSs over several sequences under a single switching state
[7, 5].
The graphical model representation of the SLDS is illustrated in Figure 1. The model comprises a switching state variable $s_t$, a hidden or latent variable $h_t$, and a visible or observable variable $v_t$ at time $t$. The latent variable $h_{1:T}$ and observable variable $v_{1:T}$ form a LDS, where the subscript $1:T$ denotes the joint random variable over all discrete time instances 1 to $T$. The switching state variable $s_t$ provides the means to switch between various dynamical models. The continuous dynamics of the system are represented by a linear-Gaussian state space model. The following equations describe the system [2, 27]:
$$h_t = A(s_t)\, h_{t-1} + \eta_t^h(s_t) \qquad (1)$$
$$v_t = B(s_t)\, h_t + \eta_t^v(s_t) \qquad (2)$$
Equation (1) describes the transition model and (2) describes the emission model. The matrix $A(s_t)$ is the state matrix and $B(s_t)$ is the measurement matrix. With the Gaussian assumption, the noise components are modelled as white noise such that $\eta_t^h(s_t) \sim \mathcal{N}(0, \Sigma^h(s_t))$ and $\eta_t^v(s_t) \sim \mathcal{N}(0, \Sigma^v(s_t))$. All the LDS model parameters are conditionally dependent on $s_t$ at time $t$. This provides the means to define different dynamic models for each switching state.
The joint distribution describing the SLDS is given by:
$$p(s_{1:T}, h_{1:T}, v_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1})\, p(h_t \mid h_{t-1}, s_t)\, p(v_t \mid h_t, s_t) \qquad (3)$$
with the factors at $t = 1$ understood as the initial (prior) distributions. The switching state transition probability $p(s_t \mid s_{t-1})$ is a discrete distribution. It describes how the model switches between various states. The state transition distribution $p(h_t \mid h_{t-1}, s_t)$ and emission distribution $p(v_t \mid h_t, s_t)$ are assumed to be Gaussian. These describe the dynamics of the system through the linear state space equations.
Inference in the SLDS involves inferring the latent variables $s_{1:T}$ and $h_{1:T}$ given the observations $v_{1:T}$. This is typically performed using filtering and smoothing methods. The filtering operation computes the filtered posterior $p(s_t, h_t \mid v_{1:t})$. The smoothing operation computes the smoothed posterior $p(s_t, h_t \mid v_{1:T})$. Exact inference in the SLDS is intractable [2, 27]. Approximate inference algorithms such as the Generalised Pseudo Bayesian (GPB) algorithm [26] and the Gaussian Sum Smoothing (GSS) algorithm [1] have been developed for the SLDS. In this study, the GPB algorithm is used.
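As an illustration of the filtering step, the following is a minimal sketch of a first-order GPB-style filter (often called GPB1), written in Python with NumPy. It is not the implementation used in this study: the order of the GPB approximation is not stated here, and the function name `gpb1_filter`, its argument layout, and the toy two-regime scalar system in the usage example are assumptions made purely for illustration. At each time step, one Kalman update is run per switching state, the switching state posterior is updated from the resulting likelihoods, and the Gaussian mixture is collapsed back to a single Gaussian by moment matching.

```python
import numpy as np

def gpb1_filter(obs, A, B, Q, R, trans, prior, mu0, P0):
    """Approximate the filtered switching-state posterior for each time step.

    obs:    (T, d_v) array of observations
    A, B:   per-state state and measurement matrices, (S, d_h, d_h) and (S, d_v, d_h)
    Q, R:   per-state process and measurement noise covariances
    trans:  (S, S) matrix, trans[i, j] = p(s_t = j | s_{t-1} = i)
    prior:  (S,) switching-state probabilities before the first observation
    mu0, P0: mean and covariance of the latent state before the first observation
    """
    n_states = len(A)
    w = np.asarray(prior, dtype=float)
    mu = np.asarray(mu0, dtype=float)
    P = np.asarray(P0, dtype=float)
    posteriors = []
    for v in obs:
        pred_w = w @ trans                      # predicted switching-state probabilities
        log_lik = np.empty(n_states)
        mus, Ps = [], []
        for j in range(n_states):
            # Kalman prediction and update under the dynamics of state j
            m_pred = A[j] @ mu
            P_pred = A[j] @ P @ A[j].T + Q[j]
            resid = v - B[j] @ m_pred
            S_cov = B[j] @ P_pred @ B[j].T + R[j]
            K = P_pred @ B[j].T @ np.linalg.inv(S_cov)
            mus.append(m_pred + K @ resid)
            Ps.append(P_pred - K @ B[j] @ P_pred)
            # Gaussian log-likelihood of the observation under state j
            _, logdet = np.linalg.slogdet(2.0 * np.pi * S_cov)
            log_lik[j] = -0.5 * (logdet + resid @ np.linalg.solve(S_cov, resid))
        # Update the switching-state posterior (normalised in log space)
        log_w = np.log(pred_w + 1e-300) + log_lik
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Collapse the mixture of Gaussians to one Gaussian by moment matching
        mu = sum(w[j] * mus[j] for j in range(n_states))
        P = sum(w[j] * (Ps[j] + np.outer(mus[j] - mu, mus[j] - mu))
                for j in range(n_states))
        posteriors.append(w.copy())
    return np.array(posteriors)

# Toy usage with hypothetical values: regime 0 decays, regime 1 decays with a sign flip.
A = np.array([[[0.9]], [[-0.9]]])
B = np.array([[[1.0]], [[1.0]]])
Q = np.array([[[0.01]], [[0.01]]])
R = np.array([[[0.01]], [[0.01]]])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
obs = np.array([[(-0.9) ** t] for t in range(1, 11)])  # noise-free data from regime 1
post = gpb1_filter(obs, A, B, Q, R, trans, prior=[0.5, 0.5], mu0=[1.0], P0=[[0.1]])
```

In a classification setting such as the one considered here, the most probable switching state in each row of `post` would serve as the predicted behaviour class at that time step.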
Parameter learning in the SLDS can be performed using the Expectation Maximisation (EM) algorithm [26]. In the expectation step, a smoothing algorithm such as GPB can be used. In the maximisation step, the parameters are estimated using maximum likelihood.
3 Multi-Layered Bidirectional LSTM
A three-layered bidirectional LSTM [11] RNN is constructed for comparison with the SLDS. The model is illustrated in Figure 2. Each LSTM layer comprises two sequences of LSTM cells: one propagating in the positive time direction and one in the negative time direction. Together, the forward and backward sequences form a bidirectional LSTM (BiLSTM). The bidirectional structure provides a means to make a prediction at time $t$ according to the full input sequence over times $1$ to $T$. This is in comparison to a unidirectional structure, which makes a prediction at time $t$ according to the inputs over times $1$ to $t$ only. Three BiLSTMs are stacked to form three distinct layers. Multiple layers provide a deep structure which promotes higher level feature extraction. Input data is provided to the inputs of the first BiLSTM layer. For each sequence step, the outputs of the third BiLSTM layer are passed through a softmax layer. The softmax outputs the predicted class associated with the current input sample. For notational simplicity, this model is referred to as the RNN in the remainder of the discussion.
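The data flow through the stacked architecture can be sketched in plain NumPy as follows. This is only an illustrative forward pass with random, untrained weights, not the model used in this study. All function names are hypothetical, and the two-dimensional input, the weight initialisation scale and the choice of 32 hidden units and 4 output classes are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_lstm(rng, input_dim, hidden):
    """Random input weights, recurrent weights and biases for the four LSTM gates."""
    return (rng.normal(0, 0.1, (input_dim, 4 * hidden)),
            rng.normal(0, 0.1, (hidden, 4 * hidden)),
            np.zeros(4 * hidden))

def lstm_pass(x, params, reverse=False):
    """Run one LSTM over x of shape (T, D); return hidden states of shape (T, H)."""
    W, U, b = params
    H = U.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    out = np.zeros((x.shape[0], H))
    steps = range(x.shape[0] - 1, -1, -1) if reverse else range(x.shape[0])
    for t in steps:
        z = x[t] @ W + h @ U + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        g = np.tanh(z[3 * H:])
        c = f * c + i * g          # cell state carries the longer-term memory
        h = o * np.tanh(c)
        out[t] = h
    return out

def bilstm_layer(x, fwd, bwd):
    """Concatenate a forward-in-time pass and a backward-in-time pass."""
    return np.concatenate([lstm_pass(x, fwd), lstm_pass(x, bwd, reverse=True)], axis=1)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def build_model(rng, input_dim=2, hidden=32, layers=3, classes=4):
    params, dim = [], input_dim
    for _ in range(layers):
        params.append((init_lstm(rng, dim, hidden), init_lstm(rng, dim, hidden)))
        dim = 2 * hidden           # the next layer sees both directions
    return params, rng.normal(0, 0.1, (dim, classes))

def predict(x, model):
    """Return one class distribution per sequence step, shape (T, classes)."""
    layer_params, W_out = model
    for fwd, bwd in layer_params:
        x = bilstm_layer(x, fwd, bwd)
    return softmax(x @ W_out)

rng = np.random.default_rng(0)
model = build_model(rng)
track = rng.normal(size=(50, 2))   # stand-in for a 50-sample, 2-coordinate track
probs = predict(track, model)
```

Because each layer concatenates a forward and a backward pass, the output at any step depends on the whole sequence supplied to the model, whereas the forward pass alone at step $t$ depends only on the samples up to $t$.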
4 Dataset
The well-known Daimler Pedestrian Path Prediction Benchmark Dataset (GCPR’13) [31] is used in this study. The dataset comprises a collection of 68 pedestrian sequences with 4 different pedestrian behaviour types: crossing, stopping, starting to walk, and bending-in. Though the dataset seems relatively small, the LSTM has been shown to perform well in the path prediction application [29].
The dataset was acquired using stereo cameras. The stereo camera provides the means to produce three dimensional Cartesian coordinates for tracking purposes. The ground truth of the dataset provides bounding boxes, disparity, and and coordinates of each target.
The dataset was developed to test recursive Bayesian filters for pedestrian path prediction for different behaviours [31]. In this study, this problem is inverted. The behaviour of a pedestrian is inferred from the tracked path.
5 Methodology
The dataset is provided with a predefined training set and a test set. The training and test sets comprise 36 and 32 sequences respectively. The model parameters are estimated using the training dataset. The trained models are then applied to predict the behaviour class from the tracks provided in the test dataset. To measure the performance of the models, accuracy, precision and recall are used.
The models are tested on sequences of varying length. This is achieved by truncating the sequences in increments of 10 samples. That is, the models are tested on the first $n$ samples of each sequence in the test set, for $n = 10, 20, 30, \dots$. The classification results are stored for each sequence length. Limiting the number of time steps provides an indication of how well the method is able to predict a behaviour class in a short period of time. Furthermore, it provides some form of consistency over the varying sequence lengths in the dataset.
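The truncation procedure can be sketched as below. This is a minimal illustration rather than the evaluation code of this study: the function name `truncated_accuracy` and the toy classifier are hypothetical, and the sketch assumes that a sequence shorter than the current truncation length is simply used in full.

```python
def truncated_accuracy(sequences, labels, classify, step=10):
    """Accuracy of `classify` on the first n samples of each sequence, n = step, 2*step, ..."""
    max_len = max(len(seq) for seq in sequences)
    results = {}
    for n in range(step, max_len + step, step):
        # seq[:n] keeps the whole sequence when it has fewer than n samples (assumption)
        correct = sum(classify(seq[:n]) == label
                      for seq, label in zip(sequences, labels))
        results[n] = correct / len(sequences)
    return results

# Toy usage with a hypothetical stand-in classifier based on net displacement.
def toy_classifier(seq):
    return "moving" if abs(seq[-1] - seq[0]) > 1.0 else "still"

sequences = [[0.1 * t for t in range(30)], [0.0] * 25]
labels = ["moving", "still"]
acc = truncated_accuracy(sequences, labels, toy_classifier)
# acc == {10: 0.5, 20: 1.0, 30: 1.0}
```

In this toy case the slow-moving track is only recognised once 20 samples are available, which is the kind of length dependence the truncated evaluation is designed to expose.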
The SLDS motion model is configured as a constant acceleration model. The tracked coordinates are provided as observations to the SLDS. The model parameters are learned using the EM algorithm. The switching state is defined to comprise the states BendingIn, Crossing, Starting, and Stopping. In the dataset, the pedestrians do not switch between behaviour classes over the sequences. The switching state transition distribution is set with a probability of remaining in the current switching state and a
probability of transitioning to one of the other three switching states. The prior switching state probability distribution is set to the uniform distribution.
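A sketch of how such a configuration might be assembled is given below. The structure (a constant acceleration model observing only position, a switching state that mostly stays where it is, and a uniform prior) follows the description above, but the sampling interval `dt` and the stay probability `p_stay` are placeholder values chosen for illustration, and the function names are hypothetical. In the SLDS, each switching state would hold its own set of such parameters, with the values learned by EM.

```python
import numpy as np

def constant_acceleration_model(dt, n_coords=2):
    """State matrix A and measurement matrix B for a constant acceleration model.

    The latent state stacks [position, velocity, acceleration] for each coordinate;
    only the positions are observed.
    """
    block = np.array([[1.0, dt, 0.5 * dt ** 2],
                      [0.0, 1.0, dt],
                      [0.0, 0.0, 1.0]])
    A = np.kron(np.eye(n_coords), block)
    B = np.kron(np.eye(n_coords), np.array([[1.0, 0.0, 0.0]]))
    return A, B

def sticky_transition_matrix(n_states, p_stay):
    """Stay with probability p_stay; split the remainder equally over the other states."""
    off = (1.0 - p_stay) / (n_states - 1)
    return np.full((n_states, n_states), off) + (p_stay - off) * np.eye(n_states)

states = ["BendingIn", "Crossing", "Starting", "Stopping"]
A, B = constant_acceleration_model(dt=0.1)                  # dt is a placeholder value
trans = sticky_transition_matrix(len(states), p_stay=0.9)   # p_stay is a placeholder value
prior = np.full(len(states), 1.0 / len(states))             # uniform prior over classes
```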
The RNN is configured with 32 hidden units in each LSTM cell. The ADAM algorithm [15] is used to minimise the cross entropy of the softmax outputs. The model is trained over 110 epochs with a learning rate of 0.0001 and a batch size of 1. The remaining ADAM parameters are set as recommended in [15]. The RNN is trained over the complete length of each sequence in the test set.
6 Results
The accuracy over the set of truncated sequences is presented in Figure 3. The striking feature in the plot is that the RNN increases in accuracy with increasing sequence length, whereas the SLDS decreases in accuracy with increasing sequence length. The SLDS has the highest accuracy with a sequence of 10 samples. This implies that within the first 10 samples, the SLDS is able to classify the sequence. The RNN’s accuracy curve saturates around the 100-sample mark. This indicates that the RNN requires a sequence of at least 100 samples to achieve its high accuracy.
These results are consistent with the theoretical design of the models. The SLDS assumes a first-order Markov model in both the dynamics and the switching state. A first-order Markov model assumes that the current state is conditionally dependent only on the previous state. The result is that the SLDS is not designed to model long-term dependencies in the data. The SLDS thus performs better when provided with the shorter sequences. Furthermore, the SLDS performance decreases with sequence length as it is designed to switch between dynamics: it is more likely to switch behaviour class in a longer sequence. The LSTM cell in the RNN has been specifically designed to model both long and short-term dependencies in the data [11]. The result is that the RNN requires a longer sequence to achieve a higher accuracy. Another relevant difference between the models is that the SLDS is a structured model where the dynamics have been predefined. In the RNN, the dynamics are learned in a black-box approach, which often requires more data.
The precision and recall over the set of truncated sequences are presented in Figure 4 and Figure 5 respectively. Confusion matrices for the 10-sample-length and complete sequences are presented in Table 1. Precision, the fraction of sequences assigned to a class that truly belong to it, is often viewed as a measure of the quality of the model. Recall is the fraction of sequences of a class that are correctly classified.
As with the RNN accuracy, the precision and recall values are only high for sequences with 100 samples or more. The precision and recall for the SLDS are highest for sequences of 10 samples.
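For reference, the three measures can be computed from a confusion matrix as sketched below. The function name is hypothetical, the per-class (rather than averaged) form is an assumption of this sketch, and the counts in the usage example are made up for illustration; they are not results from this study.

```python
def classification_metrics(confusion):
    """Accuracy plus per-class precision and recall.

    confusion[i][j] is the number of sequences of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[k][k] for k in range(n)) / total
    precision, recall = [], []
    for k in range(n):
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision.append(confusion[k][k] / predicted_k if predicted_k else 0.0)
        recall.append(confusion[k][k] / actual_k if actual_k else 0.0)
    return accuracy, precision, recall

# Made-up two-class example (not data from this study).
accuracy, precision, recall = classification_metrics([[8, 2], [1, 9]])
# accuracy = 17/20, precision = [8/9, 9/11], recall = [8/10, 9/10]
```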
The RNN generally has a higher precision and recall than the SLDS. The RNN, however, struggles to correctly predict the Starting behaviour class. Consider the confusion matrix for the complete sequence classification presented in Table 1. The majority of Starting samples are incorrectly associated with the BendingIn class. A possible reason for this is that the sequences for the Starting class are generally short in length. The poor results for the Starting class lower the overall accuracy of the RNN.
Table 1: Confusion matrices for the SLDS and the RNN, for 10-sample sequences and complete sequences.
The lowest recall for the SLDS model is for the BendingIn class, with a value of . Considering the confusion matrix, the of the samples were misclassified as Starting behaviour. The model performs well on the Crossing and Starting classes. For longer sequences, the precision and recall for the Stopping class decrease significantly. As also indicated in Figure 5, the recall for the Crossing and Starting classes remains fairly constant.
For the RNN with 10-sample sequences, of the BendingIn samples were incorrectly associated with the Crossing class as indicated in Table 1. When provided with the complete sequence, this reduces to . Similarly, most of the Starting samples are incorrectly associated with the Crossing class with short sequences. When provided with the complete sequence, the incorrect classifications shift to the BendingIn class.
A plot of the track and the class predictions for test sequence 0 is presented in Figure 6. The track of the pedestrian performing the BendingIn activity is presented in Figure 6(a). The horizontal axis represents the depth dimension with respect to the camera. The vertical axis represents the camera’s horizontal axis. Note that the time aspect of the track is not represented in this plot. The plot of the predicted switching state over time is presented in Figure 6(b). Dark grey indicates a high probability that the pedestrian behaviour belongs to a particular class. Light grey indicates a low probability that the pedestrian behaviour belongs to the particular class. Both the SLDS and the RNN associate the behaviour with the BendingIn class for the first 160 time steps. The predictions subsequently transition to the Starting class. This may be explained by the fact that the pedestrian seems to backtrack, as illustrated in Figure 6(a).
Figure 7 illustrates an example of the Starting behaviour class. The SLDS successfully predicts the correct class for the entire sequence. The RNN incorrectly predicts the BendingIn class, but does associate some probability with the Starting class. This result corresponds to the complete-sequence confusion matrix presented in Table 1.
Figure 8 illustrates results for the Crossing behaviour class. The track in Figure 8(b) is approximately linear over the space. With such behaviour, both models generally perform well in this class.
An example of the Stopping behaviour class is presented in Figure 9. The SLDS correctly begins by classifying the stopping behaviour class and then transitions to the crossing class. This result corresponds to the complete sequence confusion matrix presented in Table 1. The RNN correctly classifies the stopping class for the entire sequence. This corresponds to the high recall for this class as illustrated in Figure 5.
7 Summary and Conclusion
In this study a SLDS and a three-layered bidirectional LSTM RNN are applied to predict pedestrian behaviour from motion tracks from the Daimler Pedestrian Path Prediction Benchmark Dataset (GCPR’13). The key result is that the RNN model’s accuracy increases with increasing sequence length, whereas the SLDS’s accuracy decreases with increasing sequence length. The best results for the SLDS are obtained when the first 10 samples of the sequence are provided to the model. This is possibly due to the SLDS being designed to model short-term behaviour as well as having a predefined model of the dynamics. The RNN is designed to model both short and long-term dynamics with a black-box approach. The result is that the RNN is more accurate, but over longer sequences (100 samples or more). This suggests that in situations where a decision is required to be made quickly, the SLDS may be the preferred model.
There is potential for improvement of the results for both models. One approach would be to include contextual information. This can be achieved in the SLDS using methods such as those described in [9, 8, 6]. Contextual information could include road signs, proximity to crossing areas, and traffic congestion levels. Additional information relating to the urban environment could also be influential. For example, a street may be residential or commercial.
References
[1] (2006) Expectation correction for smoothed inference in switching linear dynamical systems. The Journal of Machine Learning Research 7, pp. 2515–2540.
[2] (2012) Bayesian reasoning and machine learning. Cambridge University Press. ISBN 9780521518147, LCCN 2011035553.
[3] (2015-12) Learning complexity-aware cascades for deep pedestrian detection. In The IEEE International Conference on Computer Vision (ICCV).
[4] (2018) Pedestrian trajectory prediction via the social-grid LSTM model. The Journal of Engineering 2018 (16), pp. 1468–1474. ISSN 2051-3305.
[5] (2016) Systemic banking crisis early warning systems using dynamic Bayesian networks. Expert Systems with Applications 62, pp. 225–242. ISSN 0957-4174.
[6] (2017) Context-based behaviour modelling and classification of marine vessels in an abalone poaching situation. Engineering Applications of Artificial Intelligence 64, pp. 95–111. ISSN 0952-1976.
[7] (2018) Naive Bayes switching linear dynamical system: a model for dynamic system modelling, classification, and information fusion. Information Fusion 42, pp. 75–101. ISSN 1566-2535.
[8] (2015) A unified model for context-based behavioural modelling and classification. Expert Systems with Applications 42 (19), pp. 6738–6757. ISSN 0957-4174.
[9] (2015) Maritime piracy situation modelling with dynamic Bayesian networks. Information Fusion 23, pp. 116–130. ISSN 1566-2535.
[10] (2017-03) Fused DNN: a deep neural network fusion approach to fast and robust pedestrian detection. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 953–961.
[11] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[12] (2015-06) Taking a deeper look at pedestrians. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] (2018-11) Learning to predict pedestrian intention via variational tracking networks. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3132–3137. ISSN 2153-0017.
[14] (2018-11) Particle-based pedestrian path prediction using LSTM-MDL models. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2684–2691. ISSN 2153-0017.
[15] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[16] (2016-02) Mixture of switching linear dynamics to discover behavior patterns in object tracks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 322–334. ISSN 0162-8828.
[17] (2014-06) Analysis of pedestrian dynamics from a vehicle perspective. In 2014 IEEE Intelligent Vehicles Symposium Proceedings, pp. 1445–1450. ISSN 1931-0587.
[18] (2014) Context-based pedestrian path prediction. In European Conference on Computer Vision, pp. 618–633.
[19] (2017) Threat evaluation of enemy air fighters via neural network-based Markov chain modeling. Knowledge-Based Systems 116, pp. 49–57.
[20] (2017) Analysis of animal accelerometer data using hidden Markov models. Methods in Ecology and Evolution 8 (2), pp. 161–173.
[21] (2018-04) Scale-aware Fast R-CNN for pedestrian detection. IEEE Transactions on Multimedia 20 (4), pp. 985–996. ISSN 1520-9210.
[22] (2017, 20–22 Apr) Bayesian learning and inference in recurrent switching linear dynamical systems. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, A. Singh and J. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, Fort Lauderdale, FL, USA, pp. 914–922.
[23] (2014) Fusion of inertial and depth sensor data for robust hand gesture recognition. IEEE Sensors Journal 14 (6), pp. 1898–1903.
[24] (2009) Extracting bull and bear markets from stock returns. University of Toronto and CIRANO working paper. Available at https://www.economics.utoronto.ca/public/workingPapers/tecipa369.pdf.
[25] (2018) Pedestrian path, pose, and intention prediction through Gaussian process dynamical models and pedestrian activity recognition. IEEE Transactions on Intelligent Transportation Systems, pp. 1–12. ISSN 1524-9050.
[26] (1998) Switching Kalman filters. Technical report, Department of Computer Science, UC Berkeley.
[27] (2012) Machine learning: a probabilistic perspective. MIT Press.
[28] (2018-11) A literature review on the prediction of pedestrian behavior in urban scenarios. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 3105–3112. ISSN 2153-0017.
[29] (2018-12) Intent prediction of pedestrians via motion trajectories using stacked recurrent neural networks. IEEE Transactions on Intelligent Vehicles 3 (4), pp. 414–424. ISSN 2379-8904.
[30] (2018-12) Long-term recurrent predictive model for intent prediction of pedestrians via inverse reinforcement learning. In 2018 Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8.
[31] (2013) Pedestrian path prediction with recursive Bayesian filters: a comparative study. In Pattern Recognition, J. Weickert, M. Hein, and B. Schiele (Eds.), Berlin, Heidelberg, pp. 174–183. ISBN 9783642406027.
[32] (2015-06) Pedestrian intention recognition using latent-dynamic conditional random fields. In 2015 IEEE Intelligent Vehicles Symposium (IV), pp. 622–627. ISSN 1931-0587.
[33] (2016-11) A data-driven approach for pedestrian intention estimation. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pp. 2607–2612.