FuSSI-Net: Fusion of Spatio-temporal Skeletons for Intention Prediction Network

by   Francesco Piccoli, et al.

Pedestrian intention recognition is essential for developing robust and safe autonomous driving (AD) and advanced driver assistance systems (ADAS) functionalities for urban driving. In this work, we develop an end-to-end pedestrian intention framework that performs well in day- and night-time scenarios. Our framework relies on object detection bounding boxes combined with skeletal features of human pose. We study early, late, and combined (early and late) fusion mechanisms to exploit the skeletal features and reduce false positives as well as to improve the intention prediction performance. The early fusion mechanism results in an AP of 0.89 and precision/recall of 0.79/0.89 for pedestrian intention classification. Furthermore, we propose three new metrics to properly evaluate pedestrian intention systems. Under these new evaluation metrics, the proposed end-to-end network offers accurate pedestrian intention prediction up to half a second ahead of the actual risky maneuver.



I Introduction

Active safety functionalities for autonomous driving (AD) and advanced driver assistance systems (ADAS) in urban scenarios rely heavily on smart detection of the ego-vehicle’s environmental conditions to enhance people’s safety [11]. While the intentions of other visible surrounding vehicles on the road can still be predicted through indicator/blinker signals, accurate detection and prediction of pedestrian and bicyclist intentions remains a challenge at road cross sections [9, 11, 10]. Pedestrian intention prediction refers to automatically estimating the positions and intentions of pedestrians in the following few seconds, with the goal of evaluating the individual risk posed to the ego-vehicle [10]. The information regarding the relative position of every other road user in future time frames with respect to the ego-vehicle is essential for lowering the false positive rates of collision avoidance alerts and systems. An example of detected pedestrians that pose a risk to the ego-vehicle is shown in Fig. 1. A smart collision avoidance system that detects whether or not a pedestrian intends to cross can be significantly useful for anticipating the potential risk posed to the ego-vehicle by pedestrians.

Fig. 1: Examples of pedestrian intention prediction with respect to the ego-vehicle. Red bounding boxes predicted over a time sequence represent pedestrians that pose risk to the vehicle.

With the surge in object detection and tracking algorithms over the past few years, several works have been directed towards designing modules for pedestrian intention and path prediction. In [11], the head orientation of pedestrians and their corresponding motion are detected by zooming into the regions corresponding to the head and legs, followed by extraction of oriented gradient and local binary pattern features to classify whether a pedestrian is crossing or not, using support vector machines or convolutional neural networks (CNNs). The work in [9] implements a destination prediction network that is trained using a CNN and a long short-term memory (LSTM) model, followed by a topology and planning network that utilizes environmental features. While this work significantly differs from other object detection-based methods, it relies on locally curated datasets for performance analysis.

The work in [3] takes a different approach to pedestrian and bicyclist detection than standard CNN modules. Here, a 9-point skeleton is fitted to each pedestrian, followed by crossing vs. not-crossing classification. This work has shown state-of-the-art performance, and hence we benchmark this method in this paper. Other work in [5] focuses only on tracking, so it cannot perform intention prediction.

In spite of the existing works so far, there continues to be a need for an accurate end-to-end pedestrian intention prediction system that can utilize predicted future locations of pedestrians for vehicle ego-motion and path planning. There is also a need to leverage the advantages of various object detection systems and to develop models that perform well on both daytime and night-time videos. To address these needs, we jointly explore spatio-temporal and skeletal fitting methods in a fused system, comparing early, late, and combined fusion models to improve overall pedestrian intention prediction on a public dataset.

In this work, we present such an end-to-end system that is capable of predicting risky intentions of pedestrians up to 16 frames ahead of the actual action, which corresponds to half a second before the risky maneuver. The main contributions of this work are:

  • We present a novel pedestrian intention prediction system that relies on fusion of the recent state-of-the-art methodologies in [10] and [3], wherein pedestrian features detected using bounding boxes (BBs) are combined with pedestrian skeletal features to significantly reduce false positives (see Fig. 2).

  • We describe novel metrics to analyze the accuracy of an end-to-end pedestrian intention prediction/detection system. We present three metrics that capture the accuracy of predicting (i) risky crossing behavior of pedestrians up to 16 frames before the motion actually begins, (ii) that the motion will continue up to 16 frames into the future, and (iii) risky pedestrian motion in up to 16 subsequent frames.

  • We evaluate each module of the proposed system with respect to existing works, specifically to analyze model generalizability with respect to input data. In addition, we annotate specific frames from 50 videos from a public dataset [4] to retrain the skeletal fitting module (detailed explanation in the Supplementary Materials). These annotations are shared for future benchmarking methods (code and demo available at: https://matthew29tang.github.io/pid-model/#/integrated/).

Fig. 2: Example of proposed fusion system that combines bounding box and skeletal fitting algorithms for pedestrian intention classification.

II Materials and Methods

We first describe the mathematical framework followed by explaining the methods and data.

II-A Mathematical Framework

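Let $X = \{x_i\}_{i=1}^{N}$ be the sequence of observations per pedestrian and $Y = \{y_i\}_{i=1}^{N}$ be the corresponding intention labels (crossing or not-crossing). The proposed framework is able to learn the latent intention given a new set of observations $X^{*}$, and infer the labels $Y^{*}$. The likelihood function of the model parameterized by $\theta$ is given by

$$p(Y \mid X; \theta) = \prod_{i=1}^{N} p(y_i \mid x_i; \theta).$$

The negative log likelihood $\mathcal{L}(\theta)$, or loss function, for the $N$ training samples can be represented as

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p(y_i \mid x_i; \theta).$$

The optimal parameters for the learned model can be found as $\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta)$.

Now, for an unseen new observation $x^{*}$ from the test set, the most probable label $y^{*}$ will be the one that maximizes the trained model under the optimized learned parameters as

$$y^{*} = \arg\max_{y} p(y \mid x^{*}; \theta^{*}).$$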

In this work we replicate the works in [10] and [3] to benchmark and compare the test performances with the early, late, and combination fusion models.

II-B Object Detection-based Methods

The object detection-based pedestrian intention framework consists of an object detector followed by an object tracker and a DenseNet classifier. We now describe the modules for pedestrian feature extraction, followed by online tracking methods and skeletal fitting algorithms.

II-B1 Object Detection Module

The YOLOv3 (You Only Look Once) algorithm [8] detects 2D bounding boxes around objects of interest (pedestrians). The anchor boxes corresponding to pedestrians with probability greater than 0.5 are returned as bounding boxes (BBs).
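The thresholding step described above can be sketched as follows. This is a minimal illustration, not the detector itself: the `Detection` container, the `"person"` label string, and the function name are assumptions introduced here, and only the 0.5 threshold comes from the text.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """Hypothetical container for one detector output."""
    label: str
    score: float
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def pedestrian_boxes(detections: List[Detection],
                     threshold: float = 0.5) -> List[Tuple[float, float, float, float]]:
    """Keep boxes labelled as pedestrians whose score is strictly above the threshold."""
    return [d.box for d in detections
            if d.label == "person" and d.score > threshold]


detections = [
    Detection("person", 0.91, (10, 20, 50, 120)),
    Detection("person", 0.42, (200, 30, 230, 110)),  # below threshold, dropped
    Detection("car", 0.88, (300, 60, 420, 140)),     # not a pedestrian, dropped
]
kept = pedestrian_boxes(detections)
```

Only the first detection survives the filter in this example; the tracker in the next module would then receive `kept` for each frame.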

II-B2 Online Tracking Module

Once BBs are detected around pedestrians in each image frame, the next step involves tracking each pedestrian across frames with a unique object ID. For this module, we implement two types of online tracking algorithms. The first, the Simple Online and Realtime Tracking (SORT) algorithm used in [10], has the shortcoming that it does not handle occlusions or pedestrians re-entering a video sequence. Thus, a second tracking algorithm, DeepSORT [12], is implemented, in which pedestrians’ appearance information is used to improve the performance of SORT. This algorithm generates features for person re-identification that can then be compared with the visual appearance of the pedestrians inside the detected bounding boxes to decrease identity switches [9].

II-B3 DenseNet Module

The next component in our system is a classifier that determines whether a pedestrian will cross the street or not. We implement a 121-layer spatio-temporal DenseNet model [10]; the composite system architecture is shown in Fig. 3.

Fig. 3: The benchmark pedestrian intention prediction architecture. The SORT/DeepSORT modules are interchangeable.

The DenseNet in [10] is composed of three dense blocks, where each block comprises four pairs of convolutions. The dense blocks are separated by transition blocks that perform batch normalization, convolution, and average pooling. All of the model layers are interconnected, which means that the input of layer $l$ is the combination of the output from layer $l-1$ and the outputs from each of the previous layers. These connections significantly reduce the number of training parameters because the network can preserve and reuse information from prior layers. For the training process, the DenseNet takes as input a sequence of 16 frames prior to the action of crossing (if applicable), and produces as output two probability scores, for crossing and not-crossing, for the entire 16-frame sequence. The choice of 16 frames stems from the intent to yield predictions about 0.5 seconds prior to the actual action [10], since camera frames are acquired at 30 fps. The integrated system in Fig. 3 predicts intention frame by frame, utilizing a sliding window technique that interpolates frames when the number of frames prior to the crossing action is below the minimum requirement (i.e., 16).
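As a quick check of the timing, 16 frames at 30 fps span 16/30 ≈ 0.53 s. The sliding window itself can be sketched as below. The source does not specify the interpolation scheme, so the nearest-index resampling used for short tracks is an assumption, as are the function names.

```python
def resample_indices(n_available, window=16):
    """Map `window` output positions onto `n_available` source frames
    by nearest-index resampling (an assumed interpolation scheme)."""
    if n_available < 1:
        raise ValueError("need at least one frame")
    if n_available == 1:
        return [0] * window
    return [round(i * (n_available - 1) / (window - 1)) for i in range(window)]


def sliding_windows(frames, window=16):
    """Return all length-`window` windows with stride 1.
    Tracks shorter than `window` yield one resampled window."""
    if len(frames) < window:
        return [[frames[i] for i in resample_indices(len(frames), window)]]
    return [frames[s:s + window] for s in range(len(frames) - window + 1)]


seconds_ahead = 16 / 30                       # about 0.53 s at 30 fps
long_track = sliding_windows(list(range(20)))  # 5 windows of 16 frames
short_track = sliding_windows(list(range(4)))  # 1 window, frames repeated
```

Each window would then be passed to the classifier, giving one prediction per window end frame.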

II-C Skeletal Fitting-based Methods

Based on [3], skeletal fitting models are further applied to bounded pedestrian sub-images to eliminate false positive detections as follows.

II-C1 Skeletal-fitting Module

The pedestrian BBs from the object detector are used to crop pedestrian sub-images out of the complete image frames. Next, a skeleton fitting algorithm takes the cropped images as input and fits a skeleton to the pedestrian. The skeleton can contain up to 17 keypoints [1]. Out of these 17 keypoints, 9 are most significant for pedestrian classification in [3]. These keypoints consist of the left and right shoulder, hip, knee, and ankle, as well as a point between the left and right shoulders, as shown in Fig. 4(a). From these 9 keypoints, 396 features based on angles and distances between the skeleton points can be computed for further processing. Examples of the skeleton-fitting process on a sequence of 16 frames are shown in Fig. 4(b).

(a) Skeletal Fitting Model
(b) Skeletons on 16 frame sequence.
Fig. 4: Skeletal fitting on image sequences.
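To make the angle-and-distance features concrete, the sketch below computes one plausible feature set from 9 keypoints. The exact decomposition used in [3] is not given in this text, so the choice here is an assumption: four values per keypoint pair (horizontal, vertical, and Euclidean distance, plus the segment angle) and three interior angles per keypoint triangle, which happens to give 36 × 4 + 84 × 3 = 396 values.

```python
import itertools
import math


def skeleton_features(keypoints):
    """Compute distance and angle features from 9 (x, y) keypoints.

    Assumed decomposition (not confirmed by the source):
      - per pair (36 pairs): |dx|, |dy|, Euclidean distance, segment angle
      - per triangle (84 triangles): the three interior angles
    """
    if len(keypoints) != 9:
        raise ValueError("expected 9 keypoints")
    feats = []
    for (x1, y1), (x2, y2) in itertools.combinations(keypoints, 2):
        dx, dy = x2 - x1, y2 - y1
        feats += [abs(dx), abs(dy), math.hypot(dx, dy), math.atan2(dy, dx)]
    for tri in itertools.combinations(keypoints, 3):
        for a in range(3):
            p, q, r = tri[a], tri[(a + 1) % 3], tri[(a + 2) % 3]
            ux, uy = q[0] - p[0], q[1] - p[1]
            vx, vy = r[0] - p[0], r[1] - p[1]
            norm = math.hypot(ux, uy) * math.hypot(vx, vy)
            cos_a = (ux * vx + uy * vy) / norm if norm > 0 else 1.0
            feats.append(math.acos(max(-1.0, min(1.0, cos_a))))
    return feats


# Hypothetical keypoint coordinates in crop pixels.
example = [(0, 0), (1, 0), (2, 1), (0, 2), (3, 3), (1, 4), (4, 1), (2, 5), (5, 2)]
features = skeleton_features(example)
```

In practice the distances would likely be normalized by the pedestrian's size so that the features do not depend on the distance to the camera; that step is omitted here.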

II-C2 Random Forest (RF) Module

As an alternative to the DenseNet model, an RF classifier is implemented for crossing vs. not-crossing intent classification. Here, a sliding window method is implemented to extract skeletal features per pedestrian across subsequent frames as in [3]. Thus, features are concatenated per pedestrian, followed by RF classification; the input window advances by one frame at a time, so that the intent is predicted at the end of each succession of 14 frames.

II-C3 Recurrent Neural Network (RNN) Module

Additionally, an RNN model is implemented in place of the RF classifier for intention classification. For this implementation, the frames that do not contain any data are padded with -1, and longer sequences are further divided into shorter sequences of at most 45 frames. Further, the target data is modified so that the classifier predicts whether the pedestrian crosses in the following 14 frames. The input data is normalized to ensure model convergence. The best performing RNN model is a bidirectional LSTM model with one input layer and two hidden layers with 16 memory units each, followed by a dense output layer with 1 memory unit. The dense output layer uses the sigmoid activation function, and the model minimizes (1 - classification accuracy) as the loss function with the Adam optimizer.
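The data preparation described above can be sketched as follows. The function names are hypothetical, and padding the final chunk up to the full 45 frames is an assumption made here so that all chunks have equal length; the text only states that empty frames are filled with -1 and that sequences are split at 45 frames.

```python
PAD_VALUE = -1.0
MAX_LEN = 45
HORIZON = 14


def split_and_pad(sequence, n_features, max_len=MAX_LEN, pad_value=PAD_VALUE):
    """Split a per-frame feature sequence into chunks of at most `max_len` frames.
    Frames without data (None) become rows of `pad_value`; the last chunk is
    padded to `max_len` (an assumption, for fixed-size batches)."""
    chunks = []
    for start in range(0, len(sequence), max_len):
        chunk = [list(row) if row is not None else [pad_value] * n_features
                 for row in sequence[start:start + max_len]]
        while len(chunk) < max_len:
            chunk.append([pad_value] * n_features)
        chunks.append(chunk)
    return chunks


def future_crossing_labels(crossing_flags, horizon=HORIZON):
    """Label frame t with 1 if the pedestrian crosses in any of the next `horizon` frames."""
    return [int(any(crossing_flags[t + 1:t + 1 + horizon]))
            for t in range(len(crossing_flags))]


sequence = [[0.1, 0.2]] * 50
sequence[3] = None  # a frame with no skeleton data
chunks = split_and_pad(sequence, n_features=2)
labels = future_crossing_labels([0] * 20 + [1] * 5)
```

Here a 50-frame track becomes two 45-frame chunks, and the labels switch to 1 at the first frame from which a crossing frame lies within the 14-frame horizon.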

II-D Fusion Network Model

While the bounding box and densenet systems in [10] and skeletal fitting models in [3] have been analyzed for pedestrian intention detection and prediction, over-detections for crossing action or false positives remain an open problem [11]. In this work, we fuse the spatial features from BBs with the skeletal features in an early, late and combined fusion setting, with the aim to minimize intention classification false positive errors as shown in Fig. 5.

Fig. 5: Proposed fusion system architecture that combines bounding box and skeletal fitting algorithms as early fusion for intention classification.

Here, early fusion consists of superimposing the fitted skeletons on the bounding box regions per pedestrian per image plane, which are then tracked and classified for intention. Late fusion consists of feeding pre-computed features corresponding to the fitted skeleton to the last layer of the DenseNet model as additional features. Thus, the 396 features per frame are accumulated over the frames of a sequence, and the resulting features are combined at the last DenseNet layer. Combined fusion refers to the combination of the early and late fusion setups. We analyze all three setups with respect to the replicated baseline methods in [10] and [3] to assess the improvement in the overall intention prediction system.
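The difference between the two fusion points can be illustrated with a toy sketch. Both functions are simplifications under stated assumptions: `early_fusion` only marks keypoint pixels on a grayscale crop stored as nested lists (the described system superimposes full skeletons on image crops), and `late_fusion` simply concatenates a hypothetical classifier feature vector with the per-frame skeletal features.

```python
def early_fusion(crop, keypoints, value=255):
    """Return a copy of a grayscale crop with keypoints drawn as single pixels.
    Keypoints are (x, y) in crop coordinates; out-of-bounds points are skipped."""
    fused = [row[:] for row in crop]
    height, width = len(fused), len(fused[0])
    for x, y in keypoints:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            fused[row][col] = value
    return fused


def late_fusion(cnn_features, skeleton_features_per_frame):
    """Concatenate image features with skeletal features from every frame."""
    fused = list(cnn_features)
    for frame_features in skeleton_features_per_frame:
        fused.extend(frame_features)
    return fused


crop = [[0] * 4 for _ in range(3)]
overlay = early_fusion(crop, [(1, 2), (9, 9)])
vector = late_fusion([0.5, 0.7], [[1.0, 2.0], [3.0, 4.0]])
```

In the early setting the classifier sees the skeleton as part of its image input, whereas in the late setting the skeletal features bypass the convolutional layers entirely.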

II-E Data

To enable robust system design, large volumes of annotated pedestrian videos are needed that fulfill the following conditions:

  • variations in traffic conditions (urban, parking lots, etc.);

  • variations in lighting/weather conditions;

  • benchmarkable performances;

  • public availability;

  • metadata included with annotations (e.g., annotations typically include bounding box coordinates, frame number, and crossing vs. non-crossing label per pedestrian);

  • additional metadata such as pedestrian age, gaze, posture, etc., for pedestrian risk post-processing evaluations.

The following datasets are used in this work for training, analysis and benchmarking modular and overall system performances.

II-E1 Joint Attention for Autonomous Driving dataset (JAAD)

The JAAD [4] is the only publicly available dataset that fulfills all the aforementioned data requirements. This dataset has been analyzed extensively using bounding box and skeletal fitting models from [10, 11, 3] thus enabling performance benchmarking. In this work, we consider the first 250 videos for model training, while the remaining video sequences from 251 to 346 are considered for the test dataset. All the benchmarking evaluation is carried out on this dataset.

II-E2 Common Objects in Context (COCO) Dataset

Although JAAD includes a variety of pedestrian metadata with regard to appearance, it does not contain skeletal keypoint features, which necessitates the use of the COCO dataset [1]. We use keypoints corresponding to a pedestrian’s body to train 9-point skeletal fitting models. The COCO dataset (with 118,000 training images and 5,000 test images) is used to generate night-time equivalents using the GAN model in [2], and the night-time images are then used to retrain the skeleton fitting model in [3]. This dataset includes images annotated for object detection, keypoint detection, and semantic segmentation. Next, we manually annotate the first 50 videos in JAAD based on the COCO keypoints using the COCO annotator [1].

II-E3 Multiple Object Tracking (MOT)

The MOT challenge is a collection of datasets containing pedestrian annotations. In particular, one of the object trackers in the proposed system is benchmarked on the MOT16 dataset [6].

One limitation of the aforementioned datasets is the lack of night-time videos/sequences. To improve the pedestrian detection/tracking performance for night-time sequences as well, we implemented the generative adversarial network (GAN) in [2] to generate night-time equivalents of daytime videos. This data augmentation method enhanced pedestrian detection rates in poor lighting conditions.

III Experiments and Results

Each module of the proposed fusion model is analyzed individually and then in combination to assess their improvements over benchmarked methods. We perform three major experiments in this work. First, we analyze the performances of the object detection, skeletal fitting and classification modules separately with respect to existing benchmarks. Second, we analyze the impact of early, late and combined fusion on intention classification. Third, we analyze the performance of the end-to-end systems with respect to three novel metrics. All modules/models are trained and tested on the same split of the JAAD dataset: the first 250 videos are used for training, and videos from 251 to 346 are used for testing.

The metrics used to assess detection/prediction performances are average precision (AP), precision (indicative of false positive rate), recall (indicative of false negative rate) and accuracy (acc) as described in [7].
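For reference, precision, recall, and accuracy for the binary crossing vs. not-crossing task can be computed from label lists as in the sketch below. The function name and the example labels are illustrative assumptions; average precision additionally requires ranked confidence scores and is not shown.

```python
def classification_metrics(y_true, y_pred):
    """Precision, recall, and accuracy for binary labels (1 = crossing)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # drops as false positives grow
    recall = tp / (tp + fn) if tp + fn else 0.0     # drops as false negatives grow
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
precision, recall, accuracy = classification_metrics(y_true, y_pred)
```

With one false positive and one false negative among eight samples, all three values come out to 0.75 in this example.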

III-A Benchmarking Modular Performances

III-A1 Object Detection/Tracking Performance

We analyze the importance of object detection using the MOT metrics defined in [6]. In Table I, the MOT accuracy (MOTA) of our implementation of SORT with annotated ground truth (GT) is compared against SORT with YOLOv3 detections and the VGG16 setup in [11].

Method                    Overall  TUD-Campus  ETH-Sunnyday  ETH-Pedcross2  ADL-Rundle-8  Venice-2  KITTI-17
SORT with GT (Ours)       34.0     36.8        28.6          33.8           29.8          35.4      45.3
SORT with YOLOv3 (Ours)   -        49.4        26.1          38.9           29.0          36.2      36.9
SORT with VGG16 [11]      34.0     62.7        59.1          45.4           28.6          18.6      60.2
TABLE I: Percentage MOTA on Various Sequences.

We find that the implementation of SORT with VGG16 in [11] outperforms the other implementations on four of the six reported video sequences. However, the overall MOTA is the same (34.0) for our implementation with GT and the existing benchmark.

III-A2 Skeletal Fitting Performance

To benchmark the skeletal fitting module, the first 50 JAAD video sequences are manually annotated with 17 keypoints per pedestrian as described in the supplementary material. The benchmarks are analyzed in Table II on a test set that contains 70 sequences of 16 frames each from JAAD. Three metrics are analyzed here: the ratio of found sequences to the total number of sequences, which evaluates the number of sequences out of the 70 test sequences in which at least one frame is detected and fitted with a skeleton of at least 4 keypoints; the ratio of found skeletons to the total number of frames, which evaluates the number of skeletons that are found and fitted out of the total frames; and the ratio of found skeletons in found sequences, which evaluates the number of skeletons that are found in relation to (found sequences) × 16 frames.

Metric                                Benchmark [3]  Retrained on   Retrained on  Retrained on
                                                     whole images   COCOGAN       cropped images
Found sequences / total sequences     75.71          30             17.14         100
Found skeletons / total frames        29.11          6.88           4.64          82.50
Found skeletons in found sequences    38.44          22.92          27.08         82.50
TABLE II: Percentage Performance analysis of Skeletal Fitting.
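The three ratios in Table II are related through the test-set size of 70 sequences × 16 frames. The sketch below shows the arithmetic; the function name is hypothetical, and the counts of 53 found sequences and 326 found skeletons are not reported in the text but are back-computed here as an assumption that reproduces the benchmark column.

```python
def skeletal_fitting_ratios(found_sequences, found_skeletons,
                            total_sequences=70, frames_per_sequence=16):
    """Return the three Table II percentages from raw counts."""
    total_frames = total_sequences * frames_per_sequence
    frames_in_found = found_sequences * frames_per_sequence
    return (
        100.0 * found_sequences / total_sequences,  # found sequences / total sequences
        100.0 * found_skeletons / total_frames,     # found skeletons / total frames
        100.0 * found_skeletons / frames_in_found,  # found skeletons in found sequences
    )


# Assumed counts (back-computed, not from the source).
ratios = tuple(round(r, 2) for r in skeletal_fitting_ratios(53, 326))
```

When every sequence is found, the second and third ratios coincide, which is consistent with the two equal 82.50 entries in the last column.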

In Table II, we observe a significant improvement in skeletal fitting by retraining on cropped images as shown in Fig. 6(a) and Fig. 6(b), respectively.

(a) Initial Skeletal Fitting
(b) Skeletal fitting retrained on cropped images.
Fig. 6: Improvement in skeletal fitting by retraining on JAAD cropped images.

III-A3 Intention Classification Performance

The pedestrian intention classification performance is analyzed in Table III. Here, we are particularly interested in predicting whether a pedestrian will cross the road or not a few frames before the instant when the action actually begins. For this purpose, we consider a 16-frame interval (around 0.5 seconds in the JAAD dataset) before the frame in which the pedestrian starts crossing according to GT. We observe that the early fusion model results in an increase of about 7 percentage points in recall and AP over the existing benchmark. The RF and RNN classifiers are evaluated on a subset of the data when compared to the DenseNet model, as explained in the supplementary material.

Classifier        Features used for 16 frames   AP   Precision  Recall
DenseNet [10]     Cropped BBs of Pedestrians    82   74         82
DenseNet (Ours)   Early fusion model            89   79         89
RF (Ours)         Skeleton features             81   70         84
RNN (Ours)        Skeleton features             75   73         79
TABLE III: Percentage Performance of Classification.

III-B Fusion Network Model Performance

The impact of the early, late, and combination fusion models is analyzed in Table IV. Here, the training data are JAAD videos 1-250 and the test set comprises videos 251-346. From Table IV we observe that early fusion is the best approach for the overall system. The primary advantage of early fusion is that it enables the DenseNet to extract pedestrian-specific features from the skeletons superimposed on BBs, which leads to fewer false positives and higher accuracy.

Metric  Benchmark [10]  Late Fusion  Early Fusion  Combination
Acc     67.5            53.6         75.6          39.0
Loss    0.96            1.94         0.93          1.28
AP      82.8            54.2         89.0          48.5
TABLE IV: Percentage Performance of Fusion Models.

III-C End-to-end System Performance Analysis

Finally, we analyze the performance of the proposed end-to-end system, which includes the object detector, tracker, and intention classifier. The test data is prepared such that, for each crossing pedestrian, their last 30 frames, including the frame in which the pedestrian crossed, are considered as a test sequence.

We introduce three novel metrics to assess the performances of the end-to-end models in the moments before and concurrent to the pedestrian crossing action. The three metrics are:

  • Metric 1: the accuracy in predicting the crossing action exactly 16 frames (the same as the sliding window used for the DenseNet) before it takes place. This implies a prediction, made at frame $t-16$, of crossing or not crossing regarding an action at frame $t$.

  • Metric 2: the accuracy in predicting the crossing action in the frame where the action actually takes place. This implies using information from frames $t-16$ to $t$ to predict a crossing or not-crossing action at frame $t$.

  • Metric 3: the percentage of “crossing” or “not crossing” predictions in the 16 frames preceding the action. This implies the average accuracy for predicting an action anywhere between frames $t-16$ and $t-1$.

For example, if the GT tells us that the pedestrian is crossing at frame 17, we check whether our system predicts “crossing” at frame 1, at frame 17, and the percentage of “crossing” predictions between frames 1 and 16, using Metric 1, Metric 2, and Metric 3, respectively. We perform the same procedure for each pedestrian in all the videos to calculate the average percentage accuracy. We exclude the pedestrians whose crossing action happened before the 16th frame, since the metrics are unable to capture this instance.
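The three metrics above can be sketched for a single crossing pedestrian and then averaged, as below. The function names, the 0/1 encoding of per-frame predictions, and the 1-based frame numbering are assumptions chosen to match the frame-17 example in the text.

```python
def crossing_metrics(predictions, action_frame, horizon=16):
    """Per-pedestrian scores for a GT crossing that starts at `action_frame` (1-based).
    `predictions[k]` is 1 if frame k+1 is predicted as crossing, else 0.
    Returns None when the action happens too early to look `horizon` frames back."""
    first = action_frame - horizon
    if first < 1:
        return None
    ahead = predictions[first - 1]                   # prediction `horizon` frames before
    at_action = predictions[action_frame - 1]        # prediction at the action frame
    window = predictions[first - 1:action_frame - 1]  # the `horizon` preceding frames
    return ahead, at_action, sum(window) / horizon


def average_metrics(per_pedestrian):
    """Average each metric over pedestrians, skipping excluded (None) cases, in percent."""
    valid = [m for m in per_pedestrian if m is not None]
    return tuple(100.0 * sum(m[i] for m in valid) / len(valid) for i in range(3))


# GT crossing at frame 17; the system predicts crossing from frame 9 onward.
single = crossing_metrics([0] * 8 + [1] * 9, action_frame=17)
excluded = crossing_metrics([1] * 12, action_frame=10)
averaged = average_metrics([single, excluded, (1, 1, 1.0)])
```

In this example the system misses the crossing at frame 1, catches it at frame 17, and predicts crossing in half of frames 1 to 16.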

Model    Description                                       Metric 1  Metric 2  Metric 3
A [10]   YOLOv3 + SORT + DenseNet                          37        60        55
B        YOLOv3 + DeepSORT + DenseNet                      36        30        32
C        YOLOv3 + SORT + Early-fused Skeleton DenseNet     45        58        57
D        YOLOv3 + DeepSORT + Early-fused Skeleton DenseNet 40        47        45
TABLE V: Percentage Performance Analysis for end-to-end system.

Table V shows that model C has the highest accuracy for predicting the crossing intention up to 16 frames ahead with respect to the first and third metrics. Thus, when compared with models A and B, the proposed model C can better predict intention with the fusion of skeletons. Although model C is not the highest in the second metric, its accuracy (58) is very close to the highest (60).

IV Conclusions and Discussion

In this work we implement multiple fusion models to combine spatio-temporal features with fitted skeletal features to enhance pedestrian intention prediction with respect to the state-of-the-art works in [10, 3]. We observe comparable to significantly improved performance in every module with respect to the benchmarks, owing to additional training on JAAD annotated skeletons on cropped bounding box images. Additionally, we observe that early fusion significantly outperforms the late and combination fusion systems. Future work will be directed towards further reducing false negative instances to enable accurate safety distance estimations for ego-vehicle maneuvers.


  • [1] J. Brooks (2019) COCO Annotator. https://github.com/jsbroks/coco-annotator/
  • [2] S. R. Chowdhury, L. Tornberg, R. Halvfordsson, J. Nordh, A. S. Gustafsson, J. Wall, M. Westerberg, A. Wirehed, L. Tilloy, Z. Hu, H. Tan, M. Pan, and J. Sjöberg (2019) Automated augmentation with reinforcement learning and GANs for robust identification of traffic signs using front camera images. In 2019 53rd Asilomar Conference on Signals, Systems, and Computers, pp. 79–83.
  • [3] Z. Fang and A. M. López (2019) Intention recognition of pedestrians and cyclists by 2D pose estimation. IEEE Transactions on Intelligent Transportation Systems.
  • [4] I. Kotseruba, A. Rasouli, and J. K. Tsotsos (2016) Joint attention in autonomous driving (JAAD). arXiv preprint arXiv:1609.04741.
  • [5] Y. Li, Q. Zhai, S. Ding, F. Yang, G. Li, and Y. F. Zheng (2019) Efficient health-related abnormal behavior detection with visual and inertial sensor integration. Pattern Analysis and Applications 22 (2), pp. 601–614.
  • [6] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831.
  • [7] R. Padilla, S. L. Netto, and E. A. B. da Silva (2020) A survey on performance metrics for object-detection algorithms. In International Conference on Systems, Signals and Image Processing (IWSSIP).
  • [8] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
  • [9] E. Rehder, F. Wirth, M. Lauer, and C. Stiller (2018) Pedestrian prediction by planning using deep neural networks. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–5.
  • [10] K. Saleh, M. Hossny, and S. Nahavandi (2019) Real-time intent prediction of pedestrians for autonomous ground vehicles via spatio-temporal DenseNet. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9704–9710.
  • [11] D. Varytimidis, F. Alonso-Fernandez, B. Duran, and C. Englund (2018) Action and intention recognition of pedestrians in urban traffic. In 2018 14th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), pp. 676–682.
  • [12] N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649.