VRUNet: Multi-Task Learning Model for Intent Prediction of Vulnerable Road Users

by   Adithya Ranga, et al.

Advanced perception and path planning are at the core of any self-driving vehicle. Autonomous vehicles need to understand the scene and the intentions of other road users for safe motion planning. For urban use cases it is very important to perceive and predict the intentions of pedestrians, cyclists, scooters, etc., classified as vulnerable road users (VRUs). Intent is a combination of pedestrian activities and long-term trajectories defining their future motion. In this paper we propose a multi-task learning model to predict pedestrian actions and crossing intent, and to forecast their future path from video sequences. We have trained the model on the open-source naturalistic driving JAAD dataset, which is rich in behavioral annotations and real-world scenarios. Experimental results show state-of-the-art performance on the JAAD dataset and how we can benefit from jointly learning and predicting actions and trajectories using 2D human pose features and scene context.




1 Introduction

With the advancements in artificial intelligence (AI) and deep learning, we are able to tackle some of the challenging problems in the field of autonomous systems and robotics. Automated driving (AD) or self-driving vehicles are becoming more common on urban and highway roads, and are able to handle many complex driving scenarios. Figure 2 illustrates the entire AD architecture, starting from sensing all the way to lateral and longitudinal control of the vehicle. Advanced perception, including object detection and scene understanding, followed by detailed abstraction of information specific to individual objects in the scene, is crucial for path planning.

One of the key challenges with path planning for automated driving on urban roads is that the vehicles have to constantly interact with pedestrians, cyclists, scooters, etc., generally identified as vulnerable road users (VRUs). VRUs on the road move with specific goals in mind, while respecting certain rules and directly interacting with the other actors in the scene. It is increasingly apparent that many pedestrians are distracted by devices (e.g. cell phones, headsets), as seen in Figure 1, and put themselves and the traffic around them at higher risk. The survey by Rasouli et al. [22] discusses the factors to be considered for understanding VRUs and their interactions. Human cognition is very good at understanding and anticipating the actions and intentions of others on the road. To achieve naturalistic driving behavior and interact safely with other road users, AI needs to match this level of human intelligence. This problem of activity recognition and prediction is gaining significant attention from the AI and computer vision community, especially for applications in automated driving and surveillance.

Figure 1: Distracted pedestrians while walking/crossing on urban roads

Most of the earlier research methods propose a model-based approach or a probabilistic solution to predict the intent of VRUs in real time. More recently, with the availability of open-source datasets for action recognition and automated driving, many machine learning and deep learning models are being developed. Most of these approaches either focus on activity recognition or address intent as a time-series prediction problem, treating the two separately. The overall intention of actors on the road can be better summarized from their short-term discrete actions together with a continuous forecast of their future positions. The actions performed by pedestrians in the past can be used to predict their future actions and how they move on the road. Pedestrians also navigate the scene respecting defined rules and interacting with other objects, for example walking on the crosswalk while crossing, stopping at traffic lights, and yielding to other vehicles. To this end, we propose a single multi-task learning model that jointly predicts the actions, crossing intent and trajectories from video sequences. Firstly, to better understand the actions of the persons, we abstract low-level features regarding their body pose and efficiently track them in the scene. Secondly, for intent prediction we use scene semantics as an additional input to the multi-task prediction model. Experiments on the JAAD dataset [14] show better results than the baseline for pedestrian intent using this multi-task learning approach.

The rest of the paper is organized as follows: a review of the related work is presented in Section 2. Details regarding our research methodology and network architecture are discussed in Section 3. Experiments conducted on JAAD and the results are illustrated in Section 4. Section 5 concludes the paper.

Figure 2: Modules in an autonomous driving pipeline.

2 Related Work

2.0.1 Activity Recognition and prediction

Human activity recognition and prediction from video data has been studied for some time now. Many works have been proposed using recurrent neural networks (RNNs). [31], [6] and [32] are some of the recent works on action recognition from videos using RNNs for single individuals. [18] proposes using LSTMs for early activity detection. There are also some recent methods [9], [5] for group action recognition, and [3] discusses an end-to-end approach for detecting persons and jointly predicting their behaviors.

2.0.2 Trajectory Prediction

Predicting the future path of pedestrians in videos is a well-known area of research. Most of the approaches are model-based, probabilistic or, more recently, deep learning based. The authors in [29] compare constant velocity models (CVM) with probabilistic and deep learning approaches for trajectory prediction. In [23], a method is proposed for improving the performance of model-based prediction by using the goal or final destination of pedestrians as a latent variable. Well-known deep learning approaches use RNNs such as LSTM encoder-decoder architectures for trajectory prediction, as in [30]. More recent deep learning approaches focus on improving the trajectories by considering human social interaction in crowds, as in Social-LSTM [1]. Social-GAN [8] used adversarial training, and SR-LSTM [35] added a refinement module to the LSTM network using neighbors' intentions, aiming to improve the accuracy of trajectory prediction in crowded scenes. Scene-LSTM [20] uses scene data along with pedestrian data and a grid structure to learn and predict human trajectories with reduced location displacement errors. Most of these deep learning and socially aware approaches are developed for static cameras looking at crowded scenes, without considering the current or past actions or behaviors of the persons.

2.0.3 Intent Prediction for Automated driving

Recently, with increased attention towards pedestrian safety and AD, research activities [22] are focusing more on the behavioral science of pedestrians and their interactions with other road users. Many intention prediction models have been proposed so far, especially for pedestrians.

Model-based approaches like [28] and [10] rely on dynamic models and probabilistic estimation techniques. In some recent applications like [13], pedestrian context such as awareness when looking at the vehicle and metrics such as distance to the curb were taken into consideration to improve the accuracy of intention prediction with dynamic models. In [19], social forces or interactions were added to the dynamic models to further improve the accuracy of model-based intention prediction. Most of the model-based work is done with staged scenarios, while treating the pedestrian as any other point in space and not including any scene context.

Machine learning (ML) approaches in [11] and [12] use SVMs and pose features to predict the crossing intent. To further improve the performance, in [21] and [27] scene context such as the existence of a traffic light, crosswalks and lane signature was included for a sequence of frames. More recently, researchers in [7] have used image sequences and skeleton-based features for pedestrians and cyclists to predict crossing/not-crossing intent values.

Deep learning approaches in [21] are used to predict the walking/standing and looking/not-looking state of pedestrians separately from cropped images. Intent as a time-series trajectory prediction problem is discussed in [26], where a stacked LSTM model is used to predict the future positions of VRUs without considering their 2D pose, behaviors and scene context. In [25], researchers propose a multi-tasking model to predict the pose orientation and standing/walking behavior of pedestrians from images. More recently, the researchers in [16] jointly train a multi-tasking model to predict activities, with trajectories as an auxiliary task. That work is focused on a static camera configuration overlooking crowded scenarios, whereas our approach is focused on naturalistic driving scenarios for automated driving.

3 Methodology

In urban scenarios, the history of a pedestrian's behaviors, such as gait (whether someone is walking or standing), awareness level, orientation, distraction and social interactions, influences their future state or actions. The future location or goal of the person in the scene can also be determined from their past actions by correlating them with scene semantics. Motivated by this, we developed a single multi-tasking model that predicts the behaviors, crossing intentions and future trajectories of VRUs in the scene.

Our system, as shown in Figure 3, first processes sequences of camera frames through the perception backbone to obtain the 2D pose or skeleton for all the persons in the scene and to track them throughout the sequence. We also process each frame to extract scene semantics using a semantic segmentation model. Given the 2D pose, bounding box and scene context over the observed frames, the model classifies the current state or action of each person at the last observed frame, and predicts the future crossing/not-crossing intent at the end of the prediction horizon. The model also simultaneously predicts the future positions in the image, i.e. the trajectory over the prediction horizon. Here horizon is the defined prediction duration, or look-ahead into the future.

Figure 3: Overview of our research approach. Video sequences are processed through the perception backbone to generate tracked object (2D pose and bounding box) and scene (segmentation mask) context. The joint model is trained on the perception outputs to predict actions, crossing intent and trajectory in the videos.

3.1 Dataset

One of the main challenges for deep learning based intent prediction for AD is creating behavioral annotations for VRUs from natural driving data. This problem is largely addressed by the JAAD dataset [14], which contains rich behavioral labels for persons who interact with the driver and also provides possible intent values for specific persons in the scene. The dataset also provides additional contextual labels for each person, as well as high-level scene annotations. The dataset does not currently include ego-vehicle odometry information.

Additionally, we leverage COCO [17] and PoseTrack datasets [2] for fine-tuning the 2D pose detection backbone network. For scene context, a semantic segmentation network is pre-trained on Cityscapes dataset [4] and is used to obtain the encoded scene masks for our model.

3.2 Formal definition of Behaviors and Intention

In our model we focus on the following actions and crossing intent of VRU’s:

  • Gait: If the person in the scene is walking or standing (Walking / Standing)

  • Attention: If the person is directly looking at the vehicle or not (Looking / Not Looking)

  • Orientation: the pose orientation of the person with respect to the viewing angle (Left / Right / Front / Back)

  • Distraction: If the person is distracted with a phone (Phoning / Not Phoning)

  • Crossing Intention: Whether the person will cross the road/lane in front of the vehicle at the end of the prediction horizon, where "horizon" is the prediction period in seconds (Crossing / Not Crossing)

We process the JAAD data to refine annotated labels to the above described class labels for training and testing the model performance.
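As an illustration only, the class labels listed above could be held in a small typed container like the one below. The enum names, the `VRULabels` dataclass and the index mapping are hypothetical and are not taken from the paper or from the JAAD annotation format.

```python
from dataclasses import dataclass
from enum import Enum


class Gait(Enum):
    WALKING = "walking"
    STANDING = "standing"


class Attention(Enum):
    LOOKING = "looking"
    NOT_LOOKING = "not_looking"


class Orientation(Enum):
    LEFT = "left"
    RIGHT = "right"
    FRONT = "front"
    BACK = "back"


class Distraction(Enum):
    PHONING = "phoning"
    NOT_PHONING = "not_phoning"


class CrossingIntent(Enum):
    CROSSING = "crossing"
    NOT_CROSSING = "not_crossing"


@dataclass(frozen=True)
class VRULabels:
    """Labels for one person in one sample (hypothetical container)."""
    gait: Gait
    attention: Attention
    orientation: Orientation
    distraction: Distraction
    crossing: CrossingIntent  # refers to the end of the prediction horizon

    def as_class_indices(self) -> dict:
        """Map each label to an integer index in enum declaration order."""
        indices = {}
        for name in ("gait", "attention", "orientation", "distraction", "crossing"):
            value = getattr(self, name)
            indices[name] = list(type(value)).index(value)
        return indices
```

Keeping each task as its own enum makes the number of output classes per task explicit (two for the binary tasks, four for orientation).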

3.3 Perception Backbone - Pose Estimation and Scene Understanding

In many previous approaches for activity recognition and prediction, persons in the scene are represented like any other points or objects in space, as bounding boxes. Deep networks are trained on the full image, video sequences or cropped bounding boxes to recognize the actions. However, most of the high-level actions of humans can be abstracted from their skeleton or pose information. By representing the person as a 2D skeleton, most of the dynamics can be accurately captured and the performance can greatly improve using less dense networks. To this end a pre-trained pose estimation network, PifPaf [15], trained on the COCO dataset [17] and fine-tuned on PoseTrack [2], is used to obtain the keypoints and object bounding boxes. A total of 17 keypoints, each given as $(x, y, v)$, are detected for each person at any given instance, where $x$ and $y$ are the pixel locations and $v$ is the visibility score of the keypoint. Figure 4 shows the visualization of 2D pose prediction and boxes for sample persons from the JAAD data.

Every person in the scene needs to be tracked through the sequence, so that temporal changes can be modeled by observing the change in their features. To track multiple pedestrians in the scene we use the approaches from [33] and [34]. This tracking technique uses the measured state of the detected object in the scene and a Kalman filter to track and update it over time. An additional CNN model is trained on a large-scale person re-identification dataset [36], and its features are associated using a defined metric to improve tracking performance. Each person in the scene is tracked, and the 2D pose and boxes are extracted for the track. For a track or sequence length of N, each person in the scene will have keypoint or pose features of dimension $N \times 17 \times 3$ and bounding boxes of dimension $N \times 4$.
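A minimal sketch of how per-track features might be assembled, assuming each frame of a track carries 17 keypoints as (x, y, v) triplets and one box as four numbers. The frame dictionary layout, the box convention (x1, y1, x2, y2) and the function names are assumptions made for illustration; they are not the authors' data format.

```python
NUM_KEYPOINTS = 17  # number of keypoints per person, as stated in the text


def flatten_pose(keypoints):
    """Flatten 17 (x, y, v) triplets into one 51-value feature vector."""
    if len(keypoints) != NUM_KEYPOINTS:
        raise ValueError("expected 17 keypoints per person")
    return [value for (x, y, v) in keypoints for value in (x, y, v)]


def build_track_features(track):
    """Turn a track (list of N frame dicts) into pose and box sequences.

    Each frame dict is assumed to hold 'keypoints' (17 triplets) and
    'box' (x1, y1, x2, y2). Returns nested lists of shape N x 51 and N x 4.
    """
    pose = [flatten_pose(frame["keypoints"]) for frame in track]
    boxes = [list(frame["box"]) for frame in track]
    return pose, boxes
```

For a 30-frame track this yields a 30 x 51 pose sequence and a 30 x 4 box sequence, which is the kind of fixed-size input a sequence model expects.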

(a) child and adult
(b) cyclist
(c) adult using phone
(d) adult with cart
Figure 4: 2D human pose prediction visualization on JAAD data

The perception backbone also includes a semantic segmentation module with a VGG16 encoder and U-Net [24] decoder architecture that is pre-trained on the Cityscapes dataset. This module associates every pixel of the scene with its respective class and gives us a full scene understanding. The segmentation mask output has the same resolution as the input image, and each pixel location holds a class index. Hence the scene context mask for a sequence of length N is a stack of N such masks. Because the model was not trained on JAAD, we see a domain gap, with lower performance for some classes.

(a) segmentation visualization on cityscapes test set
(b) segmentation visualization on JAAD data
Figure 5: semantic segmentation visualization on cityscapes (top) and JAAD data (bottom)

3.4 Intent and Trajectory Prediction

3.4.1 Modular approach

As part of this research, we have trained separate models for each task, as shown in Figure 7. This establishes a baseline for each task for later comparison. We use the 2D human pose and pre-calculate geometrical features such as angles and distances between joints and keypoints. The following models are trained separately with specific features to predict the behaviors (gait, attention, distraction), orientation and overall crossing intent.

  • Gait Model: As mentioned in Section 3.2, for gait we predict whether the person is walking or standing in the scene. To determine the gait state of persons we use the keypoints for the legs (knees and ankles). We calculate features consisting of the distances between right/left ankles and knees, the angles between the limbs, and the hip center of the person. We stack the features for a sequence observation length of N, so that the temporal change in features is captured well. A 1D ResNet-10 model is trained on the stacked features for the binary classification task, optimizing a cross-entropy loss.

  • Attention + Orientation: For this task we mainly focus on the keypoints of the upper body, i.e. the head (eyes, nose, ears) and shoulders. The values of these 7 keypoints, stacked over the last N frames, are used as input to train a separate 1D ResNet-10 model. The model predicts both attention (Looking or Not Looking) and orientation (Left, Right, Front or Back) simultaneously, and is trained with a weighted cross-entropy loss.

  • Distraction: This is a binary classification task to determine if the VRU is phoning. We use angle features: the angles between the lower arm and upper bicep for each hand, and the angles between the left and right hands and the upper bicep respectively. These pre-calculated features are stacked over the last N frames and classified using a support vector classifier (SVC) with a radial basis function (RBF) kernel.

  • Crossing Intent: The main goal here is to predict whether the person in the scene will cross in front of the vehicle at a defined future time, using the person's context and scene information over the observed frames as input. For training this model we use the past frames and generate all the behavior states (gait, attention, distraction) and orientation values from the pre-trained action recognition models. Additionally, we binary encode the scene context annotations from JAAD for the presence of a traffic light/sign, a crosswalk, and the lane width (narrow/wide). This gives an input with 9 features. A simple support vector model is fitted to classify whether the person will cross the road in front of the vehicle or not.
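The pre-calculated geometric features described above (distances between keypoints and angles between limbs) can be sketched as follows. The keypoint names, the choice of which joints form each angle and the resulting six-value leg feature vector are assumptions for illustration; the paper's exact feature definition is not recoverable from the text.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y) points in pixels."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def joint_angle(a, b, c):
    """Angle in degrees at joint b, between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # degenerate limb (coincident keypoints)
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def leg_features(kp):
    """Hypothetical per-frame gait features from a dict name -> (x, y)."""
    hip_center = (
        (kp["left_hip"][0] + kp["right_hip"][0]) / 2.0,
        (kp["left_hip"][1] + kp["right_hip"][1]) / 2.0,
    )
    return [
        distance(kp["left_ankle"], kp["right_ankle"]),
        distance(kp["left_knee"], kp["right_knee"]),
        joint_angle(kp["left_hip"], kp["left_knee"], kp["left_ankle"]),
        joint_angle(kp["right_hip"], kp["right_knee"], kp["right_ankle"]),
        hip_center[0],
        hip_center[1],
    ]
```

Stacking the output of `leg_features` over the N observed frames gives an N x 6 array in this sketch, which could then be fed to a 1D convolutional classifier or, for the arm angles, to an SVC.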

3.5 VRUNet model architecture

The intent of every person in the scene has a strong temporal dependency on their past actions and on how they navigate the scene while interacting with other actors. To this end, we propose an end-to-end trained multi-task model, as shown in Figure 6. Using this multi-tasking approach we jointly predict the actions, crossing intent and trajectory of VRUs from video sequences. Multi-tasking reduces the overall compute and memory requirements through weight sharing. Inputs to this model are the 2D pose features, object bounding boxes and scene semantic masks produced by the perception backbone. Using the object context and scene context over the observed frames, we classify the actions (gait, attention, distraction) and orientation of the person at the last observed frame and predict crossing intent at the end of the prediction horizon. We also jointly predict the positions of the VRU over the prediction horizon in image coordinates. Input sizes to the network and the resolution of the scene mask are as defined in Table 1.

Name Size
Input Image
Input Scene Mask
Input Box
Input Pose
Table 1: Input Layers for VRUNet

The input size of the pose features depends on the maximum number of VRU instances we train on in any given sequence. The pose input is first processed through an embedding layer comprising 2D convolutions, as shown in Table 2. The output of the pose embedding layer is then input to a stacked LSTM encoder and finally processed through fully connected layers. Similarly, the bounding box features are processed through embedding layers followed by a stacked LSTM architecture, as defined in Table 2.

Name Size Filters Stride
Conv 2
Conv 2
LSTM 256 N/A
FC 256 N/A
FC 256 N/A
Table 2: Pose and Bounding Box Encoding Layers for VRUNet

The scene semantic segmentation mask from the perception backbone has 5 classes (road, car, pedestrian, sidewalk, traffic sign) that we use for this model. The segmentation mask is binary encoded to produce semantic features with one binary channel per class, where classes is 5. The mask output by the semantic segmentation model is first reshaped to the scene mask input resolution of the network, so the input scene features after binary encoding contain one such encoded mask per frame of the sequence. This input is then used to compute a mean mask along the time axis before processing through the model. It is then encoded using 2D convolution and max pooling layers followed by fully connected layers, as shown in Table 3.


Name Size Filters number Stride
Conv 256 2
Conv 256 2
Maxpool N/A 1
Conv 512 2
Conv 512 2
Maxpool N/A 1
FC 1024 N/A N/A
FC 1024 N/A N/A
FC 256 N/A N/A
FC 256 N/A N/A
Table 3: Scene Encoding Layers for VRUNet
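The binary encoding of the segmentation mask and the mean along the time axis described for the scene branch can be sketched in plain Python. Nested lists stand in for image tensors, and the function names and the toy 1 x 2 masks are illustrative assumptions; a real pipeline would use an array library and the 5 classes named in the text.

```python
def one_hot_mask(mask, num_classes):
    """Binary-encode an H x W mask of class indices into H x W x num_classes."""
    return [
        [[1 if cls == k else 0 for k in range(num_classes)] for cls in row]
        for row in mask
    ]


def mean_mask(encoded_sequence):
    """Average N encoded masks (N x H x W x C) along the time axis."""
    n = len(encoded_sequence)
    height = len(encoded_sequence[0])
    width = len(encoded_sequence[0][0])
    channels = len(encoded_sequence[0][0][0])
    return [
        [
            [
                sum(encoded_sequence[t][i][j][k] for t in range(n)) / n
                for k in range(channels)
            ]
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy example: two frames, 1 x 2 pixels, 3 classes.
frames = [[[0, 1]], [[0, 2]]]
encoded = [one_hot_mask(frame, num_classes=3) for frame in frames]
scene_input = mean_mask(encoded)  # H x W x C, values in [0, 1]
```

Each value of `scene_input` is the fraction of observed frames in which that pixel belonged to that class, so static structure such as road and sidewalk stays close to 0 or 1 while moving objects are smeared out.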

Outputs from the scene, pose and bounding box encoding branches are fused channel-wise and processed through separate branches with fully connected layers to predict the actions and crossing intent. Given the sequence of encoded poses and bounding boxes, the model outputs the action and intent probabilities. The softmax probabilities from each task are then used to calculate the specific action/behavior loss $L_a = -\sum_{i=1}^{M} \sum_{c} y_{i,c}^{a} \log \hat{y}_{i,c}^{a}$, where $a$ is the activity, $M$ is the maximum number of VRU instances from the sequence, and $y^{a}$ and $\hat{y}^{a}$ are the ground truth and predicted class labels respectively. Using the following weighted sum of individual cross-entropy losses, with task weights $w_a$, we jointly train all the classification tasks: $L_{cls} = \sum_{a} w_a L_a$.


For trajectory prediction we use an LSTM encoder-decoder configuration focusing on the bounding box inputs. The encoded inputs are passed to the LSTM decoder stack along with the internal state, and the decoder outputs future box center positions in pixels. Given the bounding box sequences over the observed frames, the model predicts the future bounding box centers over the prediction horizon. This regression task is trained to optimize the mean squared error (MSE) loss $L_{mse} = \frac{1}{n} \sum_{k=1}^{n} \lVert p_k - \hat{p}_k \rVert^2$. Here $p_k$ and $\hat{p}_k$ are the ground truth and predicted pixel values of the object box centers respectively, and $n$ is the number of predicted positions. We add an L2 regularization term $L_{l2}$ to avoid overfitting, giving the regression loss $L_{reg} = L_{mse} + L_{l2}$. The total loss is jointly optimized as a weighted sum of the classification and regression losses, $L = w_{cls} L_{cls} + w_{reg} L_{reg}$, where $L_{cls}$ is the weighted sum of the classification losses.
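To make the structure of the joint objective concrete, here is a small numeric sketch: a weighted sum of per-task cross-entropy losses, an MSE term on box centers, an L2 penalty, and their weighted combination. The task weights, the penalty coefficient `lam`, what the penalty is applied to, and all function names are assumptions; the source does not give these values beyond the regularization constant in Table 4.

```python
import math


def cross_entropy(probs, target_index, eps=1e-12):
    """Cross-entropy for one instance given softmax probabilities."""
    return -math.log(max(probs[target_index], eps))


def classification_loss(task_probs, task_targets, task_weights):
    """Weighted sum over tasks of cross-entropy summed over VRU instances."""
    total = 0.0
    for task, weight in task_weights.items():
        per_instance = [
            cross_entropy(p, t)
            for p, t in zip(task_probs[task], task_targets[task])
        ]
        total += weight * sum(per_instance)
    return total


def mse(pred_centers, true_centers):
    """Mean squared pixel distance between predicted and true box centers."""
    squared = [
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred_centers, true_centers)
    ]
    return sum(squared) / len(squared)


def l2_penalty(values, lam):
    """L2 regularization term lam * sum(v^2) over the penalized values."""
    return lam * sum(v * v for v in values)


def total_loss(cls_loss, mse_loss, l2_term, w_cls=1.0, w_reg=1.0):
    """Weighted sum of classification and regression losses."""
    return w_cls * cls_loss + w_reg * (mse_loss + l2_term)
```

In a deep learning framework the same combination would be expressed with built-in loss functions and per-output loss weights; the plain-Python version only shows how the terms are combined.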

Figure 6: VRUNet Multi-task Model Network Architecture
Figure 7: Modular approach showing specific features and models used for each task

4 Experiments and Results

4.1 Dataset Processing and Augmentation

The JAAD dataset has 2D bounding boxes and tracks annotated for persons in the scene, and only a subset of pedestrians comes with behavior labels and the respective contextual information. For our training we mainly concentrate on pedestrians that are not heavily occluded (less than 25% visible) and that have crossing intent and behavior annotations (less than 30%). This gives us very few sequences (40K frames) and also an imbalance in class labels. The pedestrians that are actually seen to be distracted or phoning make up less than 20% of all behavior labels. Firstly, we extract the tracks for pedestrians with durations greater than 1.5 seconds. Using a sliding window we process the sequences to generate tracks with observation lengths of 0.5s and 1s, i.e. 15 and 30 frames (given 30Hz sequences), and prediction lengths of 1s, i.e. 30 frames. We pad some tracks with shorter sequence lengths at the beginning and end of the tracks in order to have a fixed sequence length for the entire dataset.

The generated tracks are then processed through the perception pipeline to extract 2D pose and semantic segmentation masks. Additional training tracks are generated from augmentation by flipping the sequences, adding pixel dropout and random noise to scene masks and pose keypoints.
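The sliding-window track generation and the flip augmentation described above might look roughly like this. The default lengths follow the 30-frame observation and 30-frame prediction windows in the text; the step size, the box convention (x1, y1, x2, y2) and the function names are assumptions, and padding of short tracks is not shown.

```python
def sliding_windows(track, obs_len=30, pred_len=30, step=1):
    """Split one track into (observed, future) pairs of fixed length.

    track is a list of per-frame items (boxes, poses, labels, ...).
    Tracks shorter than obs_len + pred_len yield no windows here.
    """
    windows = []
    last_start = len(track) - (obs_len + pred_len)
    for start in range(0, last_start + 1, step):
        observed = track[start:start + obs_len]
        future = track[start + obs_len:start + obs_len + pred_len]
        windows.append((observed, future))
    return windows


def flip_boxes(boxes, image_width):
    """Mirror (x1, y1, x2, y2) boxes horizontally for augmentation."""
    return [
        (image_width - x2, y1, image_width - x1, y2)
        for (x1, y1, x2, y2) in boxes
    ]
```

When sequences are flipped, left/right-dependent quantities such as orientation labels and left/right keypoints would need to be swapped as well, which this sketch leaves out.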

4.2 Training Implementation

In this section we describe how we train the baseline models for the individual action recognition tasks using pre-computed pose features. We then describe training of the multi-task VRUNet model, and compare our results to the baseline models and the JAAD benchmark [21].

Modular Approach - Baseline Training: To classify gait (Walking or Standing), a ResNet-10 model with 1D convolutions is trained using the stacked leg features (from Section 3.4) as input, and outputs a class probability. The model is trained to optimize a binary cross-entropy loss for 100 epochs, using an initial learning rate (lr) of 0.0001 and the Adam optimizer with a batch size of 32. For attention and orientation we use a single model, considering that these are mostly determined by common features of the person's face and shoulders. A ResNet-10 model with 1D convolutions is jointly trained for these two tasks using a weighted cross-entropy loss. The model is trained for 150 epochs with the Adam optimizer. It outputs probabilities for attention and orientation, jointly classifying whether the person is looking or not and in which direction (Left, Right, Front, Back) they are oriented. A distraction model is fitted with an SVM classifier using averaged pose features for a sequence length of N (see Section 3.4).

These models for individual tasks are used to generate action predictions on the tracks for crossing intent. Now given the actions for a VRU for the entire track history and the scene context (see section 3.4) we classify if the pedestrian is Crossing or not. This is trained with the ground truth label from the last sample of the prediction length of track (1 second horizon).

VRUNet Training: We use 60%, 20%, 20% splits of all the tracks for training, validation and test sequences. Inputs to the model are the 2D pose and bounding box sequences for each person in the scene, and also the scene mask, as shown in Table 1. Pose and bounding boxes are normalized by the input resolution. The segmentation mask output is binary encoded for each class (classes = 5). Input sequences of object and scene context for each person cover the observed length. For action recognition (gait, attention and distraction) and orientation we use the training labels from the current time, i.e. the last observed frame. For crossing intent the training label is the state at the end of the prediction horizon. We use a stateless LSTM for encoding pose and bounding box features and pass the state of the encoder to the decoder for the trajectory branch. For trajectory training we use the bounding box centers over the future horizon as training labels. Hyperparameters used for training are shown in Table 4.


Name Value
Batch Size 32
Learning Rate with adaptive step change
Optimizer Adam
Epochs 500
Regularization L2 for Trajectory Outputs 0.0003

LSTM with tanh and Conv2D with relu

Table 4: Training Hyperparameters
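The label timing described for VRUNet training (action labels at the last observed frame, crossing label at the end of the horizon, box centers over the horizon as trajectory targets) can be illustrated with a small sample builder. The function names, the dictionary layout and the per-frame list inputs are hypothetical; only the timing and the normalization by input resolution come from the text.

```python
def normalize_box(box, width, height):
    """Scale an (x1, y1, x2, y2) pixel box by the input resolution."""
    x1, y1, x2, y2 = box
    return (x1 / width, y1 / height, x2 / width, y2 / height)


def box_center(box):
    """Center (cx, cy) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def build_sample(boxes, actions, crossing, obs_len, horizon, width, height):
    """Assemble inputs and labels for one person from per-frame lists."""
    if len(boxes) < obs_len + horizon:
        raise ValueError("track shorter than observation plus horizon")
    last_observed = obs_len - 1  # zero-based index of the current frame
    future = boxes[obs_len:obs_len + horizon]
    return {
        "box_inputs": [normalize_box(b, width, height) for b in boxes[:obs_len]],
        "action_label": actions[last_observed],
        "crossing_label": crossing[last_observed + horizon],
        "trajectory_target": [box_center(b) for b in future],  # in pixels
    }
```

With 30 observed frames and a 30-frame horizon at 30Hz, the crossing label in this sketch comes from the frame one second after the last observation.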

4.3 Results

In Table 5 we have the comparison for average precision (AP) of the action recognition and crossing intent prediction for a 1 second horizon. We present the results for the modular approach i.e. separately trained models for two observation durations of 0.5 and 1 seconds and for the VRUNet multi-task model with observation duration of 1 second or 30 frames.

Model GAIT ATTN DIST ORNT XNG
AlexNet 78.34 67.45 N/A N/A N/A
AlexNet-Pre 80.45 75.23 N/A N/A N/A
AlexNet-Crop 83.45 80.23 N/A N/A N/A
Context N/A N/A N/A N/A 62.73
Modular (0.5s) 91.32 87.5 86.45 83.52 66.87
Modular (1s) 93.27 89.35 87.43 83.27 67.34
VRUNet (1s) 79.35 78.27 81.23 82.37 73.47
Table 5: Average Precision (AP%) for behaviors and intent predictions (GAIT - gait, ATTN - attention, DIST - distraction, ORNT - orientation, XNG - crossing intent)(Red - Ours)
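For readers unfamiliar with the metric, one common way to compute average precision for a binary task is to rank the predictions by score and average the precision at each positive sample. The paper does not state which AP variant was used, so the function below is a generic illustration rather than the authors' evaluation code.

```python
def average_precision(scores, labels):
    """Mean of precision values at the rank of each positive sample.

    scores: predicted probabilities for the positive class.
    labels: ground truth, 1 for positive and 0 for negative.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0
```

A perfect ranking, with every positive scored above every negative, gives an AP of 1.0; values in the table above are this kind of quantity expressed as a percentage.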

We see that the overall AP of the individual action recognition tasks is much better when we train for each task separately. The crossing intent prediction for a 1 second horizon is better when the history or observation length is longer (1s). Using the 2D pose of persons and its derived features in separate models (modular), we outperform the JAAD benchmark on gait (Walking/Standing) and attention (Looking/Not Looking). By adding some high-level scene context to the crossing intent prediction task, we also obtain higher accuracy than the baseline from the JAAD authors. With our multi-task VRUNet model, action recognition AP is lower than with the modular approach, as weight sharing and the weighted loss function avoid over-fitting to any specific task. VRUNet achieves the best overall crossing intent accuracy, which shows the benefit of including low-level features from scene masks as input features for training. For trajectory prediction we currently do not have a baseline comparison to any models. The predicted trajectory is filtered and we fit a third-order polynomial. We then generate trajectory points from the polynomial using pixel positions. Qualitative results for action recognition, intent prediction and trajectory are shown in Figures 8 and 9. In Figure 9(a) the predicted trajectory points are shown in red and the actual sampled ground truth boxes for the pedestrian for a 1s horizon are shown in green.
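The third-order polynomial fit mentioned above can be sketched with an ordinary least-squares fit of x and y against a normalized time index. The filtering step is not shown, and the normal-equation solver, the time normalization and the function names are assumptions; the authors' fitting procedure may differ.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_cubic(ts, values):
    """Least-squares cubic; returns coefficients [c0, c1, c2, c3]."""
    size = 4
    a = [[sum(t ** (i + j) for t in ts) for j in range(size)] for i in range(size)]
    b = [sum(v * t ** i for t, v in zip(ts, values)) for i in range(size)]
    return solve_linear(a, b)


def evaluate(coeffs, t):
    """Evaluate c0 + c1 t + c2 t^2 + c3 t^3."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


def smooth_trajectory(points):
    """Fit cubics to x(t) and y(t) and resample at the original steps."""
    n = len(points)
    if n < 4:
        raise ValueError("need at least 4 points for a cubic fit")
    ts = [i / (n - 1) for i in range(n)]  # normalized time in [0, 1]
    cx = fit_cubic(ts, [p[0] for p in points])
    cy = fit_cubic(ts, [p[1] for p in points])
    return [(evaluate(cx, t), evaluate(cy, t)) for t in ts]
```

Normalizing the time index keeps the normal equations well conditioned for a 30-point horizon; an array library's polynomial fitting routine would do the same job in one call.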

Figure 8: Distracted cyclist not crossing the road with the class labels from VRUNet prediction (Not crossing, phoning, facing front, walking, aware)
(a) Crossing sequence prediction of a distracted crossing pedestrian. Sequence start to end (left to right); the model starts to predict the intent as crossing from the second frame instance. The red trajectory is the smoothed and fitted prediction from the multi-task model. Green boxes are the actual ground truth values for a horizon of 1s.
(b) crossing sequence prediction of pedestrian.
Figure 9: Results visualization of VRUNet prediction

5 Conclusions

In this paper, we have presented a multi-task learning model to jointly predict the actions, crossing intent and trajectories of VRUs. We developed the model using low-level pedestrian context, i.e. 2D human pose and object bounding boxes, and scene context in the form of segmentation masks. We separately encode the visual features for the sequence of VRU poses, boxes and segmentation masks, and finally fuse them to predict the action states (gait, attention, distraction), orientation and crossing intent of VRUs. We also simultaneously predict the trajectory, or future positions in images, using the encoded box features. We showed improved performance for action prediction and crossing intent using this approach on the JAAD dataset. In addition, we trained separate models for each task to establish a benchmark and compared the results with the multi-tasking approach. We specifically see that the model generalizes well across all the action recognition tasks and that crossing intent prediction improves using the scene context.

In future research we could achieve better accuracy for the action recognition tasks and also increase the prediction horizon for crossing intent with improved datasets and better model architectures. Vehicle odometry information as an input could improve the accuracy of trajectory prediction and overall intent prediction. Our future work will focus on including social interaction between VRUs, and interactions between the person and other objects in the scene, to further improve the quality of predictions. Our work has not yet been validated for different populations, which is very important for deploying such behavioral intelligence in automated driving.


References

  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016-06) Social lstm: human trajectory prediction in crowded spaces. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.0.2.
  • [2] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall, and B. Schiele (2018) PoseTrack: A benchmark for human pose estimation and tracking. In CVPR, Cited by: §3.1, §3.3.
  • [3] T. Bagautdinov, A. Alahi, F. Fleuret, P. Fua, and S. Savarese (2017-07) Social scene understanding: end-to-end multi-person action localization and collective activity recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.0.1.
  • [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. External Links: 1604.01685 Cited by: §3.1.
  • [5] Z. Deng, A. Vahdat, H. Hu, and G. Mori (2015) Structure inference machines: recurrent neural networks for analyzing relations in group activity recognition. External Links: 1511.04196 Cited by: §2.0.1.
  • [6] Y. Du, W. Wang, and L. Wang (2015-06) Hierarchical recurrent neural network for skeleton based action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.0.1.
  • [7] Z. Fang and A. M. López (2019) Intention recognition of pedestrians and cyclists by 2d pose estimation. External Links: 1910.03858 Cited by: §2.0.3.
  • [8] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018) Social gan: socially acceptable trajectories with generative adversarial networks. External Links: 1803.10892 Cited by: §2.0.2.
  • [9] M. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori (2015) A hierarchical deep temporal model for group activity recognition. External Links: 1511.06040 Cited by: §2.0.1.
  • [10] C. G. Keller and D. Gavrila (2014) Will the pedestrian cross? a study on pedestrian path prediction. IEEE Transactions on Intelligent Transportation Systems 15, pp. 494–506. Cited by: §2.0.3.
  • [11] S. Koehler, M. Goldhammer, S. Bauer, K. Doll, U. Brunsmann, and K. Dietmayer (2012-09) Early detection of the pedestrian’s intention to cross the street. pp. 1759–1764. External Links: ISBN 978-1-4673-3064-0, Document Cited by: §2.0.3.
  • [12] S. Koehler, B. Schreiner, S. Ronalter, K. Doll, U. Brunsmann, and K. Zindler (2013-06) Autonomous evasive maneuvers triggered by infrastructure-based detection of pedestrian intentions. pp. 519–526. External Links: ISBN 978-1-4673-2754-1, Document Cited by: §2.0.3.
  • [13] J. F. P. Kooij, N. Schneider, F. Flohr, and D. Gavrila (2014) Context-based pedestrian path prediction. In ECCV, Cited by: §2.0.3.
  • [14] I. Kotseruba, A. Rasouli, and J. K. Tsotsos (2016) Joint attention in autonomous driving (jaad). External Links: 1609.04741 Cited by: VRUNet: Multi-Task Learning Model for Intent Prediction of Vulnerable Road Users, §1, §3.1.
  • [15] S. Kreiss, L. Bertoni, and A. Alahi (2019-06) PifPaf: composite fields for human pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
  • [16] J. Liang, L. Jiang, J. C. Niebles, A. Hauptmann, and L. Fei-Fei (2019) Peeking into the future: predicting future person activities and locations in videos. External Links: 1902.03748 Cited by: §2.0.3.
  • [17] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár (2014) Microsoft coco: common objects in context. External Links: 1405.0312 Cited by: §3.1, §3.3.
  • [18] S. Ma, L. Sigal, and S. Sclaroff (2016-06) Learning activity progression in lstms for activity detection and early detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.0.1.
  • [19] F. Madrigal, J. Hayet, and F. Lerasle (2014) Intention-aware multiple pedestrian tracking. 2014 22nd International Conference on Pattern Recognition, pp. 4122–4127. Cited by: §2.0.3.
  • [20] H. Manh and G. Alaghband (2018) Scene-lstm: a model for human trajectory prediction. External Links: 1808.04018 Cited by: §2.0.2.
  • [21] A. Rasouli, I. Kotseruba, and J. K. Tsotsos (2017-10) Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §2.0.3, §2.0.3, §4.2.
  • [22] A. Rasouli and J. K. Tsotsos (2018) Autonomous vehicles that interact with pedestrians: a survey of theory and practice. External Links: 1805.11773 Cited by: §1, §2.0.3.
  • [23] E. Rehder and H. Kloeden (2015-12) Goal-directed pedestrian prediction. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §2.0.2.
  • [24] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. CoRR abs/1505.04597. External Links: Link, 1505.04597 Cited by: §3.3.
  • [25] K. Saleh, M. Hossny, and S. Nahavandi (2017-10) Early intent prediction of vulnerable road users from visual attributes using multi-task learning network. pp. 3367–3372. External Links: Document Cited by: §2.0.3.
  • [26] K. Saleh, M. Hossny, and S. Nahavandi (2017-10) Intent prediction of vulnerable road users from motion trajectories using stacked lstm network. pp. 327–332. External Links: Document Cited by: §2.0.3.
  • [27] F. Schneemann and P. Heinemann (2016-10) Context-based detection of pedestrian crossing intention for autonomous driving in urban environments. External Links: Document Cited by: §2.0.3.
  • [28] N. Schneider and D. Gavrila (2013) Pedestrian path prediction with recursive bayesian filters: a comparative study. In GCPR, Cited by: §2.0.3.
  • [29] C. Schöller, V. Aravantinos, F. Lay, and A. Knoll (2019) What the constant velocity model can teach us about pedestrian motion prediction. External Links: 1903.07933 Cited by: §2.0.2.
  • [30] X. Shi, X. Shao, Z. Guo, G. Wu, H. Zhang, and R. Shibasaki (2019-03) Pedestrian trajectory prediction in extremely crowded scenarios. Sensors 19 (5), pp. 1223. External Links: ISSN 1424-8220, Link, Document Cited by: §2.0.2.
  • [31] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao (2016-06) A multi-stream bi-directional recurrent neural network for fine-grained action detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.0.1.
  • [32] V. Veeriah, N. Zhuang, and G. Qi (2015) Differential recurrent neural networks for action recognition. External Links: 1504.06678 Cited by: §2.0.1.
  • [33] N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. External Links: Document Cited by: §3.3.
  • [34] N. Wojke and A. Bewley (2018) Deep cosine metric learning for person re-identification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 748–756. External Links: Document Cited by: §3.3.
  • [35] P. Zhang, W. Ouyang, P. Zhang, J. Xue, and N. Zheng (2019) SR-lstm: state refinement for lstm towards pedestrian trajectory prediction. External Links: 1903.02793 Cited by: §2.0.2.
  • [36] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian (2016) MARS: a video benchmark for large-scale person re-identification. In ECCV, Cited by: §3.3.