Deep Learning-based Vehicle Behaviour Prediction For Autonomous Driving Applications: A Review

12/25/2019 ∙ by Sajjad Mozaffari, et al. ∙ University of Warwick

Behaviour prediction function of an autonomous vehicle predicts the future states of the nearby vehicles based on the current and past observations of the surrounding environment. This helps enhance their awareness of the imminent hazards. However, conventional behaviour prediction solutions are applicable in simple driving scenarios that require short prediction horizons. Most recently, deep learning-based approaches have become popular due to their superior performance in more complex environments compared to the conventional approaches. Motivated by this increased popularity, we provide a comprehensive review of the state-of-the-art of deep learning-based approaches for vehicle behaviour prediction in this paper. We firstly give an overview of the generic problem of vehicle behaviour prediction and discuss its challenges, followed by classification and review of the most recent deep learning-based solutions based on three criteria: input representation, output type, and prediction method. The paper also discusses the performance of several well-known solutions, identifies the research gaps in the literature and outlines potential new research directions.




I Introduction

Adoption of autonomous vehicles in the near future is expected to reduce the number of road accidents and improve road safety [44]. However, for safe and efficient operation on roads, an autonomous vehicle should not only understand the current state of the nearby road-users, but also proactively anticipate their future behaviour (a.k.a. motion or trajectory). One part of this general problem is to predict the behaviour of pedestrians (or, generally speaking, the vulnerable road-users), which is well-studied in the computer vision literature [22, 1, 20, 17]. There are also several review papers on pedestrian behaviour prediction, such as [45, 23, 5]. Another equally important part of the problem is the prediction of the intended behaviour of other vehicles on the road. In contrast to pedestrians, vehicles’ behaviour is constrained by their higher inertia, driving rules and road geometry, which could help reduce the complexity of the problem compared to pedestrian behaviour prediction. Nonetheless, new challenges arise from the interdependency among vehicles’ behaviour, the influence of traffic rules and the driving environment, and the multimodality of vehicles’ behaviour. Practical limitations in observing the surrounding environment and the computational resources required to execute prediction algorithms also add to the difficulty of the problem, as explained in the later sections of this paper.

There are several published survey papers on vehicle behaviour analysis. For example, M. S. Shirazi and B. T. Morris [50] provide a review of vehicle monitoring, behaviour and safety analysis at intersections. B. T. Morris and M. M. Trivedi [53] review unsupervised approaches for vehicle behaviour analysis with a focus on trajectory clustering and topic modelling methods. Anomaly detection techniques using visual surveillance are reviewed in [31]. In the paper most closely related to our work, Lefevre et al. [34] provide a survey on vehicle behaviour prediction and risk assessment in the context of autonomous vehicles. The authors review various conventional approaches that apply physics-based models and/or traditional machine learning algorithms such as Hidden Markov Models, Support Vector Machines, and Dynamic Bayesian Networks. Recent advances in machine learning techniques (e.g., deep learning) have provided new and powerful tools for solving the problem of vehicle behaviour prediction. Such approaches have become increasingly important due to their superiority in complex and realistic scenarios. However, to the best of our knowledge, there is no systematic and comparative review of these deep learning-based approaches. We thus present a review of such studies using a new classification method which is based on three criteria: input representation, output type, and prediction method. In addition, we report the practical limitations of implementing recent solutions in autonomous vehicles. To make the paper self-contained, we also provide a generic problem definition for vehicle behaviour prediction.

The rest of this paper is organised as follows. Section II introduces the basics and the challenges of vehicle behaviour prediction for autonomous vehicles. The definitions of widely used terms and the generic problem formulation are also given in Section II. Section III reviews the related deep learning-based solutions and classifies them based on three criteria: input representation, output type, and prediction method. Section IV discusses the commonly used evaluation metrics, compares the performance of several well-known trajectory prediction models on public highway driving datasets, and highlights the current research gaps in the literature and potential new research directions. The key concluding remarks are given in Section V.

II Basics and Challenges of Vehicle Behaviour Prediction

Object detection and behaviour prediction can be considered as two main functions of the perception system of an autonomous vehicle. While both of them rely on on-board and off-board sensory data, the former aims to localize and classify the objects in the surrounding environment of the autonomous vehicle, whereas the latter provides an understanding of the dynamics of surrounding objects and predicts their future behaviour. Behaviour prediction plays a pivotal role in autonomous driving applications as it supports efficient decision making [57] and enables risk assessment [34]. In this section, we first discuss the challenges of vehicle behaviour prediction, then introduce the terminology we use, and finally present a generic probabilistic formulation of the problem.

II-A Challenges

Vehicles (e.g., cars and trucks) have well-structured motions which are governed by driving rules and environment conditions. In addition, vehicles cannot change their trajectories instantly due to their high inertia. However, vehicle behaviour prediction is not a trivial task due to several challenges. First, there is an interdependency among vehicles’ behaviour: the behaviour of a vehicle affects the behaviour of other vehicles and vice versa. Therefore, predicting the behaviour of a vehicle requires observing the behaviour of surrounding vehicles. Second, road geometry and traffic rules can reshape the behaviour of vehicles. For example, placing a give-way sign at an intersection can completely change the behaviour of vehicles approaching it. Therefore, without considering traffic rules and road geometry, a model trained in a specific driving environment would have limited performance in other driving environments. Third, the future behaviour of vehicles is multimodal, meaning that, given the motion history of a vehicle, there may exist more than one possible future behaviour for it. For example, when a vehicle is slowing down at an intersection without changing its heading direction, both turning right and turning left could be expected. A comprehensive behaviour prediction module in an autonomous vehicle should identify all possible future motions to allow the vehicle to act reliably.

In addition to the intrinsic challenges of the vehicle behaviour prediction problem, implementing a behaviour prediction module in autonomous vehicles comes with several practical limitations. For example, autonomous vehicles can only partially observe the surrounding environment due to sensor limitations (e.g., object occlusion, limited sensor range, and sensor noise). In addition, the computational resources available for on-board implementation in autonomous vehicles are restricted. Therefore, certain assumptions need to be made when defining the behaviour prediction problem for applications in autonomous driving.

II-B Terminology

To define the problem of vehicle behaviour prediction, we adopt the following terms:

  • Target Vehicles (TVs) are the vehicles whose behaviour we are interested in predicting. Most previous works assumed the autonomous vehicle predicts the behaviour of one TV at a time.

  • Ego Vehicle (EV) is the autonomous vehicle which observes the surrounding environment to predict the behaviour of TVs.

  • Surrounding Vehicles (SVs) are the vehicles in the vicinity of the TVs whose behaviour directly affects the TVs’ behaviour. With this definition, the EV can also be considered as an SV if it is close enough to the TVs.

  • Non-Effective Vehicles (NVs) are the vehicles that we assume to have no impact on the TVs’ behaviour.

Figure 1 illustrates an example of a driving scenario using the proposed terminology.

Fig. 1: An illustration of the proposed terminology and limited observability of the EV. The vehicles which are not observable are blurred. For example, the preceding vehicle of the TV, which is not observable by the EV, is changing its lane. This lane change allows the TV to accelerate.

II-C Generic Problem Formulation

We use a probabilistic formulation for vehicle behaviour prediction to cope with the uncertain nature of the problem. We represent the future behaviour of TVs as the trajectories they will traverse in the future, defined as:

$$X_{TV} = \{\, \mathbf{x}_i^t \mid i = 1, \dots, N,\; t = 1, \dots, m \,\}, \qquad \mathbf{x}_i^t = (x_i^t, y_i^t)$$

where $\mathbf{x}_i^t$ represents the Cartesian coordinates of vehicle $i$ in the XY-plane at time step $t$, $N$ is the number of TVs, and $m$ is the length of the prediction window.

The generic problem is formulated as computing the conditional distribution $P(X_{TV} \mid O_{EV})$, where $O_{EV}$ is the EV’s raw/processed sensory data. Most previous studies assume access to an unobstructed top-down view of the driving environment, which can be obtained by infrastructure sensors (e.g. an infrastructure surveillance camera). Such data is available only if the infrastructure shares its sensory data with the EV (e.g., through V2I communication). In addition, it is not cost-effective to cover all road sections with such sensors. Therefore, a behaviour prediction module cannot always rely on infrastructure sensory data. We assume the EV only has access to the data from on-board sensors such as cameras, lidars, and radars. The EV may also exploit the states (e.g., bounding boxes, location, velocity, etc.) of TVs and SVs, which are usually estimated by the object detection module of an autonomous vehicle, and map information, which is usually provided to autonomous vehicles before starting a trip. Nonetheless, the EV may not have full observability of the surrounding environment of TVs, due to sensor impairments (e.g., occlusion) and noise. This needs to be considered when designing a behaviour prediction module for autonomous vehicles.

The distribution $P(X_{TV} \mid O_{EV})$ is a joint distribution over the trajectories of several interdependent vehicles, where each trajectory is a series of continuous variables, making it intractable. To reduce the computational requirement of estimating $P(X_{TV} \mid O_{EV})$, most previous works drop the interdependency among the vehicles’ future behaviour. As such, the behaviour of each TV can be predicted separately with an affordable computational requirement. At each step, one vehicle is selected as the TV and its $P(X_i \mid O_{EV})$ is calculated, where:

$$X_i = \{\, \mathbf{x}_i^t \mid t = 1, \dots, m \,\}$$

and $i$ is the current TV.

III Classifications of Previous Works

Lefevre et al. [34] classify vehicle behaviour prediction models into physics-based, manoeuvre-based, and interaction-aware models. The approaches that use dynamic or kinematic models of vehicles as a prediction method are classified as physics-based models. The models that predict vehicles’ manoeuvres as output are considered manoeuvre-based approaches. Finally, the models that consider the interaction of vehicles in the input are called interaction-aware models. One drawback of this classification is that it does not cover prediction methods other than physics-based ones, or output types other than manoeuvres. It also does not specify how interaction is modelled in the input data (or, generally speaking, how the input is represented).

To overcome these drawbacks, we present three classifications based on three different criteria: input representation, output type, and prediction method. First, we classify previous studies based on how they represent the input data. In this classification, the interaction-aware models are divided into three classes. In the second classification, manoeuvre-based models (intention class) as well as other types of prediction output are explained. We do not include physics-based approaches as they are no longer state-of-the-art; instead, the different deep learning methods used in behaviour prediction are discussed and classified in the last classification. Figure 2 provides the classes and sub-classes for each classification criterion. The following subsections address the classification based on each criterion individually.

Fig. 2: The Three Proposed Classifications for Vehicle Behaviour Prediction

III-A Input Representation Criterion

In this subsection, we provide a classification of previous studies based on the type of input data and how it is represented. We divide them into four classes: track history of the TV, track history of the TV and SVs, simplified bird’s eye view, and raw sensor data. The last three classes can be considered as sub-classes of the interaction-aware approaches which were introduced in [34]. We also discuss the availability of these input data in autonomous driving applications.

III-A1 Track history of the TV

The conventional approach for predicting behaviour of the TV is to only use its current state (e.g. position, velocity, acceleration, heading) or track history of its states over time. This feature can be estimated based on the EV’s observation if the TV is observable by the EV’s sensors.

In [58, 59, 60], the track history of the x/y position, speed, and heading of the TV is used to predict its behaviour at different road junctions. All these works study the behaviour of the TV in an environment without any SVs. Few deep learning-based methods use this input set to predict the vehicle behaviour in a driving environment in the presence of other vehicles [56, 42]. Long Xin et al. [56] argue that the information of SVs is not available due to the EV’s sensor limitations and object occlusion; however, some of the SVs can usually be observed by the EV’s sensors (see Figure 1). Excluding the observable SVs’ states from the input set may result in inaccurate prediction of the TV’s behaviour due to the interdependencies of vehicles’ behaviour.

Although the track history of the TV has highly informative features about its short-term future motion, relying only on the TV’s track history can lead to erroneous results particularly in long-term prediction in crowded driving environments.

III-A2 Track history of TV and SVs

One approach to consider the interaction among vehicles is to explicitly feed the track history of the TV and SVs to the prediction model. The SVs’ states, similar to the TV’s states, can be estimated in the object detection module of the EV; however, some of the SVs can be outside the perception horizon of the EV, or they might be occluded by other vehicles on the road.

Previous studies vary in how they divide the vehicles in the scene into surrounding vehicles (SVs) and non-effective vehicles (NVs). In [43, 10, 9], the history of states of the TV and six of its closest neighbours is exploited to predict the TV’s behaviour. In [26, 13], the three closest vehicles in the TV’s current lane and two adjacent lanes are chosen as reference vehicles. The reference vehicles and the four vehicles in front of and behind the two reference vehicles in adjacent lanes are selected as the SVs. F. Altché and A. de La Fortelle [2] consider nine vehicles in three lanes surrounding the target vehicle, including two vehicles in front of the TV. They indicate that considering more vehicles in the input data can improve the performance of behaviour prediction. For example, in a traffic jam, knowing that the second vehicle ahead of the TV is accelerating can enable early prediction of a speed increase for the TV. Instead of considering a fixed number of vehicles as the SVs, a distance threshold is defined to divide vehicles into SVs and NVs in [36, 14, 12]. Therefore, only the interactions of vehicles within this threshold are modelled in the prediction model.
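The two selection strategies described above (a fixed number of nearest vehicles versus a distance threshold) can be sketched as follows. This is only an illustrative sketch: the function names, the Euclidean distance measure, and the example coordinates and threshold are assumptions, not taken from the cited works.

```python
import math

def select_svs_by_distance(tv_xy, others, threshold_m):
    """Split vehicles into SVs (within threshold_m of the TV) and NVs (beyond it)."""
    svs, nvs = [], []
    for vid, (x, y) in others.items():
        dist = math.hypot(x - tv_xy[0], y - tv_xy[1])
        (svs if dist <= threshold_m else nvs).append(vid)
    return sorted(svs), sorted(nvs)

def select_k_nearest_svs(tv_xy, others, k):
    """Fixed-size alternative: the k vehicles closest to the TV are the SVs."""
    ranked = sorted(
        others,
        key=lambda vid: math.hypot(others[vid][0] - tv_xy[0],
                                   others[vid][1] - tv_xy[1]),
    )
    return ranked[:k], ranked[k:]

# Hypothetical scene: positions in metres relative to an arbitrary origin.
tv = (0.0, 0.0)
others = {"a": (5.0, 0.0), "b": (30.0, 3.5), "c": (-12.0, -3.5), "d": (80.0, 0.0)}

svs, nvs = select_svs_by_distance(tv, others, threshold_m=40.0)
nearest, rest = select_k_nearest_svs(tv, others, k=2)
```

With a threshold, the number of SVs varies with traffic density, so the downstream model has to accept a variable-sized input; the fixed-k variant keeps the input size constant but may include distant vehicles in sparse traffic.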

One drawback of previous studies is that they assume that the states of all surrounding vehicles are always observable, which is not a practical assumption in autonomous driving applications. A more realistic approach should always consider sensor impairments like occlusion and noise in exploiting the features of SVs. In addition, relying only on the track history of the TV and SVs is not sufficient for behaviour prediction, because other factors like environment conditions and traffic rules can also modify the behaviour of vehicles.

III-A3 Simplified Bird’s Eye View

An alternative way to consider the interaction among vehicles is by exploiting a simplified Bird’s Eye View (BEV) of the environment. In this approach, static and dynamic objects, road lane, and other elements of the environment are usually depicted with a collection of polygons and lines in a BEV image. The result is a map-like image which preserves the size and location of objects (e.g. vehicles) and the road geometry while ignoring texture information.

D. Lee et al. [32] fuse front-facing radar and camera data to form a binary two-channel BEV image covering the frontal area of the EV. One of the image channels specifies whether each pixel is occupied by a vehicle or not, and the other depicts the existence of lane marks. For n past frames, the images are produced and stacked together to form a 2n-channel image as the input to the prediction model. Instead of using a sequence of binary images, which indicates the existence of objects over time, a single BEV image is used in [8, 15]. In this image, each element of the scene (e.g., road, crosswalks) loses its actual texture and instead is colour-coded according to its semantics. The vehicles are depicted by colour-coded bounding boxes, and the location history of the vehicles is plotted using bounding boxes with the same colour and a reduced level of brightness.
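To make the 2n-channel input concrete, the following minimal sketch stacks one vehicle-occupancy channel and one lane-mark channel per past frame. It assumes that detections have already been converted to grid cell indices; the function names and the tiny 4x4 grid are hypothetical and do not reproduce the representation of [32] in detail.

```python
def rasterize_frame(vehicle_cells, lane_cells, height, width):
    """Return two binary channels: vehicle occupancy and lane-mark presence."""
    vehicle_ch = [[0] * width for _ in range(height)]
    lane_ch = [[0] * width for _ in range(height)]
    for r, c in vehicle_cells:
        if 0 <= r < height and 0 <= c < width:
            vehicle_ch[r][c] = 1
    for r, c in lane_cells:
        if 0 <= r < height and 0 <= c < width:
            lane_ch[r][c] = 1
    return [vehicle_ch, lane_ch]

def stack_frames(frames, height, width):
    """Concatenate the channel pairs of n frames into a 2n-channel image."""
    stacked = []
    for vehicle_cells, lane_cells in frames:
        stacked.extend(rasterize_frame(vehicle_cells, lane_cells, height, width))
    return stacked

# Hypothetical example: one vehicle moving up a 4x4 grid over n = 3 frames,
# next to a lane mark in column 2.
lane = [(r, 2) for r in range(4)]
frames = [([(3, 1)], lane), ([(2, 1)], lane), ([(1, 1)], lane)]
bev = stack_frames(frames, height=4, width=4)  # indexed [channel][row][col]
```

Even-numbered channels hold vehicle occupancy and odd-numbered channels hold lane marks, so the motion of the vehicle is visible only by comparing channels 0, 2, and 4.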

To enrich the temporal information within the BEV image, N. Deo and M. M. Trivedi [11] use a social tensor, which was first introduced in [1] (known as the social pooling layer). A social tensor is a spatial grid around the target vehicle whose occupied cells are filled with the processed temporal data (e.g., LSTM hidden state value) of the corresponding vehicle. Therefore, a social tensor contains both the temporal dynamics of the vehicles and the spatial interdependencies among them. N. Lee et al. [33] use a social pooling layer as an additional input to another BEV representation, which is created by performing semantic segmentation on the front-facing camera image of the EV and transforming it to the BEV.
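A social tensor of the kind described above can be sketched as a grid centred on the target vehicle, where each occupied cell stores the encoding of the vehicle inside it. In this sketch the hidden states are placeholder lists standing in for LSTM encoder outputs, and the grid size, cell size, and function name are assumptions made for illustration.

```python
import math

def build_social_tensor(tv_xy, neighbours, rows, cols, cell_size, hidden_dim):
    """Fill a rows x cols grid centred on the TV with neighbour encodings.

    neighbours is a list of ((x, y), hidden_state) pairs; empty cells stay zero.
    """
    tensor = [[[0.0] * hidden_dim for _ in range(cols)] for _ in range(rows)]
    for (x, y), hidden in neighbours:
        # Offsets are measured from the centre of the TV's own cell.
        col = math.floor((x - tv_xy[0] + cell_size / 2) / cell_size) + cols // 2
        row = math.floor((y - tv_xy[1] + cell_size / 2) / cell_size) + rows // 2
        if 0 <= row < rows and 0 <= col < cols:
            tensor[row][col] = list(hidden)
    return tensor

# Hypothetical neighbours: two inside the grid and one far outside it.
neighbours = [
    ((6.0, 0.0), [0.1, 0.2]),
    ((-7.0, 4.0), [0.3, 0.4]),
    ((50.0, 0.0), [0.9, 0.9]),
]
tensor = build_social_tensor((0.0, 0.0), neighbours, rows=3, cols=3,
                             cell_size=5.0, hidden_dim=2)
```

Because the grid preserves relative positions, a convolutional layer applied to this tensor can pick up spatial patterns among the encoded motion histories; vehicles outside the grid are simply ignored.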

The aforementioned works do not consider sensor impairment in the input representation. To overcome this drawback, a dynamic occupancy grid map (DOGMa [41]) is exploited in [25, 49]. DOGMa is created from the data fusion of a variety of sensors and provides a BEV image of the environment. The channels of this image contain the probability of occupancy and velocity estimate for each pixel. The velocity information helps distinguish between static and dynamic objects in the environment; however, it does not provide complete knowledge about the history of dynamic objects.

One advantage of the simplified BEV is that it is flexible in terms of the complexity of the representation. Thus, it can match applications with different computational resource constraints. Furthermore, the data gathered from different sensors can be fused into a single BEV representation. The drawback of this input set is that it does not include the scene texture information.

| Class | Advantages | Disadvantages | Works | Summary |
|---|---|---|---|---|
| Track History of the TV | Complies with limited observability of the EV | Does not consider the impact of environment and interaction among vehicles on TV’s behaviour | [58, 60], [59, 56], [42] | Track history of the TV’s states (e.g. position, velocity, heading, etc.) |
| Track History of TV and SVs | Considers the impact of interaction among vehicles on TV’s behaviour | Does not consider the impact of environment on TV’s behaviour; the states of SVs are not always observable | [10, 43], [9] | History of states for the TV and six SVs |
| | | | [26, 13] | History of states for the TV, three reference vehicles, and four vehicles adjacent to them |
| | | | [2] | History of states for the TV and nine SVs |
| | | | [36, 14], [12] | A distance threshold is defined to divide vehicles into the SVs and NVs |
| Simplified Bird’s Eye View | Considers the impact of environment and interaction among vehicles on TV’s behaviour; facilitates fusing the data gathered from different sensors on the EV; flexible in terms of complexity of representation; can comply with limited observability of the EV | Potentially useful information in scene texture is lost | [32] | A sequence of two-channel top-down images covering the environment in front of the TV, indicating the existence of vehicles and lane lines over time |
| | | | [8, 15] | A BEV image of the environment, in which the road elements and vehicles are represented with colour-coded polygons and lines |
| | | | [11] | A top-down grid representation. Each occupied cell is filled with the corresponding vehicle’s LSTM hidden state (similar to [1]) |
| | | | [33] | Semantic segmentation of the environment in bird’s eye view |
| | | | [25, 49] | A top-down grid representation. Each cell contains the probability of the cell’s occupation and its velocity |
| Raw Sensor Data | Complies with limited observability of the EV; no information loss | High computational cost | [37] | 3D point cloud data over several time steps |
| | | | [6] | Lidar data and rasterized map (the representation used in [15]) |

TABLE I: Summary of classification of previous works based on input representation and the advantages/disadvantages of each class

III-A4 Raw sensor data

In this approach, raw sensor data is fed to the prediction model. Thus, the input data contains all available knowledge about the surrounding environment. This allows the model to learn to extract useful features from all available sensory data.

Raw sensor data has a larger dimension than the previous input representations. Therefore, more computational resources are required to process the input data, which can make it impractical for on-board implementation in autonomous vehicles. One solution to this problem is to share the computational resources among different functions of the autonomous vehicle. In the deep learning literature, it is common to train a model for multiple tasks [46]. In an autonomous vehicle, the object detection module exploits raw sensor data, and it usually relies on a model with millions of parameters [3]. Thus, it can be a candidate for parameter sharing with the behaviour prediction module.

W. Luo et al. [37] use a deep neural network to jointly solve the problems of 3D detection, tracking, and motion forecasting for autonomous vehicles. They exploit 3D point cloud data over several time frames. The data is represented in bird’s eye view, and the height is considered as the channel dimension. The same approach is used to exploit the lidar data in [6]; however, the 3D point cloud data is fed to the deep model in addition to a simplified BEV.
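The idea of representing a point cloud in bird’s eye view with height as the channel dimension can be illustrated with a simple binary voxel grid. The ranges, cell size, number of height bins, and function name below are assumptions chosen for a small example; they are not the settings used in [37] or [6].

```python
def voxelize_bev(points, x_range, y_range, z_range, cell, z_bins):
    """Binary occupancy grid indexed [height_bin][row][col].

    Rows and columns span the ground plane; the height bins act as image channels.
    """
    cols = int(round((x_range[1] - x_range[0]) / cell))
    rows = int(round((y_range[1] - y_range[0]) / cell))
    z_step = (z_range[1] - z_range[0]) / z_bins
    grid = [[[0] * cols for _ in range(rows)] for _ in range(z_bins)]
    for x, y, z in points:
        inside = (x_range[0] <= x < x_range[1]
                  and y_range[0] <= y < y_range[1]
                  and z_range[0] <= z < z_range[1])
        if not inside:
            continue  # drop points outside the region of interest
        c = min(cols - 1, int((x - x_range[0]) / cell))
        r = min(rows - 1, int((y - y_range[0]) / cell))
        h = min(z_bins - 1, int((z - z_range[0]) / z_step))
        grid[h][r][c] = 1
    return grid

# Hypothetical lidar returns (x, y, z) in metres; the last one is out of range.
points = [(0.5, -1.5, 0.2), (3.2, 1.1, 1.7), (10.0, 0.0, 0.5)]
grid = voxelize_bev(points, x_range=(0.0, 4.0), y_range=(-2.0, 2.0),
                    z_range=(0.0, 2.0), cell=1.0, z_bins=2)
```

Grids from several consecutive sweeps could then be stacked along an additional time axis, which is one way to present data "over several time frames" to a convolutional network.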

Table I provides a summary of classification of previous studies based on input representation. It also summarizes the advantages and disadvantages of each class.

III-B Output Type Criterion

In this subsection, we classify previous studies based on how they represent a vehicle’s future behaviour in the output of their prediction model. We consider four classes: intention, unimodal trajectory, intention-based trajectory, and multimodal trajectory.

III-B1 Intention

Vehicle intention (a.k.a. manoeuvre) prediction is the task of classifying the future behaviour of vehicles using a set of pre-defined discrete classes. For example, in highway driving, the set of classes could be left lane change, right lane change, and keeping the lane; while in an intersection, the set of classes could be: go straight, turn left, and turn right.

To predict the intention of a vehicle approaching a T-junction, A. Zyner et al. [59] define three classes based on the destination of the vehicle, namely “east”, “west”, or “south”. In [58], the same set of classes is used to predict the intention of a vehicle at an un-signalized roundabout. D. J. Phillips et al. [43] design a generalizable intention prediction model that can predict the direction of travel of a vehicle up to 150m before reaching three- and four-way intersections. W. Ding et al. [13] and D. Lee et al. [32] apply intention prediction to highway driving scenarios. The former propose an intention prediction model to predict lane change and lane keeping behaviour for the TV, while the latter design a model to predict the cut-in intention of right/left preceding TVs w.r.t. the EV.

Previous studies predict the intention of vehicles using a set of only a few classes. One drawback of these works is that they can only provide a high-level understanding of the vehicle behaviour. This problem can be solved by subdividing high-level manoeuvres into sub-classes that describe the behaviour more precisely. For example, in a highway driving scenario, we can subdivide lane change classes into sharp lane change and normal lane change. Another drawback is the specificity of the manoeuvre set to a single driving environment, which can be resolved by defining a set that contains the manoeuvres in all desired driving scenarios. However, to predict a vehicle’s behaviour using a large and in-depth set of classes, a larger and more diverse training dataset that includes sufficient samples in each class is required. In addition, a larger model capacity is needed to learn the mapping of the input data to the intention set.

III-B2 Unimodal trajectory

Trajectory prediction models describe the future behaviour of a vehicle by predicting a series of future locations of the TV over a time window. Dealing with the continuous output of trajectory prediction models can add more complexity to the problem compared to the discrete output of intention prediction models. However, predicting a trajectory instead of an intention provides more precise information about the future behaviour of vehicles. Given a specific driving situation and history of motion for a vehicle, it might be possible for it to traverse multiple different trajectories. Therefore, the corresponding distribution has multiple modals. Unimodal trajectory predictors are the models that only predict one of these possible trajectories (usually the one with the highest likelihood).

The straightforward approach to predict the trajectory of the TV is to estimate its position over time [27, 36]. The predictor model can also estimate the displacement of the TV relative to its last position at each step [12, 9]. The other approach, used in [2], is to predict lateral position and longitudinal velocity separately. This approach can be especially useful when the region of interest is longitudinally large, so that the longitudinal position can take very large values. In addition to the position and velocity, the heading angle of the vehicle is predicted in [37]. To cope with the uncertainty of the trajectory prediction problem, Nemanja Djuric et al. [15] propose a trajectory prediction model that estimates the standard deviation for the predicted x- and y-positions.
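The relationship between a displacement-based output and an absolute-position output can be sketched as a cumulative sum. This sketch assumes that each predicted displacement is measured from the position at the previous step, which is one reading of the description above; the function names and numbers are hypothetical.

```python
def displacements_to_positions(last_xy, displacements):
    """Accumulate per-step displacements, starting from the last observed position."""
    positions = []
    x, y = last_xy
    for dx, dy in displacements:
        x, y = x + dx, y + dy
        positions.append((x, y))
    return positions

def positions_to_displacements(last_xy, positions):
    """Inverse mapping: differences between consecutive positions."""
    displacements = []
    prev_x, prev_y = last_xy
    for x, y in positions:
        displacements.append((x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return displacements

# Hypothetical prediction: 1 m forward per step with a slight lateral drift.
last_observed = (10.0, 2.0)
predicted_steps = [(1.0, 0.0), (1.0, 0.5), (1.0, 0.5)]
trajectory = displacements_to_positions(last_observed, predicted_steps)
```

Displacements stay in a small numeric range regardless of where the vehicle is on the road, which is a similar motivation to predicting longitudinal velocity instead of longitudinal position.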

The main disadvantage of aforementioned methods is that they do not fully represent the vehicle behaviour prediction space which can have more than one modality. Furthermore, their models may converge to the average of all the possible modals because the average can minimize the displacement error of unimodal trajectory prediction; however, the average of modals is not necessarily a valid future behaviour [60]. Figure 3 illustrates this problem.

Fig. 3: An illustration of invalidity of the average of manoeuvres. The red car is approaching the green car on the road. It is probable for the red car to either reduce its speed (green dots) or change its lane (blue dots). A unimodal trajectory predictor may predict an average of these two manoeuvres (red dots) to reduce the prediction error. However, the average of these two manoeuvres is not a valid manoeuvre since it results in a collision with the preceding vehicle.
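The averaging problem can be reproduced numerically. In the sketch below, two equally likely futures (braking in lane and changing lane) are given as hypothetical waypoints, and the expected squared displacement error of a single prediction is computed over both futures. All numbers and names are made up for illustration.

```python
def mean_trajectory(a, b):
    """Point-wise average of two trajectories of equal length."""
    return [((ax + bx) / 2, (ay + by) / 2) for (ax, ay), (bx, by) in zip(a, b)]

def expected_sq_error(pred, modes):
    """Mean squared displacement error of one prediction over equally likely modes."""
    total = 0.0
    for mode in modes:
        step_errors = [(px - mx) ** 2 + (py - my) ** 2
                       for (px, py), (mx, my) in zip(pred, mode)]
        total += sum(step_errors) / len(step_errors)
    return total / len(modes)

# Hypothetical futures (x = longitudinal, y = lateral, lane width 3.5 m).
brake = [(5.0, 0.0), (9.0, 0.0), (12.0, 0.0)]
lane_change = [(6.0, 1.0), (12.0, 2.5), (18.0, 3.5)]
modes = [brake, lane_change]

average = mean_trajectory(brake, lane_change)
err_average = expected_sq_error(average, modes)
err_brake = expected_sq_error(brake, modes)
err_lane_change = expected_sq_error(lane_change, modes)
```

The averaged trajectory has the lowest expected error of the three candidates, yet it ends at a lateral offset of 1.75 m, straddling the lane boundary while still moving forward faster than the braking manoeuvre. A model trained only to minimise displacement error is therefore rewarded for a prediction that matches neither manoeuvre.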

III-B3 Intention-based trajectory

To deal with the multimodal nature of vehicle behaviour prediction, one approach is to estimate the likelihood of each member of a predefined intention (i.e. policy/manoeuvre/behaviour modal) set and predict the trajectory that corresponds to the most probable intention.

Long Xin et al. [56] propose an intention-aware model to predict the trajectory based on the estimated lane change intention for the TV in highway driving. In [14, 6], the intention set is extended from only lane change intentions to turning right, turning left, stopping, and so on. This allows using the prediction model in urban driving.

Unlike unimodal trajectory predictors, intention-based trajectory prediction approaches are unlikely to converge to the mean of modals, as in these approaches the predicted trajectory corresponds to one of the predefined behaviour modals. However, there are two main drawbacks in these approaches. First, intention-based trajectory predictors cannot accurately predict a vehicle’s trajectory if the vehicle’s intention does not exist in the predefined intention set. This problem can commonly occur in complex driving scenarios, as it is hard to predetermine all possible driving intentions in such environments. Second, unlike unimodal trajectory prediction, we need to manually label the intention of vehicles in the training dataset, which is time-consuming, expensive and error-prone.
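The selection step of an intention-based trajectory predictor can be sketched as follows. The intention scores and per-intention trajectories are hypothetical stand-ins for the outputs of learned models, and the softmax normalisation is an assumption about how scores would be turned into likelihoods.

```python
import math

def softmax(scores):
    """Convert raw intention scores into probabilities that sum to one."""
    top = max(scores.values())
    exps = {name: math.exp(value - top) for name, value in scores.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}

def predict_intention_based(scores, trajectories):
    """Return the most probable intention, its probability, and its trajectory."""
    probs = softmax(scores)
    best = max(probs, key=probs.get)
    return best, probs[best], trajectories[best]

# Hypothetical outputs of an intention classifier and per-intention trajectory heads.
scores = {"keep_lane": 1.0, "left_lane_change": 2.5, "right_lane_change": -0.5}
trajectories = {
    "keep_lane": [(5.0, 0.0), (10.0, 0.0)],
    "left_lane_change": [(5.0, 1.5), (10.0, 3.5)],
    "right_lane_change": [(5.0, -1.5), (10.0, -3.5)],
}
intention, probability, trajectory = predict_intention_based(scores, trajectories)
```

The output is always one of the predefined trajectories, which avoids the averaging problem but also shows the first drawback noted above: a manoeuvre outside the dictionary keys can never be predicted.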

| Class | Advantages | Disadvantages | Works | Summary of Output Type |
|---|---|---|---|---|
| Intention | Usually has low computational cost | Only provides a high-level understanding of the vehicle behaviour; usually covers manoeuvres that are specifically defined for a single driving environment | [58, 59] | The destination of travel at roundabout and T-junction |
| | | | [43] | The probabilities of turning right, left, and going straight at intersections |
| | | | [13] | Lane change behaviour of the TV in highway driving |
| | | | [32] | Right/left cut-in of left/right preceding TVs |
| Unimodal trajectory | Less computational cost compared to multimodal models | Does not fully represent the vehicle behaviour prediction space, which is multimodal; prone to converge to the mean of behaviour modals, which itself might not be a valid prediction | [27, 36] | The position of the TV(s) over time |
| | | | [12, 9] | The displacement of the TV relative to its last position for each step |
| | | | [2] | Longitudinal velocity and lateral position over time |
| | | | [15] | The x-y position and the standard deviation |
| | | | [37] | The bounding box (e.g. location and heading angle) of the TV |
| Intention-based trajectory | Fixes the problem of converging to the mean of behaviour modals | Prone to trajectory prediction error if the vehicle’s intention is not among the pre-defined intentions; manual labelling is required | [56] | The TV’s trajectory (based on lane change estimation for highway driving) |
| | | | [14, 6] | The TV’s trajectory (based on an intention estimation for urban driving) |
| Multimodal trajectory | Solves the disadvantages of intention-based trajectory prediction models (dynamic intention sets) | Prone to converge to one behaviour modal and not to explore all behaviour modals (dynamic intention sets); high computational cost (both static and dynamic intention sets) | [11, 10] | Static Intention Set: the trajectory distribution for each of six predefined manoeuvres and their probability |
| | | | [33, 60, 42] | Dynamic Intention Set: a number of samples from the estimated distribution of trajectory |
| | | | [25, 49] | Dynamic Intention Set: the probability of occupancy for each cell of the BEV grid map of the surrounding environment |
| | | | [8] | Dynamic Intention Set: a number of deterministic trajectory sequences and their probabilities |

TABLE II: Summary of classification of previous works based on output type and the advantages/disadvantages of each class

III-B4 Multimodal trajectory

Multimodal trajectory prediction models predict one trajectory per behaviour modal. They also estimate the probability of each modal. We divide multimodal prediction approaches into two sub-categories:

  • Static Intention set: In this sub-class, a set of intentions is explicitly defined and the trajectories are predicted for each member of this set. In [11, 10], a set of six manoeuvre classes for highway driving is defined and the trajectory distribution for each manoeuvre class is predicted. Predicting the distribution allows them to model the uncertainty of trajectory prediction for each manoeuvre separately. Their model also predicts the likelihood of each manoeuvre.

  • Dynamic Intention set: In these approaches, the intention set can be dynamically learnt by the trajectory prediction model based on the driving scenario. Henggang Cui et al. [8] develop a model that predicts a fixed number of deterministic trajectory sequences and their probabilities. Each of these sequences can correspond to a possible manoeuvre in the driving environment. In [33, 60, 42], the distribution of the vehicles’ trajectory is modelled. Then, a fixed number of trajectory sequences are sampled from the modelled distribution and ranked based on their likelihood. In [25, 49], the trajectory is predicted by estimating the TV’s occupancy likelihood for each cell in the dynamic occupancy grid map (DOGMa [41]) and each time step in the prediction horizon. They create the DOGMa by assigning a grid map to a bird’s eye view of the environment around the EV. Their model can dynamically predict multiple trajectory modals by assigning high probability to separate groups of cells in front of a TV.

The first sub-category of multimodal approaches can be considered an extension of intention-based trajectory prediction approaches. The only difference is that multimodal trajectory predictors with a static intention set predict the trajectories for all the behaviour modals rather than only for the modal with the highest likelihood. Therefore, the drawbacks we mentioned for intention-based trajectory prediction approaches, namely the difficulty of defining a comprehensive intention set and the manual labelling of intentions in the training dataset, are not solved here. In contrast, the approaches in the second sub-category are exempt from these two problems, as they do not require a pre-defined intention set. However, due to the dynamic definition of manoeuvres, they are prone to converging to a single manoeuvre [60] or to failing to explore all the existing manoeuvres.
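
The difference between a multimodal output and a single averaged trajectory can be illustrated with a minimal Python sketch. It assumes a hypothetical T-junction with two hand-made, equally likely modals; the numbers and function names are illustrative and are not taken from any of the reviewed works. A multimodal prediction is represented as a list of (probability, trajectory) pairs, and the probability-weighted mean trajectory is compared with the trajectory of the most likely modal.

```python
# Hypothetical multimodal prediction at a T-junction. Each modal is a
# (probability, trajectory) pair; a trajectory is a list of (x, y) points in metres.
modals = [
    (0.5, [(0.0, 5.0), (-3.0, 9.0), (-8.0, 10.0)]),  # turn left
    (0.5, [(0.0, 5.0), (3.0, 9.0), (8.0, 10.0)]),    # turn right
]


def weighted_mean_trajectory(modals):
    """Probability-weighted average of all modals at every time step."""
    steps = len(modals[0][1])
    mean = []
    for t in range(steps):
        x = sum(p * traj[t][0] for p, traj in modals)
        y = sum(p * traj[t][1] for p, traj in modals)
        mean.append((x, y))
    return mean


def most_likely_trajectory(modals):
    """Trajectory of the modal with the highest probability (first one wins ties)."""
    return max(modals, key=lambda m: m[0])[1]


mean_traj = weighted_mean_trajectory(modals)
best_traj = most_likely_trajectory(modals)
print("mean of modals:", mean_traj)
print("most likely modal:", best_traj)
```

In this toy case the averaged trajectory ends at (0.0, 10.0), straight ahead between the two turns, which neither modal follows. Keeping the modals separate, together with their probabilities, avoids this.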

Table II provides a summary of classification of previous studies based on output type. It also summarizes the advantages and disadvantages of each class.

III-C Based on Prediction Method

In this subsection, we classify previous studies based on the prediction model used into three classes, namely recurrent neural networks, convolutional neural networks, and other methods.

Class | Work | Summary of Prediction Method

Recurrent neural networks:
- [59, 58, 43] Single RNN: A multi-layer LSTM network is used as a sequence classifier.
- [14] Single RNN: A two-layer LSTM is used to predict the parameters of the acceleration distribution.
- [2] Single RNN: A single-layer LSTM is used to predict the future x-y position of the TV.
- [42] Single RNN: An encoder-decoder LSTM is used to predict the probability of occupancy on a BEV grid.
- [13] Multiple RNNs: A group of GRUs is used to model the pairwise interaction between the TV and each of the SVs.
- Multiple RNNs: One group of LSTMs is used to model individual vehicles’ trajectories; another group is used to model pairwise interaction.
- Multiple RNNs: One LSTM is used to estimate the target lane; another LSTM is used to predict the trajectory based on the estimated target lane.
- Multiple RNNs: Multi-layer LSTMs are used to predict mixtures of Gaussian distributions.
- Multiple RNNs: One LSTM encoder is applied to the input sequence. The hidden state is fed to six LSTM decoders (one per manoeuvre). Another LSTM encoder is used to predict the probability of each manoeuvre.

Convolutional neural networks:
- A six-layer CNN with convolution and fully connected layers is used to predict the intention of surrounding
- [8, 15] MobileNetV2 [47] is used as a feature extractor.
- [25] A convolution-deconvolution architecture, introduced in [40], is used to predict vehicle behaviour.
- First, 3D convolutions are applied to the temporal dimension of the input data. Then, a series of 2D convolutions is used to capture spatial features. Finally, two branches of convolution layers are used to find the probability of being a vehicle and to predict the bounding box over current and future frames.
- First, two backbone CNNs are used to extract the features of the lidar data and the rasterized map separately. Then, three different networks are applied to the concatenation of the extracted features to generate detection, intention and trajectory

Other Methods:
- [26] Fully-connected Neural Networks: The parameters of the vehicle behaviour distribution are estimated using a multi-layer fully-connected network.
- Combination of RNNs and CNNs: An LSTM is applied to each vehicle trajectory. The results are represented in a BEV grid structure and then fed to a CNN. The output is fed to six LSTM decoders (one per manoeuvre).
- Combination of RNNs and CNNs: A convolutional network extracts spatial features from the input image. These features are fed to an encoder-decoder LSTM. The result is fed to a deconvolution network to map it to an output image with the same size as the input.
- Combination of RNNs and CNNs: A CVAE-based encoder-decoder GRU generates the trajectory distribution. A number of samples from this distribution are ranked and refined based on contextual features.
- Graph Neural Networks: Graph Convolutional Network (GCN [28]) and Graph Attention Network (GAT [55]) are used with some adaptations.
- Graph Neural Networks: A graph convolutional model is used, which consists of several convolutional and graph operation layers.

TABLE III: Summary of classification of previous works based on the prediction method and the advantages/disadvantages of each class

III-C1 Recurrent neural networks

The simplest recurrent neural network (a.k.a. vanilla RNN) can be considered an extension of a two-layer fully-connected neural network in which the hidden layer has a feedback connection. This small change allows sequential data to be modelled more efficiently. At each sequence step, the vanilla RNN processes the input data from the current step alongside the memory of past steps, which is carried in the previous hidden neurons. A vanilla RNN with a sufficient number of hidden units can, in principle, learn to approximate any sequence-to-sequence mapping [21]. However, in practice it is difficult to train this network to learn long sequences due to vanishing or exploding gradients, which is why gated RNNs were introduced [19]. In these networks, instead of a simple fully connected hidden layer, a gated architecture is used for processing sequential data. Long short-term memory (LSTM) and gated recurrent unit (GRU) [7] are the most commonly used gated RNNs. In vehicle behaviour prediction, LSTMs are the most used deep models. Here, we sub-categorize recent studies based on the complexity of network architecture:

  • Single RNN: In these models, either a single recurrent neural network is used in the simplest form of behaviour prediction (e.g., intention prediction or unimodal trajectory prediction) or a secondary model is used alongside a single RNN to support more sophisticated features like interaction-awareness and/or multimodal prediction. To predict the intention of vehicles, an LSTM is used by [59, 58, 43] as a sequence classifier. In this task, a sequence of features is fed to successive cells of an LSTM. Then, the hidden state of the last cell in the sequence is mapped to the output dimension, which is the number of defined classes. In [59, 58], the input is embedded using a fully-connected layer and is fed to a three-layer LSTM, while a two-layer LSTM without embedding is used in [43]. F. Altché and A. de La Fortelle [2] use a single-layer LSTM to predict the future x-y position of the TV as a regression task. Despite having fewer parameters and less complexity, single-layer LSTMs are reported to achieve competitive results compared to their multilayer counterparts in some tasks [38, 29]. To predict an intention-based trajectory, W. Ding and S. Shen [14] use an LSTM encoder to predict the intention of the TV using its states. Then, the predicted intention and map information are used to generate an initial future trajectory for the TV. Finally, a nonlinear optimization method is used to refine the initial future trajectory based on the vehicles’ interaction, traffic rules (e.g. red lights), and road geometry. To predict multimodal behaviour, A. Zyner et al. [60] first use an encoder-decoder three-layer LSTM to predict the parameters of a weighted Gaussian Mixture Model (GMM) for each step of the future trajectory. Then, a clustering approach is used to extract the trajectories that correspond to the modals with the highest probabilities. Seong Hyeon Park et al. [42] use an encoder-decoder LSTM to predict the probability of occupancy on a grid map and apply a beam search algorithm [39] to select the most probable future trajectory candidates.

  • Multiple RNNs: To deal with multimodality and/or interaction awareness within recurrent neural networks, an architecture of several RNNs is usually used in previous studies. W. Ding et al. [13] use a group of GRU encoders to model the pairwise interaction between the TV and each of the SVs, based on which the intention of the TV is predicted for a longer horizon. S. Dai et al. [9] use two groups of LSTM networks for the TV’s trajectory prediction, one group for modelling the individual trajectories of the TV and each of the SVs, and the other for modelling the interaction between the TV and each of the SVs. L. Xin et al. [56] exploit one LSTM to predict the target lane of the TV and another LSTM to predict the trajectory based on the TV’s states and the predicted target lane. To predict multimodal trajectories, N. Deo and M. M. Trivedi [10] use six different decoder LSTMs, which correspond to six specific manoeuvres of highway driving. An encoder LSTM is applied to the past trajectory of vehicles. The hidden state of each decoder LSTM is initialized with the concatenation of the last hidden state of the encoder LSTM and a one-hot vector representing the manoeuvre specific to each decoder. The decoder LSTMs predict the parameters of a manoeuvre-conditioned bivariate Gaussian distribution for the future location of the vehicle. Another encoder LSTM is also used to predict the probability of each of the six manoeuvres.
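
The sequence-classification scheme described above (feed a feature sequence step by step, then map the last hidden state to class scores) can be sketched in plain Python. This is a minimal illustration that uses a vanilla RNN cell rather than an LSTM, with untrained random weights; the feature choice, layer sizes and all names are assumptions made for the example, not details of any reviewed model.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(x, h_prev, W_x, W_h, b):
    """One vanilla RNN step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    a = matvec(W_x, x)
    r = matvec(W_h, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def classify_sequence(seq, params):
    """Run the RNN over the sequence and map the last hidden state to class probabilities."""
    W_x, W_h, b, W_out = params
    h = [0.0] * len(b)
    for x in seq:
        h = rnn_step(x, h, W_x, W_h, b)
    return softmax(matvec(W_out, h))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Assumed sizes: 2 input features, 4 hidden units, 3 intention classes.
n_in, n_hidden, n_classes = 2, 4, 3
params = (
    random_matrix(n_hidden, n_in, rng),      # W_x (untrained)
    random_matrix(n_hidden, n_hidden, rng),  # W_h (untrained)
    [0.0] * n_hidden,                        # b
    random_matrix(n_classes, n_hidden, rng), # W_out (untrained)
)
# Hypothetical normalized track history: (lateral offset, speed) per time step.
track = [(0.0, 0.9), (0.1, 0.9), (0.3, 0.8), (0.6, 0.8)]
probs = classify_sequence(track, params)
print(probs)
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows the data flow. A gated cell such as an LSTM would replace `rnn_step` while the surrounding loop and output mapping stay the same.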

III-C2 Convolutional neural networks

Convolutional neural networks (CNNs) include convolution layers, where a filter with learnable weights is convolved over the input, pooling layers, which reduce the spatial size of the input by subsampling, and fully-connected layers, which map their input to the desired output dimension. CNNs are commonly used to extract features from image data. They have achieved successful results in the computer vision domain [30, 16]. This success motivates researchers in other domains to represent their data as an image so that CNNs can be applied to it [35]. Recently, one-dimensional CNNs have also been widely used to extract features from one-dimensional signals [54].

D. Lee et al. [32] use a six-layer CNN to predict the intention of surrounding vehicles using a binary BEV representation. MobileNetV2 [47], which is a memory-efficient CNN designed for mobile applications, is used in [8, 15] to extract relevant features from a relatively complex BEV representation. S. Hoermann et al. [25] use a convolution-deconvolution architecture, which was previously introduced in [40] for the image segmentation task, to output the probability of occupancy for future time steps in a BEV image. This model first generates a feature vector using a convolutional network. Then, a deconvolutional network is used to upscale this vector to the output image. A more complex architecture is used in [6, 37] to deal with the tasks of object detection and behaviour prediction simultaneously. In [37], 3D convolution is performed on the temporal dimension of a 4D representation of voxelized lidar data to capture temporal features, then a series of 2D convolutions is applied to extract spatial features. Finally, two branches of convolution layers are added to predict the bounding boxes over the detected objects for current and future frames and to estimate the probability of being a vehicle for the detected objects, respectively. In [6], two backbone CNNs are used to separately process the BEV lidar input data and the rasterized map. The extracted features are concatenated and fed to three different networks to detect the vehicles, estimate their intention, and predict their trajectories.
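
To make the convolution and pooling operations concrete, the following minimal Python sketch applies one hand-set 2x2 filter and one max-pooling step to a small binary BEV grid. The grid, the filter weights and the function names are hypothetical; in a real CNN the filter weights are learnt and many filters and layers are stacked.

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation, the operation deep learning libraries call convolution."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


# Hypothetical 5x5 binary BEV grid: 1 marks a cell occupied by a vehicle.
bev = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
kernel = [[1, 1], [1, 1]]  # hand-set filter that counts occupied cells in a 2x2 window

features = conv2d_valid(bev, kernel)  # 4x4 feature map
pooled = max_pool2d(features)         # 2x2 map after subsampling
print(features)
print(pooled)
```

The feature map peaks at 4 where the filter fully covers the 2x2 vehicle, and pooling keeps that peak while halving the spatial size.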

III-C3 Other Methods

  • Fully-connected Neural Networks: A simplistic approach to vehicle behaviour modelling is to rely only on the current state of the vehicles, which might be inevitable due to the unavailability of the vehicles’ state history or due to a first-order Markov assumption. In this case, the input data is not a sequence and any feed-forward neural network (e.g. a fully-connected neural network) can be used instead of an RNN. In [4], it is shown that in some driving scenarios, feed-forward neural networks can have competitive results with faster processing time compared to recurrent neural networks. Yeping Hu et al. [26] use a multi-layer fully connected network to predict the parameters of a Gaussian Mixture Model (GMM). The GMM models the multimodal distribution of arriving time and final location for the TV.

  • Combination of RNNs and CNNs: In previous works, recurrent neural networks are used because of their temporal feature extracting power, and convolutional neural networks are used for their spatial feature extracting ability. This inspires some researchers to use both in their models to process both the temporal and spatial dimensions of the data. N. Deo and M. M. Trivedi [11] use one encoder LSTM per vehicle to extract the temporal dynamics of the vehicle. The internal states of these LSTMs form a social tensor, which is fed to a convolutional neural network to learn the spatial interdependencies. Finally, six decoder LSTMs are used to produce the manoeuvre-conditioned distribution for the future trajectory of the TV. In [49], a CNN is applied to each simplified BEV frame representing the environment around the TV. Then, the sequence of extracted features is fed to an encoder-decoder LSTM to learn the temporal dynamics of the input data. The decoder LSTM outputs are fed to a deconvolutional neural network to produce output images which represent how the environment around the TV will evolve in the following time steps. In [33], an encoder-decoder GRU is used to generate the distribution of trajectories, then multiple samples of this distribution are fed to a decoder GRU to refine and rank them. The latter module also receives the contextual features, which are extracted by a CNN model applied to the scene representation.

  • Graph Neural Networks: The vehicles in a driving scenario and their interaction can be considered as a graph in which the nodes are the vehicles and the edges represent the interaction among them. Using this representation, Graph Neural Networks (GNNs) [18, 48] can be used to predict the TV’s behaviour. F. Diehl et al. [12] compare the trajectory prediction performance of two state-of-the-art graph neural networks, namely, Graph Convolutional Network (GCN) [28] and Graph Attention Network (GAT) [55]. They also propose some adaptations to improve the performance of these networks for the vehicle behaviour prediction problem. X. Li et al. [36] propose a graph-based interaction-aware trajectory prediction (GRIP) model. They use a graph convolutional model, which consists of several convolutional layers as well as graph operations, to model the interaction among the vehicles. The output of the graph convolutional model is fed to an LSTM encoder-decoder to predict the trajectory for multiple TVs.
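
The graph view of a driving scene can be sketched in a few lines of Python. The example below is a simplified stand-in for one graph operation: each vehicle averages its own feature vector with those of the vehicles it interacts with. It omits the learnt weight matrix, the non-linearity and the specific normalisation used in GCN [28] or GAT [55], and the scene, features and names are hypothetical.

```python
def graph_conv_step(features, edges):
    """One mean-aggregation step: each node averages its own features with
    those of its neighbours (self-loop included)."""
    n = len(features)
    neighbours = {i: {i} for i in range(n)}
    for a, b in edges:  # undirected interaction edges
        neighbours[a].add(b)
        neighbours[b].add(a)
    dim = len(features[0])
    return [
        [sum(features[j][d] for j in neighbours[i]) / len(neighbours[i]) for d in range(dim)]
        for i in range(n)
    ]


# Hypothetical scene: node 0 is the TV, nodes 1 and 2 are SVs.
# Features per node: (longitudinal position in m, speed in m/s).
features = [[0.0, 20.0], [10.0, 18.0], [-15.0, 25.0]]
edges = [(0, 1), (0, 2)]  # the TV interacts with both SVs; the SVs do not interact

updated = graph_conv_step(features, edges)
print(updated)
```

After one step, the TV's feature vector mixes information from both SVs, while each SV only sees the TV. Stacking such steps lets information propagate between vehicles that are not directly connected.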

Table III provides a summary of classification of previous studies based on the prediction method.

IV Evaluation

In this section, we first present the evaluation metrics that are commonly used for vehicle behaviour prediction. Then, the performance of some previous works is discussed. Finally, we identify and discuss the main research gaps and opportunities.

IV-A Evaluation Metrics

We discuss the evaluation metrics for intention prediction models and trajectory prediction models separately, as the former is a classification problem and the latter is a regression problem and each problem has a separate set of metrics.

Work | Classification: Input Representation | Output Type | Prediction Model | RMSE (m): 1 s | 2 s | 3 s | 4 s | 5 s
CV | - | - | - | 0.73 | 1.78 | 3.13 | 4.78 | 6.68
[2] | Track history of TV and SVs | Unimodal Trajectory | RNN (Single RNN) | 0.72 | 2 | 3.76 | 5.97 | 9.01
[56] | Track history of TV | Intention-based Trajectory | RNN (Multiple RNNs) | 0.49 | 1.41 | 2.6 | 4.06 | 5.79
M-LSTM [10] | Track history of TV and SVs | Multimodal Trajectory | RNN (Multiple RNNs) | 0.58 | 1.26 | 2.12 | 3.24 | 4.66
CS-LSTM [11] | Simplified Bird’s Eye View | Multimodal Trajectory | Combination of RNNs and CNNs | 0.61 | 1.27 | 2.09 | 3.1 | 4.37
ST-LSTM [9] | Track history of TV and SVs | Unimodal Trajectory | RNN (Multiple RNNs) | 0.56 | 1.19 | 1.93 | 2.78 | 3.76
GRIP [36] | Track history of TV and SVs | Unimodal Trajectory | Graph Neural Networks | 0.64 | 1.13 | 1.8 | 2.62 | 3.6
TABLE IV: Comparison of trajectory prediction error of the previous models for different prediction horizons

IV-A1 Intention Prediction Metrics

  • Accuracy: One of the most common classification metrics is accuracy, which is defined as the total number of correctly classified data samples divided by the total number of data samples. However, relying only on accuracy can be misleading for an imbalanced dataset. For example, the number of lane changes in a highway driving dataset is usually much lower than the number of lane-keeping samples. Thus, an intention predictor that always outputs lane keeping, regardless of the input data, achieves a high accuracy score. Therefore, other metrics like precision, recall, and F1 score are also used in previous studies [43, 6].

  • Precision: For a given class, precision is defined as the ratio of total number of data samples which are correctly classified in that class to the total number of samples classified as the given class. A low precision indicates a large number of incorrectly classified data in the given class.

  • Recall: For a given class, recall is defined as the ratio of the total number of data samples which are correctly classified in that class to the total number of samples in the given class. A low recall indicates a large number of data samples in the given class that are incorrectly classified in other classes.

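  • F1 Score: The F1 score (a.k.a. F-score or F-measure) is a balance between precision and recall (their harmonic mean) and is defined as:

    $$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$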

  • Negative Log Likelihood (NLL): For a multi-class classification task, NLL is calculated as:

    $$\mathrm{NLL} = -\sum_{c=1}^{M} y_c \log(p_c)$$

    where $y_c$ is a binary indicator of whether class $c$ is the correct class for the observation, $p_c$ is the predicted probability of belonging to class $c$, and $M$ is the number of classes. Although NLL values are not as interpretable as the previous metrics, they can be used to compare the uncertainty of different intention prediction models [13].

  • Prediction Horizon: Apart from the metrics that deal with the correctness of prediction, it is important to report how far in the future the prediction model can predict the driving intention of the TV. Prediction horizon can also be used to distinguish between manoeuvre prediction and detection problems. For example, a model that outputs lane change for the TV 1.0 s before it crosses the dividing line is considered a behaviour detection model rather than behaviour prediction model, since lane change behaviour commonly takes 3.0 to 5.0 s [51] and during this period the manoeuvre is clearly happening.
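
The classification metrics above can be computed directly from label lists. The following minimal Python sketch uses a hypothetical, imbalanced set of ten samples (nine lane-keeping, one lane change) and a predictor that always outputs lane keeping. Returning 0.0 when precision, recall or F1 is undefined (zero denominator) is a convention assumed here, not something stated in the text.

```python
import math


def classification_metrics(y_true, y_pred, positive):
    """Accuracy plus precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # Assumed convention: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


def nll(class_probs, true_class):
    """Negative log likelihood of the true class for a single observation."""
    return -math.log(class_probs[true_class])


# Hypothetical imbalanced highway data: 9 lane-keeping samples and 1 lane change.
y_true = ["keep"] * 9 + ["change"]
always_keep = ["keep"] * 10  # a predictor that ignores its input

acc, prec, rec, f1 = classification_metrics(y_true, always_keep, positive="change")
print(f"accuracy={acc}, precision={prec}, recall={rec}, F1={f1}")
print("NLL of a 50/50 guess:", nll({"keep": 0.5, "change": 0.5}, "change"))
```

The trivial predictor reaches an accuracy of 0.9 but a recall and F1 score of 0.0 for the lane-change class, which is the failure mode that accuracy alone hides.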

IV-A2 Trajectory Prediction Metrics

  • Final Displacement Error: This error measures the distance between the predicted final location and the true final location of the vehicle at the end of the prediction horizon; it does not consider the prediction error at other time steps in the prediction horizon.

  • Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE): MAE measures the average magnitude of the prediction error $e_i$, while RMSE measures the square root of the average of the squared prediction error:

    $$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |e_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} e_i^2}$$

    where $N$ is the number of data samples and $e_i$ can be defined as the displacement error between the predicted trajectory and the ground truth. MAE and RMSE are two of the most common metrics for regression problems and behave roughly similarly. However, RMSE is more sensitive to large errors because its definition uses the squared error.

  • Negative Log Likelihood (NLL): For a modelled trajectory distribution $f(Y)$, NLL is calculated as:

    $$\mathrm{NLL} = -\log f(Y_{\mathrm{gt}})$$

    where $Y_{\mathrm{gt}}$ is the ground-truth trajectory. NLL can be reported as a metric in both intention prediction and trajectory prediction; however, in trajectory prediction this metric can be more important, as both MAE and RMSE are biased in favour of models that predict the average of modals [11], which is not necessarily a good prediction, as discussed before. Therefore, NLL is also reported in previous studies to compare the correctness of the underlying predicted trajectory distribution.

  • Computation Time: The trajectory prediction models are usually more complex compared to intention prediction models. Therefore, they can take more computation time which might make them impractical for on-board implementation in autonomous vehicles. Thus, it is crucial to report and compare computation time in trajectory prediction models.
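
The trajectory metrics can be illustrated with a minimal Python sketch. It assumes a hypothetical four-step trajectory with a single large error at the last step, takes the per-step Euclidean displacement as the prediction error, and, for NLL, assumes an isotropic 2D Gaussian as the predicted distribution. All names and numbers are illustrative.

```python
import math


def displacement_errors(pred, truth):
    """Euclidean distance between predicted and true position at each time step."""
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]


def final_displacement_error(pred, truth):
    """Displacement error at the last step of the prediction horizon only."""
    return displacement_errors(pred, truth)[-1]


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def gaussian_nll(point, mean, sigma):
    """NLL of a 2D point under an isotropic Gaussian N(mean, sigma^2 I) (assumed model)."""
    d2 = (point[0] - mean[0]) ** 2 + (point[1] - mean[1]) ** 2
    return math.log(2 * math.pi * sigma ** 2) + d2 / (2 * sigma ** 2)


# Hypothetical positions in metres: the prediction is exact except at the last step.
truth = [(0.0, 0.0), (0.0, 10.0), (0.0, 20.0), (0.0, 30.0)]
pred = [(0.0, 0.0), (0.0, 10.0), (0.0, 20.0), (3.0, 34.0)]

errors = displacement_errors(pred, truth)
print("FDE :", final_displacement_error(pred, truth))
print("MAE :", mae(errors))
print("RMSE:", rmse(errors))
print("NLL at final step (sigma = 2 m):", gaussian_nll(truth[-1], pred[-1], 2.0))
```

With one 5 m error among four steps, MAE is 1.25 m while RMSE is 2.5 m, which shows how squaring makes RMSE more sensitive to large errors.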

IV-B Performance of Existing Methods

In this part, we compare the performance of some of the reviewed trajectory prediction methods. The studies selected for comparison are the ones that use common publicly available datasets and common metrics. These studies report RMSE for prediction horizons of 1.0 to 5.0 s on the NGSIM I-80 and US-101 highway driving datasets [52]. Table IV provides the reported error for each paper, which is obtained from the original paper (except the RMSE calculation of [2], which has been modified by [56] to match the position error in SI units). Note that the RMSE is reported for longitudinal and lateral position separately in [56]; however, we calculated the total RMSE to be consistent with other studies. Furthermore, in [9] the error is calculated for US-101 and I-80 separately, whereas we report their average. We also report the prediction result of a constant velocity Kalman filter (CV) model as a baseline, which is obtained from 


To compare the performance of the selected works, Table IV states the category each work belongs to. According to Table IV, most of the deep learning-based methods surpass conventional methods like the constant velocity model (CV) by a large margin. Among the reviewed deep learning-based models, complex models (e.g. Multiple RNNs or Combination of RNNs and CNNs) achieve better performance than simple models like a single RNN. Nonetheless, increasing the complexity of the output, by predicting multimodal trajectories instead of a unimodal trajectory, does not always result in lower RMSE. For example, the models named GRIP [36] and ST-LSTM [9] achieve better performance than M-LSTM [10] and CS-LSTM [11], although the former predict unimodal trajectories and the latter predict multimodal trajectories. This can be due to limited model capacity or limited data used in the discussed multimodal trajectory prediction models.
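
The comparison above can be reproduced from the numbers in Table IV. The short Python sketch below copies the reported RMSE values and computes, for each model, the relative RMSE reduction against the CV baseline at a chosen horizon. The helper names are illustrative, and the relative reduction is a convenience measure introduced here rather than one reported in the reviewed papers.

```python
# RMSE values copied from Table IV for prediction horizons of 1 s to 5 s.
table_iv = {
    "CV":           [0.73, 1.78, 3.13, 4.78, 6.68],
    "[2]":          [0.72, 2.00, 3.76, 5.97, 9.01],
    "[56]":         [0.49, 1.41, 2.60, 4.06, 5.79],
    "M-LSTM [10]":  [0.58, 1.26, 2.12, 3.24, 4.66],
    "CS-LSTM [11]": [0.61, 1.27, 2.09, 3.10, 4.37],
    "ST-LSTM [9]":  [0.56, 1.19, 1.93, 2.78, 3.76],
    "GRIP [36]":    [0.64, 1.13, 1.80, 2.62, 3.60],
}


def reduction_vs_baseline(model, horizon_s, baseline="CV"):
    """Relative RMSE reduction of `model` against the baseline at a horizon in seconds."""
    idx = horizon_s - 1
    base = table_iv[baseline][idx]
    return (base - table_iv[model][idx]) / base


def best_at(horizon_s):
    """Name of the entry with the lowest RMSE at the given horizon."""
    return min(table_iv, key=lambda name: table_iv[name][horizon_s - 1])


for name in table_iv:
    print(f"{name:13s} 5 s RMSE reduction vs CV: {reduction_vs_baseline(name, 5):6.1%}")
print("lowest RMSE at 1 s:", best_at(1))
print("lowest RMSE at 5 s:", best_at(5))
```

On these numbers, GRIP [36] has the lowest RMSE at 5 s (about 46% below CV), [56] has the lowest at 1 s, and the single-RNN model [2] has a negative reduction at 5 s, meaning its reported error is above the CV baseline at that horizon.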

IV-C Research Gaps and New Opportunities

We discuss some of the main research gaps in vehicle behaviour prediction problem, which can be considered as opportunities for future works:

  1. Unlike object detection, which has a unified way of evaluation [3], there is no benchmark for evaluating previous studies on vehicle behaviour prediction. Among the reviewed papers, only six works use a unified evaluation method (the works whose performance we compare in this paper). In addition, only a few works report the computation time of their algorithms, although this metric is highly important in autonomous driving applications. As future work, a benchmark can be defined and used in vehicle behaviour prediction so that the performance of different studies can be thoroughly compared.

  2. Most of the previous works assume full observability of the surrounding environment and vehicles’ states, which is not feasible in practice. Infrastructure sensors can provide a non-occluded top-down view of the environment; however, it is impractical to cover all road sections with such sensors. Therefore, a realistic solution for behaviour prediction should always consider sensor impairments (e.g. occlusion, noise), which can limit the number of observable vehicles around the target vehicle and in turn may reduce the accuracy of behaviour predictors in autonomous vehicles. One possible solution is the utilization of connected autonomous vehicles. In this case, the connected vehicle can exploit the information gained by sensors implemented in other vehicles or infrastructure through V2V and V2I communication (see Figure 4).

  3. In recent studies, traffic rules are rarely considered as an explicit input to the model, although they can reshape the behaviour of a vehicle in a driving scenario. Some of the previous studies include road direction or traffic lights as an input to their model [43, 14], which are only a small part of traffic signs and rules.

  4. In addition to the vehicle’s states and scene information, which are both usually considered in recent works, other visual and auditory data of vehicles, like a vehicle’s signalling lights and horn, can also be used to infer its future behaviour.

  5. Most of the previous works are limited to a specific driving scenario such as a roundabout, intersection, or T-junction. However, a vehicle behaviour prediction module in a fully autonomous vehicle should be able to predict the behaviour in any driving scenario. Developing a model which can be applied to a variety of driving environments can be a direction for future research.

Fig. 4: An illustration of the vehicle behaviour prediction problem for connected autonomous vehicles. The sensors implemented in other autonomous vehicles and infrastructure can provide more information about the surrounding vehicle and reduce the object occlusion problem in ego vehicle.

V Conclusion

Although deep learning-based behaviour prediction solutions have superior performance and are more sophisticated in terms of input representation, output type, and prediction method compared to conventional solutions, there are several open challenges that need to be addressed to enable their adoption in autonomous driving applications. In particular, while most existing solutions consider the interaction among vehicles, factors such as environment conditions and the set of traffic rules are not given directly as input to the prediction model. In addition, practical limitations such as sensor impairments and limited computational resources have not been fully taken into account.


  • [1] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese (2016-06) Social LSTM: human trajectory prediction in crowded spaces.
  • [2] F. Altché and A. de La Fortelle (2017-10) An LSTM network for highway trajectory prediction. pp. 353–359. ISSN 2153-0017.
  • [3] E. Arnold, O. Y. Al-Jarrah, M. Dianati, S. Fallah, D. Oxtoby, and A. Mouzakitis (2019) A survey on 3D object detection methods for autonomous driving applications. IEEE Transactions on Intelligent Transportation Systems, pp. 1–14. ISSN 1524-9050.
  • [4] M. Bahari and A. Alahi (2019) Feed-forwards meet recurrent networks in vehicle trajectory prediction.
  • [5] P. V. K. Borges, N. Conci, and A. Cavallaro (2013-11) Video-based human behavior understanding: a survey. IEEE Transactions on Circuits and Systems for Video Technology 23 (11), pp. 1993–2008. ISSN 1051-8215.
  • [6] S. Casas, W. Luo, and R. Urtasun (2018-10) IntentNet: learning to predict intention from raw sensor data. pp. 947–956.
  • [7] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder–decoder for statistical machine translation. pp. 1724–1734.
  • [8] H. Cui, V. Radosavljevic, F. Chou, T. Lin, T. Nguyen, T. Huang, J. Schneider, and N. Djuric (2018) Multimodal trajectory predictions for autonomous driving using deep convolutional networks. CoRR abs/1809.10732.
  • [9] S. Dai, L. Li, and Z. Li (2019) Modeling vehicle interactions via modified LSTM models for trajectory prediction. IEEE Access 7, pp. 38287–38296. ISSN 2169-3536.
  • [10] N. Deo and M. M. Trivedi (2018-06) Multi-modal trajectory prediction of surrounding vehicles with maneuver based LSTMs. pp. 1179–1184. ISSN 1931-0587.
  • [11] N. Deo and M. M. Trivedi (2018-06) Convolutional social pooling for vehicle trajectory prediction.
  • [12] F. Diehl (2019-06) Graph neural networks for modelling traffic participant interaction. In IEEE Intelligent Vehicles Symposium 2019.
  • [13] W. Ding, J. Chen, and S. Shen (2019) Predicting vehicle behaviors over an extended horizon using behavior interaction network. CoRR abs/1903.00848. External Links: Link, 1903.00848 Cited by: 2nd item, §III-A2, §III-B1, TABLE I, TABLE II, TABLE III, 5th item.
  • [14] W. Ding and S. Shen (2019-03) Online Vehicle Trajectory Prediction using Policy Anticipation Network and Optimization-based Context Reasoning. arXiv e-prints, arXiv:1903.00847. External Links: 1903.00847 Cited by: 1st item, §III-A2, §III-B3, TABLE I, TABLE II, TABLE III, item 3.
  • [15] N. Djuric, V. Radosavljevic, H. Cui, T. Nguyen, F. Chou, T. Lin, and J. Schneider (2018) Motion prediction of traffic actors for autonomous driving using deep convolutional networks. CoRR abs/1808.05819. External Links: Link, 1808.05819 Cited by: §III-A3, §III-B2, §III-C2, TABLE I, TABLE II, TABLE III.
  • [16] C. Farabet, C. Couprie, L. Najman, and Y. LeCun (2013-08) Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1915–1929. External Links: Document, ISSN 0162-8828 Cited by: §III-C2.
  • [17] T. Fernando, S. Denman, S. Sridharan, and C. Fookes (2018) Soft + hardwired attention: an LSTM framework for human trajectory prediction and abnormal event detection. Neural Networks 108, pp. 466–478. External Links: ISSN 0893-6080, Document, Link Cited by: §I.
  • [18] M. Gori, G. Monfardini, and F. Scarselli (2005-07) A new model for learning in graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 2, pp. 729–734. External Links: Document, ISSN 2161-4393 Cited by: 3rd item.
  • [19] A. Graves (2012) Long short-term memory. In Supervised Sequence Labelling with Recurrent Neural Networks, pp. 37–45. External Links: ISBN 978-3-642-24797-2, Document, Link Cited by: §III-C1.
  • [20] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi (2018-06) Social GAN: socially acceptable trajectories with generative adversarial networks. Cited by: §I.
  • [21] B. Hammer (2000) On the approximation capability of recurrent neural networks. Neurocomputing 31 (1-4), pp. 107–123. Cited by: §III-C1.
  • [22] D. Helbing and P. Molnar (1995) Social force model for pedestrian dynamics. Physical review E 51 (5), pp. 4282. Cited by: §I.
  • [23] T. Hirakawa, T. Yamashita, T. Tamaki, and H. Fujiyoshi (2018) Survey on vision-based path prediction. Cham, pp. 48–64. External Links: ISBN 978-3-319-91131-1 Cited by: §I.
  • [24] S. Hochreiter and J. Schmidhuber (1997-12) Long short-term memory. Neural computation 9, pp. 1735–80. External Links: Document Cited by: §III-C1.
  • [25] S. Hoermann, M. Bach, and K. Dietmayer (2018-05) Dynamic occupancy grid prediction for urban autonomous driving: a deep learning approach with fully automatic labeling. pp. 2056–2063. External Links: Document, ISSN 2577-087X Cited by: 2nd item, §III-A3, §III-C2, TABLE I, TABLE II, TABLE III.
  • [26] Y. Hu, W. Zhan, and M. Tomizuka (2018-06) Probabilistic prediction of vehicle semantic intention and motion. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 307–313. External Links: Document, ISSN 1931-0587 Cited by: 1st item, §III-A2, TABLE I, TABLE III.
  • [27] C. Ju, Z. Wang, C. Long, X. Zhang, G. Cong, and D. E. Chang (2019-02) Interaction-aware Kalman Neural Networks for Trajectory Prediction. arXiv e-prints. External Links: 1902.10928 Cited by: §III-B2, TABLE II.
  • [28] T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. ArXiv abs/1609.02907. Cited by: 3rd item, TABLE III.
  • [29] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 3294–3302. External Links: Link Cited by: 1st item.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. External Links: Link Cited by: §III-C2.
  • [31] S. K. Kumaran, D. P. Dogra, and P. P. Roy (2019) Anomaly detection in road traffic using visual surveillance: A survey. CoRR abs/1901.08292. External Links: Link, 1901.08292 Cited by: §I.
  • [32] D. Lee, Y. P. Kwon, S. McMains, and J. K. Hedrick (2017-10) Convolution neural network-based lane change intention prediction of surrounding vehicles for ACC. pp. 1–6. External Links: Document, ISSN 2153-0017 Cited by: §III-A3, §III-B1, §III-C2, TABLE I, TABLE II, TABLE III.
  • [33] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. S. Torr, and M. Chandraker (2017-07) DESIRE: distant future prediction in dynamic scenes with interacting agents. Cited by: 2nd item, 2nd item, §III-A3, TABLE I, TABLE II, TABLE III.
  • [34] S. Lefèvre, D. Vasquez, and C. Laugier (2014-07-23) A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH Journal 1 (1), pp. 1. External Links: ISSN 2197-4225, Document, Link Cited by: §I, §II, §III-A, §III.
  • [35] C. Li, Z. Zhang, W. Sun Lee, and G. Hee Lee (2018-06) Convolutional sequence to sequence model for human dynamics. Cited by: §III-C2.
  • [36] X. Li, X. Ying, and M. C. Chuah (2019) GRIP: graph-based interaction-aware trajectory prediction. arXiv preprint arXiv:1907.07792. Cited by: 3rd item, §III-A2, §III-B2, TABLE I, TABLE II, TABLE III, §IV-B, §IV-B, TABLE IV.
  • [37] W. Luo, B. Yang, and R. Urtasun (2018-06) Fast and furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. Cited by: §III-A4, §III-B2, §III-C2, TABLE I, TABLE II, TABLE III.
  • [38] J. Martinez, M. J. Black, and J. Romero (2017-07) On human motion prediction using recurrent neural networks. Cited by: 1st item.
  • [39] G. Neubig (2017) Neural machine translation and sequence-to-sequence models: a tutorial. arXiv preprint arXiv:1703.01619. Cited by: 1st item.
  • [40] H. Noh, S. Hong, and B. Han (2015-12) Learning deconvolution network for semantic segmentation. Cited by: §III-C2, TABLE III.
  • [41] D. Nuss, S. Reuter, M. Thom, T. Yuan, G. Krehl, M. Maile, A. Gern, and K. Dietmayer (2018) A random finite set approach for dynamic occupancy grid maps with real-time application. The International Journal of Robotics Research 37 (8), pp. 841–866. External Links: Document, Link Cited by: 2nd item, §III-A3.
  • [42] S. H. Park, B. Kim, C. M. Kang, C. C. Chung, and J. W. Choi (2018-06) Sequence-to-sequence prediction of vehicle trajectory via LSTM encoder-decoder architecture. pp. 1672–1678. External Links: Document, ISSN 1931-0587 Cited by: 2nd item, 1st item, §III-A1, TABLE I, TABLE II, TABLE III.
  • [43] D. J. Phillips, T. A. Wheeler, and M. J. Kochenderfer (2017-06) Generalizable intention prediction of human drivers at intersections. pp. 1665–1670. External Links: Document Cited by: 1st item, §III-A2, §III-B1, TABLE I, TABLE II, TABLE III, 1st item, item 3.
  • [44] (2016-05) Research on the impacts of connected and autonomous vehicles (CAVs) on traffic flow: summary report. Technical report, UK Department for Transport. External Links: Link Cited by: §I.
  • [45] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras (2019) Human motion trajectory prediction: A survey. CoRR abs/1905.06113. External Links: Link, 1905.06113 Cited by: §I.
  • [46] S. Ruder (2017) An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098. External Links: Link, 1706.05098 Cited by: §III-A4.
  • [47] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018-06) MobileNetV2: inverted residuals and linear bottlenecks. Cited by: §III-C2, TABLE III.
  • [48] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2009-01) The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. External Links: Document, ISSN 1045-9227 Cited by: 3rd item.
  • [49] M. Schreiber, S. Hoermann, and K. Dietmayer (2018) Long-term occupancy grid prediction using recurrent neural networks. CoRR abs/1809.03782. External Links: Link, 1809.03782 Cited by: 2nd item, 2nd item, §III-A3, TABLE I, TABLE II, TABLE III.
  • [50] M. S. Shirazi and B. T. Morris (2017-01) Looking at intersections: a survey of intersection monitoring, behavior and safety analysis of recent studies. IEEE Transactions on Intelligent Transportation Systems 18 (1), pp. 4–24. External Links: Document, ISSN 1524-9050 Cited by: §I.
  • [51] T. Toledo and D. Zohar (2007) Modeling duration of lane changes. Transportation Research Record 1999 (1), pp. 71–78. External Links: Document, Link Cited by: 6th item.
  • [52] Traffic analysis tools: Next Generation Simulation - FHWA Operations. Note: 2019-08-15 Cited by: §IV-B.
  • [53] B. T. Morris and M. M. Trivedi (2013) Understanding vehicular traffic behavior from video: a survey of unsupervised approaches. Journal of Electronic Imaging 22 (4), pp. 1–16. External Links: Document, Link Cited by: §I.
  • [54] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu (2016) WaveNet: a generative model for raw audio. SSW 125. Cited by: §III-C2.
  • [55] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations, External Links: Link Cited by: 3rd item, TABLE III.
  • [56] L. Xin, P. Wang, C. Chan, J. Chen, S. E. Li, and B. Cheng (2018-11) Intention-aware long horizon trajectory prediction of surrounding vehicles using dual LSTM networks. pp. 1441–1446. External Links: Document, ISSN 2153-0017 Cited by: 2nd item, §III-A1, §III-B3, TABLE I, TABLE II, TABLE III, §IV-B, TABLE IV.
  • [57] W. Zhan, A. L. de Fortelle, Y. Chen, C. Chan, and M. Tomizuka (2018-06) Probabilistic prediction from planning perspective: problem formulation, representation simplification and evaluation metric. pp. 1150–1156. External Links: Document, ISSN 1931-0587 Cited by: §II.
  • [58] A. Zyner, S. Worrall, and E. Nebot (2018-07) A recurrent neural network solution for predicting driver intention at unsignalized intersections. IEEE Robotics and Automation Letters 3 (3), pp. 1759–1764. External Links: Document, ISSN 2377-3766 Cited by: 1st item, §III-A1, §III-B1, TABLE I, TABLE II, TABLE III.
  • [59] A. Zyner, S. Worrall, J. Ward, and E. Nebot (2017-06) Long short term memory for driver intent prediction. pp. 1484–1489. External Links: Document Cited by: 1st item, §III-A1, §III-B1, TABLE I, TABLE II, TABLE III.
  • [60] A. Zyner, S. Worrall, and E. M. Nebot (2018) Naturalistic driver intention and path prediction using recurrent neural networks. CoRR abs/1807.09995. External Links: Link, 1807.09995 Cited by: 2nd item, 1st item, §III-A1, §III-B2, §III-B4, TABLE I, TABLE II, TABLE III.