Driving a car is a complex activity that requires drivers to understand multi-actor scenes in real time and to act in a rapidly changing environment within a fraction of a second. To rely fully on autonomous vehicles, it is desirable to correctly assess the confidence of the algorithm in its predictions, including under distributional shift, e.g. on roads, in cities, or in countries that the algorithms have not seen before.
To rely fully on autonomous vehicles, it is also necessary to be confident that all algorithms used for autonomous driving generalize well.
The motivation to understand and predict human motion is immense, and it has a deep impact on related topics such as decision making, path planning, autonomous navigation, surveillance, tracking, scene understanding, and anomaly detection.
The problem of forecasting where cars will be in the near future is, however, ill-posed by nature: human beings tend to be unpredictable in their decisions, and car driving is no exception. This random nature of motion poses an open challenge for prediction algorithms, which should be accurate and should correctly capture the uncertainty associated with their predictions.
The contributions of this work are summarized as follows: 1) We propose a unified transformer-based motion prediction framework for both multi-modal trajectory prediction and uncertainty estimation. 2) Our proposed approach achieves state-of-the-art performance and ranks 1st on the Shifts Vehicle Motion Prediction Competition.
2 Related Work
The motion prediction task is one of the most important in the field of autonomous driving and has recently attracted a lot of attention from both academia and industry alahi2016social lee2017desire gupta2018social chai2019multipath salzmann2020trajectron++ postnikov2020hsfm.
Broadly, modern motion prediction methods can be divided into two classes:
1) Models in which scene context information is processed from vectorized maps (HD maps) gao2020vectornet liang2020learning.
2) Models in which high-definition maps and the surroundings of each vehicle in the car's vicinity are rasterized to an image representation, thus providing the complete context and information necessary for accurate prediction of the future trajectory djuric2020uncertainty postnikov2021covariancenet fang2020tpnet.
Recently, models based on transformer architectures have shown their applicability both to computer vision tasks dosovitskiy2020image liu2021swin and to sequence-to-sequence tasks radford2018improving, devlin2018bert, which suggests a high potential for applying transformer-based approaches to the motion prediction task.
We assume that object detections and tracks are provided by the perception stack (running on the Yandex self-driving car (SDC) fleet malinin2021shifts) and focus only on motion prediction.
The goal of the proposed method is to predict the most likely movement trajectory of vehicles over the prediction horizon, together with a scalar uncertainty estimate from the model. This estimate can later be used by subsequent SDC pipeline algorithms as a measure of forecast uncertainty: a low estimated uncertainty indicates a scene context that is familiar or low risk, whereas a high estimated uncertainty indicates one that is unfamiliar or high risk.
In this section we describe the architecture of our model, the loss function used for training, and implementation details.
The future is ambiguous, and human motion is unpredictable and multimodal by nature. To account for this multimodal nature, we aim to produce up to K=5 different hypotheses (proposals) for the future trajectory, together with their probabilities, which are evaluated against the ground truth trajectory.
3.1 Input representation
Context information about the state of dynamic objects (i.e., vehicles and pedestrians), described by their position, velocity, linear acceleration, and orientation, together with context information about the HD map, including lane information (e.g., traffic direction, lane priority, speed limit, traffic light association), road boundaries, crosswalks, and traffic light states, is rasterized into multi-channel images, which are passed to the transformer-based model.
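As an illustration of the rasterization idea, the following minimal sketch writes per-object values into image channels centred on the origin. The function name `rasterize_points`, the image size, the resolution, and the choice of channels (occupancy and speed) are assumptions made for this example and are not taken from the dataset tooling.

```python
import numpy as np

def rasterize_points(points_xy, values, size=128, meters_per_pixel=1.0):
    """Write one scalar value per object into a single image channel.

    points_xy: (N, 2) metric coordinates relative to the image centre.
    values:    (N,) value to store at each object's pixel.
    """
    channel = np.zeros((size, size), dtype=np.float32)
    # Metric coordinates -> pixel indices, with the origin at the image centre
    # and the y axis pointing up (row index decreases as y grows).
    cols = np.floor(points_xy[:, 0] / meters_per_pixel + size / 2).astype(int)
    rows = np.floor(size / 2 - points_xy[:, 1] / meters_per_pixel).astype(int)
    # Objects outside the image footprint are ignored.
    inside = (rows >= 0) & (rows < size) & (cols >= 0) & (cols < size)
    channel[rows[inside], cols[inside]] = values[inside]
    return channel

# Hypothetical scene: three objects, the last one far outside the raster.
positions = np.array([[0.0, 0.0], [10.0, 5.0], [500.0, 0.0]])
speed = np.array([3.0, 7.5, 1.0])

occupancy = rasterize_points(positions, np.ones(len(positions)))
speed_channel = rasterize_points(positions, speed)
image = np.stack([occupancy, speed_channel])  # (channels, height, width)
```

A real rasterizer would also draw object footprints, lanes, and crosswalks as polygons or polylines; the sketch only shows how several feature maps end up stacked as channels of one image.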
3.2 Model architecture
Our method can be viewed as image-based regression. The model architecture, shown in Fig. 1, consists of two main stages:
1) A transformer-based image processing encoder, namely a modified ViT dosovitskiy2020image (modified to process multi-channel images), which acts as a current-state estimator for a single vehicle. ViT uses multi-head self-attention vaswani2017attention, which removes the image-specific inductive biases of CNN approaches, and its self-attention layers allow it to integrate information globally across the entire image, so ViT does not suffer from a lack of global understanding of the image.
2) A transformer-based decoder, which predicts different hypotheses (proposals) for the future trajectory with corresponding confidence values, which are normalized with the softmax operator so that they sum to one. Apart from predicting multi-trajectory plans, the proposed model predicts an overall scene uncertainty score, which is described in more detail later.
Multi-channel images are first split into fixed-size patches and then processed by the visual transformer model. The dense encoded latent state produced by ViT is then repeated N times, according to the number of trajectories to be predicted.
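The patch-splitting step can be sketched as follows for a multi-channel image. This is a minimal NumPy illustration; the function name `split_into_patches` and the example sizes (5 channels, 32x32 pixels, patch size 16) are assumptions for the example, not the configuration used in this work.

```python
import numpy as np

def split_into_patches(image, patch):
    """Split a (C, H, W) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, C * patch * patch), with patches
    ordered row by row over the image grid.
    """
    c, h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    grid_h, grid_w = h // patch, w // patch
    x = image.reshape(c, grid_h, patch, grid_w, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (grid_h, grid_w, C, patch, patch)
    return x.reshape(grid_h * grid_w, c * patch * patch)

# Hypothetical 5-channel raster of 32x32 pixels, split into 16x16 patches.
raster = np.arange(5 * 32 * 32, dtype=np.float32).reshape(5, 32, 32)
patches = split_into_patches(raster, patch=16)
```

In a ViT-style encoder each flattened patch would then be linearly projected to the hidden size before entering the transformer layers; that projection is omitted here.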
Each copy of the latent state is concatenated with a vector of samples from a normal distribution. According to our internal experiments, this gives minor improvements in metrics compared to the sinusoidal positional encoding vaswani2017attention, which can be explained by the absence of relative or absolute positional correlation between sequential states. In addition, samples from a normal distribution, concatenated with the repeated latent state, help the decoder to transform the repeated latent state into more diverse trajectories.
At the same time, multi-head attention attends globally and can therefore learn long-range relationships, which provides more opportunities for a correct assessment of uncertainties.
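The construction of the decoder inputs described above can be sketched as follows. The latent size of 512 and the noise dimension of 8 follow the implementation details given later in the text, and 5 copies matches K=5; the function name `make_decoder_inputs` and the use of a standard normal distribution (zero mean, unit variance) are assumptions for this sketch.

```python
import numpy as np

def make_decoder_inputs(latent, num_modes, noise_dim, rng):
    """Repeat one latent vector and append independent Gaussian noise.

    latent: (D,) encoded state of the scene.
    Returns an array of shape (num_modes, D + noise_dim).
    """
    repeated = np.tile(latent, (num_modes, 1))            # (num_modes, D)
    noise = rng.standard_normal((num_modes, noise_dim))   # differs per copy
    return np.concatenate([repeated, noise], axis=1)

rng = np.random.default_rng(0)
latent = rng.standard_normal(512)  # stand-in for the encoder output
decoder_inputs = make_decoder_inputs(latent, num_modes=5, noise_dim=8, rng=rng)
```

Because the repeated copies are identical, the appended noise is the only part of the input that distinguishes one decoder query from another, which is consistent with the observation that it helps produce more diverse trajectories.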
3.3 Loss function
We model possible future trajectories as a mixture of Gaussian distributions, as this allows the model to predict a multi-modal distribution, in contrast to the widely used ADE loss. Examples of the model's predictions are shown in Fig. 4. In this case our network outputs the mean positions of the Gaussians, while we fix the covariance of every Gaussian in the mixture to the identity matrix. For each trajectory the model also predicts a probability, and these probabilities are normalized with the softmax operator so that they sum to one. Given these predictions, we use as the loss function the negative log-likelihood (NLL) of the mixture of Gaussians defined by the predicted proposals, evaluated at the ground truth coordinates.
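One way to write this loss explicitly is given below, as a sketch under the assumptions stated in the text (identity covariance, softmax-normalized probabilities). The symbols $c^{k}$ for the probability of hypothesis $k$, $(\bar{x}^{k}_{t}, \bar{y}^{k}_{t})$ for its predicted position at step $t$, and $(x_{t}, y_{t})$ for the ground truth position are notation assumed for this sketch, and the additive constant from the Gaussian normalization is dropped:

$$
L_{\mathrm{NLL}} = -\log \sum_{k=1}^{K} c^{k} \exp\!\left( -\frac{1}{2} \sum_{t=1}^{T} \left[ \left(x_{t} - \bar{x}^{k}_{t}\right)^{2} + \left(y_{t} - \bar{y}^{k}_{t}\right)^{2} \right] \right), \qquad \sum_{k=1}^{K} c^{k} = 1 .
$$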
Here T is the prediction horizon and K is the number of hypotheses.
We compute the negative log probability of the ground truth trajectory under the predicted mixture of Gaussians, with the means equal to the predicted trajectories and the identity matrix I as covariance.
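A minimal NumPy sketch of this computation is shown below. It assumes that the network emits unnormalized confidence scores (called `logits` here) that are passed through a softmax, and it drops the additive Gaussian normalization constant; the function name `mixture_nll` and the array layout are choices made for this example. The log-sum-exp trick is used for numerical stability.

```python
import numpy as np

def mixture_nll(gt, means, logits):
    """NLL of a ground truth trajectory under a mixture of unit-covariance Gaussians.

    gt:     (T, 2) ground truth coordinates.
    means:  (K, T, 2) predicted trajectories (the Gaussian means).
    logits: (K,) unnormalized confidences; softmax gives mixture weights.
    The constant term of the Gaussian density is omitted.
    """
    # Log-softmax of the confidences.
    shifted = logits - np.max(logits)
    log_weights = shifted - np.log(np.sum(np.exp(shifted)))
    # Squared distance to the ground truth, summed over time and coordinates.
    sq_err = np.sum((means - gt[None]) ** 2, axis=(1, 2))  # (K,)
    # Log-sum-exp over hypotheses.
    terms = log_weights - 0.5 * sq_err
    peak = np.max(terms)
    return float(-(peak + np.log(np.sum(np.exp(terms - peak)))))

# Hypothetical example: two equally weighted hypotheses, one exact, one far off.
gt = np.zeros((25, 2))
means = np.stack([gt, gt + 10.0])
nll_two_modes = mixture_nll(gt, means, logits=np.zeros(2))
nll_single_mode = mixture_nll(gt, means[:1], logits=np.zeros(1))
```

With a single exact hypothesis the value is zero, and with two equally weighted hypotheses of which only one is exact it approaches log 2, which shows how probability mass assigned to wrong modes is penalized.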
To evaluate model uncertainty, we propose a second loss, which is the root mean squared error between the predicted uncertainty measure and the trajectory NLL value.
Here U is the predicted uncertainty score and RMSE is the root mean squared error function.
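The uncertainty loss can be sketched as follows for a batch of scenes. The function name `uncertainty_loss` and the averaging over a batch are assumptions for this example; the text does not state whether gradients flow through the NLL target, so the target is simply treated as a given array here.

```python
import numpy as np

def uncertainty_loss(pred_uncertainty, nll_values):
    """RMSE between predicted uncertainty scores U and per-scene NLL values.

    Both inputs are arrays of shape (batch,).
    """
    pred_uncertainty = np.asarray(pred_uncertainty, dtype=np.float64)
    nll_values = np.asarray(nll_values, dtype=np.float64)
    return float(np.sqrt(np.mean((pred_uncertainty - nll_values) ** 2)))

# Hypothetical batch of two scenes.
loss_u = uncertainty_loss([1.0, 2.0], [1.0, 4.0])
```

Training the score to regress the NLL means that scenes where the predicted mixture fits the ground truth poorly should receive a high uncertainty score.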
3.4 Implementation details
The output of our model is a set of trajectories, each containing a sequence of two-dimensional coordinates. We train our model using AdamW loshchilov2017decoupled for 40 full epochs over the training set provided by the Shifts Vehicle Motion Prediction Challenge malinin2021shifts, and then using SGD for a further 40 full epochs. We use a learning rate of for the AdamW loshchilov2017decoupled optimizer, weight decay and a batch size of 1024. We use a cosine annealing scheduler with warm-up for AdamW loshchilov2017decoupled. For the SGD optimizer we use an initial learning rate of and a cosine annealing scheduler with warm-up and restarts.
As the encoder we use the ViT-Base model, with layers number and hidden size ; we modify the final layer of ViT to output 512 features. The dimension of the samples from the normal distribution is 8. The transformer decoder consists of layers with hidden size , number of heads .
We trained our model with the ViT dosovitskiy2020image backbone for 7 days on two NVIDIA V100 GPUs with 2×32 GB of VRAM in total.
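For reference, a cosine annealing schedule with linear warm-up can be sketched as below. This covers a single cycle without restarts; the function name `lr_at`, the linear shape of the warm-up, and all numeric values in the example are hypothetical and are not the settings used in this work.

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at a given step: linear warm-up, then cosine decay."""
    if step < warmup_steps:
        # Ramp up linearly so that base_lr is reached at the end of warm-up.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical settings, for illustration only.
schedule = [lr_at(s, total_steps=1000, warmup_steps=100, base_lr=1e-3) for s in range(1001)]
```

A schedule with restarts could be obtained by applying the same function to the step index within each cycle.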
We evaluate the effectiveness of the transformer-based trajectory prediction method on the Shifts Vehicle Motion Prediction Challenge malinin2021shifts. As shown in Table 1, our method ranks 1st on the leaderboard. The main metric of the Shifts Motion Prediction Challenge is R-AUC cNLL malinin2021shifts, which gives a full picture of a model's performance by incorporating both uncertainty estimation and the accuracy of the predicted trajectories. Although the method ranked second by R-AUC cNLL achieves slightly better trajectory accuracy in terms of ADE, FDE, and cNLL, our transformer-based approach produces more reliable uncertainties, which is reflected in a better overall R-AUC cNLL.
| Team Name | R-AUC cNLL | cNLL | Min ADE (k=5) | Min FDE (k=5) |
|---|---|---|---|---|
| Alexey & Dmitry | 2.62 | 15.59 | 0.50 | 0.94 |
| NTU CMLab Mira | 8.63 | 61.86 | 0.80 | 1.72 |
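The Min ADE and Min FDE columns follow the usual definitions for multi-hypothesis prediction: the average and the final displacement error of the best of the k predicted trajectories. The sketch below illustrates these definitions; the function name `min_ade_fde` and the toy data are assumptions for the example, and it does not reproduce the challenge's evaluation code.

```python
import numpy as np

def min_ade_fde(gt, preds):
    """Minimum average and final displacement errors over K hypotheses.

    gt:    (T, 2) ground truth trajectory.
    preds: (K, T, 2) predicted trajectories.
    Returns (min_ade, min_fde); each minimum is taken independently.
    """
    dist = np.linalg.norm(preds - gt[None], axis=-1)  # (K, T) Euclidean errors
    min_ade = float(dist.mean(axis=1).min())
    min_fde = float(dist[:, -1].min())
    return min_ade, min_fde

# Toy example: one hypothesis is 5 units off everywhere, the other 1 unit off.
gt_traj = np.zeros((3, 2))
hypotheses = np.stack([
    np.tile([3.0, 4.0], (3, 1)),
    np.tile([0.0, 1.0], (3, 1)),
])
ade, fde = min_ade_fde(gt_traj, hypotheses)
```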
In this work, we propose a fully transformer-based trajectory prediction model. It outperforms previous CNN-based and graph-based methods on the task of uncertainty-aware motion prediction. Our model achieves state-of-the-art performance and ranks 1st on the Shifts Motion Prediction Challenge.