1 Introduction
Driving a car is a complex activity: drivers must understand multi-actor scenes in real time and act in a rapidly changing environment within a fraction of a second. To fully rely on autonomous vehicles to drive autonomously, it is desirable to correctly assess the confidence of the algorithm in its predictions, including under distributional shift, e.g. on roads, in cities, or in countries unseen by the algorithm.
Fully relying on autonomous vehicles also requires confidence in a high level of generalization of all algorithms used for autonomous driving.
The motivation to understand and predict human motion is immense, and it has a deep impact on related topics such as decision making, path planning, autonomous navigation, surveillance, tracking, scene understanding, and anomaly detection.
The problem of forecasting where cars will be in the near future is, however, ill-posed by nature: human beings tend to be unpredictable in their decisions, and car driving is not exempt from this. This randomness in motion poses an open challenge for prediction algorithms, which are expected to be accurate while correctly grasping the uncertainty associated with their predictions.
The contributions of this work are summarized as follows: 1) we propose a unified transformer-based motion prediction framework for both multimodal trajectory prediction and uncertainty estimation; 2) our proposed approach achieves state-of-the-art performance and ranks 1st on the Shifts Vehicle Motion Prediction Competition.
2 Related Work
The motion prediction task is one of the most important in the field of autonomous driving and has recently attracted a lot of attention from both academia and industry [alahi2016social, lee2017desire, gupta2018social, chai2019multipath, salzmann2020trajectron++, postnikov2020hsfm].
Broadly, modern motion prediction methods can be divided into two classes:
1) Models where scene context information is processed from vectorized maps (HD maps) [gao2020vectornet, liang2020learning].
2) Models where high-definition maps and the surroundings of each vehicle in the car's vicinity are rasterized into an image representation, thus providing the complete context and information necessary for accurate prediction of future trajectories [djuric2020uncertainty, postnikov2021covariancenet, fang2020tpnet].
Recently, models based on transformer architectures have shown their applicability both to computer vision tasks [dosovitskiy2020image, liu2021swin] and to sequence-to-sequence tasks [radford2018improving, devlin2018bert], which indicates a high potential of transformer-based approaches for the motion prediction task.
3 Method
We assume that object detections and tracks are provided by the perception stack (running on the Yandex self-driving car (SDC) fleet [malinin2021shifts]) and focus only on motion prediction.
The goal of the proposed method is to predict the most likely movement trajectories of vehicles a fixed number of seconds into the future, together with the model's scalar uncertainty estimate. The latter can be used in subsequent SDC pipeline algorithms as an estimate of forecast uncertainty: low estimated uncertainty indicates a scene context that is familiar or low-risk, while high estimated uncertainty indicates an unfamiliar or high-risk context.
In this section we describe the architecture of our model, the loss function used for training and implementation details.
The future is ambiguous, and human motion is unpredictable and multimodal by nature. To account for this multimodality, we produce up to K=5 different hypotheses (proposals) for the future trajectory, together with their probabilities, which are evaluated against the ground-truth trajectory.
3.1 Input representation
Context information about the state of dynamic objects (i.e., vehicles and pedestrians), described by their position, velocity, linear acceleration, and orientation, together with context information about the HD map, including lane information (e.g., traffic direction, lane priority, speed limit, traffic light association), road boundaries, crosswalks, and traffic light states, is rasterized into multi-channel images that are passed to the transformer-based model.
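As a rough illustration of this input representation, the sketch below stacks pre-rendered map layers and agent occupancy frames into one multi-channel image. The channel layout, raster size, and layer count are illustrative assumptions, not the exact configuration of the Shifts renderer.

```python
import numpy as np

H = W = 224          # assumed raster size fed to the encoder
N_CHANNELS = 17      # assumed total: HD-map layers + agent occupancy frames

def make_raster(map_layers, agent_history):
    """Stack pre-rendered HD-map layers (lanes, crosswalks, traffic lights, ...)
    and past agent occupancy grids into one (N_CHANNELS, H, W) image."""
    raster = np.zeros((N_CHANNELS, H, W), dtype=np.float32)
    for c, layer in enumerate(map_layers):          # static map channels first
        raster[c] = layer
    offset = len(map_layers)
    for t, occupancy in enumerate(agent_history):   # then dynamic-object history
        raster[offset + t] = occupancy
    return raster
```

Each channel is a binary or scalar grid in the ego-centered frame; the stacked tensor is what the patch-splitting step of the encoder consumes.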
3.2 Model architecture
Our method can be represented as image-based regression. The model architecture, shown in Fig. 1, consists of two main stages:

A transformer-based image-processing encoder, namely a modified ViT [dosovitskiy2020image] (adapted to process multi-channel images), acts as a current-state estimator for a single vehicle. ViT uses multi-head self-attention [vaswani2017attention], removing the image-specific inductive biases of CNN approaches; its self-attention layers integrate information globally across the entire image, so ViT does not lack a global understanding of the input.

A transformer-based decoder which predicts different hypotheses (proposals) for the future trajectory, with corresponding confidence values normalized using the softmax operator so that they sum to 1. Apart from predicting multi-trajectory plans, the proposed model predicts an overall scene uncertainty score, which is described in more detail later.
Multi-channel images are initially split into fixed-size patches and processed further by the visual transformer model. The dense encoded latent state produced by ViT is then repeated K times, according to the number of desired trajectory proposals.
Each repeated latent state is concatenated with samples from a Normal distribution, which, according to our internal experiments, gives minor metric improvements compared to sinusoidal positional encoding [vaswani2017attention]; this can be explained by the absence of a relative or absolute positional correlation between sequential states. At the same time, the Normal samples concatenated with the repeated latent state help the decoder transform identical latent copies into more diverse trajectories. Moreover, the ability of multi-head attention to attend globally, and thus learn long-range relationships, provides more opportunities for a correct assessment of uncertainties.
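The encode-repeat-concatenate-decode flow above can be sketched at the shape level as follows. The encoder and decoder are stubbed with random projections; all dimensions except K=5 and the 512-feature latent (stated in Sec. 3.4) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5            # number of trajectory proposals
D_LATENT = 512   # ViT output features (Sec. 3.4)
D_NOISE = 8      # dimension of the Normal samples (Sec. 3.4)
T = 25           # assumed prediction horizon in timesteps

def predict(latent):
    """latent: (D_LATENT,) scene encoding from the ViT encoder (stubbed here)."""
    repeated = np.tile(latent, (K, 1))                      # (K, D_LATENT)
    noise = rng.standard_normal((K, D_NOISE))               # diversifies proposals
    decoder_in = np.concatenate([repeated, noise], axis=1)  # (K, D_LATENT + D_NOISE)
    # stand-ins for the transformer decoder heads:
    W_traj = rng.standard_normal((D_LATENT + D_NOISE, T * 2)) * 0.01
    W_conf = rng.standard_normal(D_LATENT + D_NOISE) * 0.01
    trajs = decoder_in @ W_traj                             # (K, T*2)
    logits = decoder_in @ W_conf                            # (K,)
    confs = np.exp(logits - logits.max())
    confs /= confs.sum()                                    # softmax: sums to 1
    return trajs.reshape(K, T, 2), confs
```

The point of the sketch is the data flow: one latent state fans out into K noise-perturbed decoder inputs, yielding K trajectories plus a softmax-normalized confidence per trajectory.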
3.3 Loss function
We model possible future trajectories as a mixture of Gaussian distributions, since, compared to the widely used ADE loss, this allows the model to predict a multimodal distribution; examples of the model's predictions are shown in Fig. 4. Our network outputs the mean positions $\mu_t^k$ of the Gaussians, while we fix the covariance of every Gaussian in the mixture to the identity matrix $I$. For each trajectory the model also predicts a probability $c_k$, normalized using the softmax operator such that
(1) $\sum_{k=1}^{K} c_k = 1$.
Then, for the loss function we can use the negative log-likelihood (NLL) of the mixture of Gaussians defined by the predicted proposals, given the ground-truth coordinates $y_t$:
(2) $p(y_{1:T}) = \sum_{k=1}^{K} c_k \prod_{t=1}^{T} \mathcal{N}(y_t \mid \mu_t^k, I)$,
where T is the prediction horizon and K is the number of hypotheses.
We compute the negative log-probability of the ground-truth trajectory under the predicted mixture of Gaussians, with means equal to the predicted trajectories and the identity matrix $I$ as covariance:
(3) $\mathcal{L}_{NLL} = -\log \sum_{k=1}^{K} c_k \prod_{t=1}^{T} \mathcal{N}(y_t \mid \mu_t^k, I)$.
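A minimal sketch of this NLL computation, assuming identity covariance and dropping the constant normalization term of the Gaussian density; log-sum-exp is used for numerical stability. Array shapes are the illustrative (K, T, 2) layout from the architecture description.

```python
import numpy as np

def mixture_nll(trajs, confs, gt):
    """NLL of gt under a Gaussian mixture with identity covariance.
    trajs: (K, T, 2) proposal means; confs: (K,) softmax probabilities;
    gt: (T, 2) ground-truth trajectory. Constant terms dropped."""
    sq = ((trajs - gt[None]) ** 2).sum(axis=(1, 2))   # (K,) summed squared error
    log_terms = np.log(confs) - 0.5 * sq              # log c_k + log N(y | mu_k, I)
    m = log_terms.max()                               # log-sum-exp trick
    return -(m + np.log(np.exp(log_terms - m).sum()))
```

A proposal matching the ground truth with probability 1 gives zero loss; distant or low-confidence proposals increase it.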
In order to evaluate model uncertainty, we propose a second loss, which is the Root Mean Squared Error between the predicted uncertainty measure and the trajectory NLL value:
(4) $\mathcal{L}_U = \mathrm{RMSE}(U, \mathcal{L}_{NLL})$,
where U is the predicted uncertainty score and RMSE is the Root Mean Squared Error function.
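This uncertainty loss can be sketched over a batch of scenes as follows, assuming the predicted scores and the (detached) per-scene NLL values are simple arrays:

```python
import numpy as np

def uncertainty_loss(u_pred, nll_values):
    """RMSE between predicted uncertainty scores and per-scene NLL targets.
    u_pred, nll_values: (B,) arrays; returns a scalar."""
    u_pred = np.asarray(u_pred, dtype=np.float64)
    nll_values = np.asarray(nll_values, dtype=np.float64)
    return float(np.sqrt(np.mean((u_pred - nll_values) ** 2)))
```

Regressing the uncertainty head onto the NLL target is what lets the score act as a proxy for forecast difficulty at inference time, when no ground truth is available.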
3.4 Implementation details
The output of our model is K trajectories, each containing two-dimensional coordinates. We train our model using AdamW [loshchilov2017decoupled], optimizing for 40 full epochs on the training set provided by the Shifts Vehicle Motion Prediction Challenge [malinin2021shifts], followed by SGD for another 40 full epochs. For the AdamW optimizer [loshchilov2017decoupled] we use weight decay, a batch size of 1024, and a cosine annealing learning-rate scheduler with warmup. For the SGD optimizer we use a cosine annealing scheduler with warmup and restarts. As the encoder we utilise a ViT-Base model, modifying the final layer of ViT to output 512 features. The dimension of the samples from the Normal distribution is 8, and the decoder is a multi-layer transformer with multi-head attention.
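The two schedules described above can be sketched as plain learning-rate functions: cosine annealing with linear warmup for the AdamW stage, and cosine annealing with warm restarts for the SGD stage. Warmup length, cycle length, and base learning rates are assumed values, not the exact challenge configuration.

```python
import math

def cosine_warmup_lr(step, total_steps, base_lr, warmup_steps):
    """Linear warmup, then a single cosine decay to zero (AdamW stage)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))

def cosine_restarts_lr(step, cycle_steps, base_lr):
    """Cosine annealing that restarts at base_lr every cycle (SGD stage)."""
    t = (step % cycle_steps) / cycle_steps
    return 0.5 * base_lr * (1 + math.cos(math.pi * t))
```

In a training loop, the returned value would be written into the optimizer's learning rate at every step.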
We trained our model with the ViT [dosovitskiy2020image] backbone for 7 days on two NVIDIA V100 GPUs (2×32 GB VRAM in total).
4 Experiments
We evaluate the effectiveness of our transformer-based trajectory prediction method on the Shifts Vehicle Motion Prediction Challenge [malinin2021shifts]. As shown in Table 1, our method ranks 1st on the leaderboard. The challenge's main metric is R-AUC cNLL [malinin2021shifts], which provides a full picture of a model's performance, incorporating both uncertainty estimation and predicted-trajectory accuracy. Although the second-ranked method gives slightly better trajectory accuracy in terms of ADE, FDE, and cNLL, our transformer-based approach produces more reliable uncertainties, which is reflected in a better overall R-AUC cNLL.
Team Name        | R-AUC cNLL | cNLL  | minADE (k=5) | minFDE (k=5)
HOME             | 3.45       | 24.63 | 0.54         | 1.07
Alexey & Dmitry  | 2.62       | 15.59 | 0.50         | 0.94
NTU CMLab Mira   | 8.63       | 61.86 | 0.80         | 1.72
SBteam (ours)    | 2.57       | 15.67 | 0.53         | 1.01
5 Conclusions
In this work, we propose a fully transformer-based trajectory prediction model. It outperforms previous CNN-based and graph-based methods at the task of uncertainty-aware motion prediction. Our model achieves state-of-the-art performance and ranks 1st on the Shifts Motion Prediction Challenge.