Multimodal trajectory prediction for road agents has been studied extensively in recent years, and several groups of methods have emerged. One common approach is to apply Computer Vision (CV) models to rasterized scene images [multimodaldcn, hdmapsconvnets]. However, powerful CV models have a large number of parameters and can be computationally intensive and slow at inference. They also encode a large amount of information from the entire scene that is irrelevant to the underlying process. To overcome these drawbacks, most state-of-the-art approaches rely on a graph representation of the scene [VectorNet, LaneGCN]. Graph-based methods take into account only the information relevant to driving patterns and can provide better quality thanks to the expressive power of graph neural networks [GCN, GAT]. Several approaches further improve the results by increasing the complexity of the trajectory decoder [TNT, mmTransformer].
However, uncertainty estimation (UE) in the context of vehicle motion prediction (VMP) has not been widely covered yet. Previous works [uncertaintydjuric, aleatoric, RIP] consider a limited number of methods and are disconnected from state-of-the-art approaches in VMP, from recent advances in the UE field [DUQ, SNGP, DUE], or from both. On the other hand, the Bayesian deep learning field would benefit from benchmarking on large industrial datasets.
In this work we construct and benchmark a solution that addresses the needs of the VMP task while using recent advances in both the VMP and UE fields. In what follows, we describe the dataset and our solution.
In our work we adopt a single-forward-pass method for uncertainty estimation. The resulting model has two parts: a multimodal motion prediction component and an external neural network [single_det] that models the uncertainty. The general setting is shown in Figure 1. Both components use a graph representation of the scene, described in the following section, and process it with graph neural networks (GNNs).
2.1 Problem Statement
The Vehicle Motion Prediction part of the Shifts Dataset [Shifts] contains 5 seconds of past and 5 seconds of future states for all agents in a scene, along with overall scene features. The goal of the challenge is to build a model that predicts future trajectories in the horizon of timesteps along with their confidences and overall scene uncertainty for each scene .
The dataset , where are features of the scenes, are ground truth trajectories and is the scene index, is divided into 3 parts: the training part contains only scenes from Moscow with no precipitation, whereas the development and evaluation parts include other locations and precipitation types in order to create a data-shift setting.
2.2 Data Preprocessing
We extract geometries for parts of the road lanes and crosswalk polygons from the provided raw data and use the past observed trajectories of all agents in the scene as they are. Following VectorNet, we refer to these elementary geometries as polylines.
We then transform the coordinates so that the origin is located at the target agent’s last observed position and the vehicle is headed in the positive direction of the x-axis. We select points within a square bounding box of size meters centered at the origin in order to capture only relevant information. After that, we redistribute the points of lane and crosswalk geometries to impose a constant distance of meters between adjacent points. Finally, we create fully-connected subgraphs for each polyline and invert them. After all these operations, the whole scene is represented as a graph with fully-connected components that correspond to polylines. An example of a processed scene is shown in Figure 1.
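The coordinate normalization, cropping and point redistribution steps can be illustrated with a minimal NumPy sketch. The function names, the `half_size` and `step` parameters, and the use of linear interpolation along arc length are assumptions made for illustration; the text does not specify how the redistribution is implemented.

```python
import numpy as np

def to_agent_frame(points, origin, yaw):
    """Translate so `origin` becomes (0, 0), then rotate by -yaw (radians)
    so that the agent's heading points along the positive x-axis."""
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot = np.array([[c, -s], [s, c]])
    shifted = np.asarray(points, dtype=float) - np.asarray(origin, dtype=float)
    return shifted @ rot.T

def crop_to_box(points, half_size):
    """Keep points inside a square box centered at the origin.
    `half_size` (hypothetical name) is half of the box side length."""
    points = np.asarray(points, dtype=float)
    mask = np.all(np.abs(points) <= half_size, axis=1)
    return points[mask]

def resample_polyline(points, step):
    """Redistribute points so that neighbours are `step` apart in arc length.
    Assumes consecutive input points are distinct."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg)])      # arc length at each point
    targets = np.arange(0.0, arc[-1] + 1e-9, step)     # equally spaced arc lengths
    x = np.interp(targets, arc, points[:, 0])
    y = np.interp(targets, arc, points[:, 1])
    return np.stack([x, y], axis=1)
```

Note that the spacing is constant along the polyline itself; near sharp corners the straight-line distance between two resampled neighbours can be slightly shorter than `step`.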
We construct the node feature matrix of the form , where , are the coordinates of the "start" and "end" adjacent points that formed a node after inversion, are feature vectors that we fill differently for the various polyline types, and is the node index. Namely, we use maximum speed, lane priority and lane availability for lane polylines. The lane availability status takes into account the traffic light state only at the latest available timestamp. For agent polylines we use the timestamp, the velocity and acceleration vectors, and the yaw. It is worth mentioning that, in contrast to VectorNet, we do not use integer ids of polylines as features.
2.3 Multimodal Model
We adopt the VectorNet [VectorNet] architecture to obtain a hidden representation for each scene. VectorNet is a hierarchical graph neural network that first builds vector representations for each polyline subgraph and then propagates the signal among all obtained representations. The model is trained using the corrected NLL of a Gaussian mixture as the loss function.
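To make the training objective more concrete, the following is a minimal sketch of a plain Gaussian-mixture negative log-likelihood over trajectory modes. It assumes unit-variance isotropic components centred on the predicted modes and mixture weights given by the mode confidences; the exact "corrected" form used for training is not reproduced here, and the function name and array shapes are illustrative.

```python
import numpy as np

def mixture_nll(modes, confidences, target):
    """NLL of `target` under a mixture of unit-variance Gaussians.

    modes:       (K, T, D) predicted trajectories, one per mode
    confidences: (K,) mixture weights, assumed positive and summing to 1
    target:      (T, D) ground-truth trajectory
    """
    modes = np.asarray(modes, dtype=float)
    target = np.asarray(target, dtype=float)
    conf = np.asarray(confidences, dtype=float)
    T, D = target.shape
    sq_err = np.sum((modes - target) ** 2, axis=(1, 2))            # (K,)
    # log N(target | mode_k, I) over all T * D coordinates
    log_comp = -0.5 * sq_err - 0.5 * T * D * np.log(2.0 * np.pi)
    log_terms = np.log(conf) + log_comp
    m = np.max(log_terms)                                          # log-sum-exp trick
    return -(m + np.log(np.sum(np.exp(log_terms - m))))
```

Under this form the loss is small only when a mode close to the ground truth also receives a high confidence, which is what couples the trajectory and confidence outputs during training.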
At the first stage we obtain embeddings for each node of the fully-connected subgraphs by applying message passing [MessagePassing]:
where is the representation of the node with index at layer , is the neighbourhood of node , which in our case is formed by all other nodes in the polyline, is the aggregation function, and are the update and message functions, respectively. We use as the aggregation function, an MLP as the message function and concatenation as the update function. After that we apply max pooling on top of the node embeddings in order to get a vector representation for the whole polyline with index .
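The first stage can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the aggregation function is assumed to be an element-wise max (as in VectorNet), the message function is a one-layer MLP with ReLU, the update concatenates a node's own encoded features with the aggregated message, and all function names are hypothetical.

```python
import numpy as np

def mlp(x, w, b):
    """One linear layer with ReLU, standing in for the message function."""
    return np.maximum(x @ w + b, 0.0)

def subgraph_layer(h, w, b):
    """One message-passing layer on a fully-connected polyline subgraph.

    h: (N, F) node features, w: (F, H), b: (H,). Returns (N, 2 * H).
    """
    msg = mlp(h, w, b)                               # (N, H) encoded node features
    n = h.shape[0]
    agg = np.empty_like(msg)
    for i in range(n):
        if n > 1:
            others = np.delete(msg, i, axis=0)       # neighbourhood: all other nodes
            agg[i] = others.max(axis=0)              # assumed max aggregation
        else:
            agg[i] = msg[i]                          # single-node polyline fallback
    return np.concatenate([msg, agg], axis=1)        # update by concatenation

def encode_polyline(h, layers):
    """Apply the layers in order, then max-pool nodes into one polyline vector."""
    for w, b in layers:
        h = subgraph_layer(h, w, b)
    return h.max(axis=0)
```

Because both the aggregation and the final pooling are symmetric in the nodes, the resulting polyline vector does not depend on the order in which nodes are listed.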
At the second stage we construct a fully-connected graph from all polyline representations and apply a graph transformer convolution [graphtransfromer]:
Here are weight matrices and is a hidden dimension.
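As an illustration of the second stage, the sketch below implements a single-head attention-based graph convolution over a fully-connected graph of polyline embeddings. It assumes the common form with a linear skip connection plus attention-weighted values, scaled dot-product scores and no self-loops; the argument names (`w_skip`, `w_q`, `w_k`, `w_v`) are hypothetical, and multi-head averaging is omitted.

```python
import numpy as np

def transformer_conv(x, w_skip, w_q, w_k, w_v):
    """Single-head graph transformer convolution on a fully-connected graph.

    x: (P, F) polyline embeddings with P >= 2; all weights: (F, d).
    Returns the updated embeddings (P, d) and the attention matrix (P, P).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[1]
    scores = q @ k.T / np.sqrt(d)                    # scaled dot products
    np.fill_diagonal(scores, -np.inf)                # exclude self-loops (assumption)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)        # softmax over the other polylines
    return x @ w_skip + alpha @ v, alpha
```

Each output row mixes a polyline's own projected embedding with a convex combination of the other polylines' values, which is how information is exchanged between lanes, crosswalks and agents.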
Motivated by the directions of future work suggested in VectorNet and by the success of the stacked-transformer architecture [mmTransformer] in motion prediction, we extend the VectorNet model with a Transformer-based decoder. The whole architecture of the model is shown in Figure 2. We use all polyline embeddings from the previous step as keys and values. Queries are formed by copies of the target agent’s embedding summed with learnable vectors, one for each trajectory mode. Additionally, we add learnable positional embeddings to queries and keys in the decoder self-attention and to queries in the encoder-decoder attention, as suggested in mmTransformer.
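The way decoder queries are formed can be sketched as follows. This is a deliberately simplified illustration: it uses a single attention head without projection matrices or positional embeddings, and the function names are hypothetical.

```python
import numpy as np

def build_queries(target_embedding, mode_embeddings):
    """Copy the target agent's embedding once per mode and add that mode's vector.

    target_embedding: (d,), mode_embeddings: (K, d) learnable vectors. Returns (K, d).
    """
    target_embedding = np.asarray(target_embedding, dtype=float)
    mode_embeddings = np.asarray(mode_embeddings, dtype=float)
    return target_embedding[None, :] + mode_embeddings

def cross_attention(queries, memory):
    """Scaled dot-product attention of mode queries over polyline embeddings.

    queries: (K, d), memory: (P, d) used as both keys and values. Returns (K, d).
    """
    d = queries.shape[1]
    scores = queries @ memory.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ memory
```

The per-mode vectors are the only thing that distinguishes the queries from each other, so they are what allows the decoder to produce different trajectories for different modes.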
2.4 Uncertainty Model
We apply a spectral-normalized Gaussian process (SNGP) [SNGP] for uncertainty estimation. Since obtaining good representations is crucial for this task, we pretrain a plain VectorNet encoder with a shallow MLP layer on the same multimodal objective. We exclude the transformer decoder to ensure that information is not shared between the encoder and the transformer decoder. The embeddings obtained for the development part of the dataset are clearly separable by motion type (see Figure 3).
After that we train an SNGP head on top of the pretrained encoder using a unimodal trajectory prediction task. We insert a single spectral-normalized linear layer before the Gaussian-process output layer. During training, SNGP optimizes the posterior probability of the target agent’s trajectory with Gaussian likelihood . The random Fourier feature expansion [RFF] with dimension approximates the output as , where , and has a normal prior.
SNGP further approximates the posterior using the Laplace approximation , where is the MAP estimate obtained by gradient descent, and the update rule for with ridge factor and discount factor is written as follows: , . We estimate the uncertainty as the posterior variance for the scene.
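The uncertainty head can be illustrated with a small NumPy sketch of a random-Fourier-feature Gaussian process layer with a Laplace-style precision update. It assumes the formulation commonly used for SNGP with a Gaussian likelihood: cosine random features with fixed random weights, a precision matrix initialized to the ridge factor times the identity, and a discounted moving-average update with the outer products of the features. The class name, method names and default values are hypothetical, and spectral normalization and the trained output weights are omitted.

```python
import numpy as np

class RFFGaussianProcessHead:
    """Random-feature GP head that tracks a Laplace-style precision matrix."""

    def __init__(self, in_dim, num_features, ridge=1.0, discount=0.999, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(in_dim, num_features))            # fixed projection
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)   # fixed phases
        self.scale = np.sqrt(2.0 / num_features)
        self.ridge = ridge
        self.discount = discount
        self.reset_precision()

    def reset_precision(self):
        """Restart from the prior precision (ridge * identity)."""
        self.precision = self.ridge * np.eye(self.w.shape[1])

    def features(self, h):
        """Random Fourier features of hidden representations h with shape (N, in_dim)."""
        return self.scale * np.cos(h @ self.w + self.b)

    def update_precision(self, h_batch):
        """Discounted moving-average update with the batch's feature outer products."""
        phi = self.features(h_batch)
        self.precision = (self.discount * self.precision
                          + (1.0 - self.discount) * phi.T @ phi)

    def variance(self, h):
        """Posterior variance phi^T Sigma phi for each row of h."""
        phi = self.features(h)
        cov = np.linalg.inv(self.precision)
        return np.einsum("nd,de,ne->n", phi, cov, phi)
```

In such a setup `update_precision` would be called on every training batch, `reset_precision` when the covariance is restarted, and `variance` at test time to obtain a per-scene uncertainty score.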
3 Experiments and Results
The model was implemented using PyTorch [PyTorch] and PyTorch Geometric [PyG] and trained on a single RTX 3090 GPU on of the data with a batch size of 32.
The VectorNet encoder uses 3 layers of subgraph message passing with a hidden dimension of 64 and a transformer convolution layer with 2 attention heads of size 64. The heads are averaged rather than concatenated. The decoder has 2 transformer blocks with 4 heads and a hidden dimension of 128. The feed-forward network has 2 linear layers with an intermediate dimension of 256. The inputs (state + mode embeddings) are not scaled before the positional embeddings are added.
The SNGP model uses the pretrained encoder with the same parameters. We set , , and . Tuning these parameters can be tricky, since SNGP is sensitive to the batch size and the volume of data.
For all models the initial learning rate was set to and the Adam optimizer was used. For the multimodal model we used the MultiStepLR scheduler with milestones with a decay factor , and the model was trained for a total of epochs. We pretrained the VectorNet encoder for epochs with decay every 5 epochs by a factor of 0.3. The SNGP head was trained for epochs. We reset the covariance matrix at the end of each epoch.
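The two learning-rate schedules mentioned above can be written in closed form, as in the following pure-Python sketch. The function names are illustrative, and any learning rate, milestone or decay value passed to `multistep_lr` is a placeholder chosen by the caller rather than a setting from these experiments; only the step size of 5 epochs and the factor 0.3 for the encoder pretraining come from the text.

```python
from bisect import bisect_right

def multistep_lr(initial_lr, milestones, gamma, epoch):
    """Learning rate at `epoch` when it is multiplied by `gamma` at each milestone.

    `milestones` must be sorted in increasing order.
    """
    return initial_lr * gamma ** bisect_right(milestones, epoch)

def step_lr(initial_lr, step_size, gamma, epoch):
    """Learning rate at `epoch` when it is multiplied by `gamma` every `step_size` epochs."""
    return initial_lr * gamma ** (epoch // step_size)
```

For example, `step_lr(lr, 5, 0.3, epoch)` describes the encoder pretraining schedule for a given initial rate `lr`.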
The results on the evaluation data are shown in Table 1. As shown in Figure 3, there is an expected trend: higher values of CNLL correlate with higher values of the uncertainty measure.
We considered a composite model for multimodal trajectory prediction and uncertainty estimation in the vehicle motion prediction task. The resulting model shows good performance on the multimodal task and the ability to detect out-of-distribution examples.