## 1 Introduction

One of the key components of a self-driving system is motion prediction [werling2010optimal, casas2018intentnet]. An autonomous vehicle (AV) must reliably predict the future trajectories of other traffic agents, such as cars, cyclists, and pedestrians. However, future motion prediction and AV route planning remain very challenging problems and are yet to be solved for arbitrary environment scenarios. In this paper, we tackle the motion prediction task. The most prominent approaches include image-based models that leverage bird's-eye-view rasterized scene representations [lee2017desire, cui2019multimodal, chai2019multipath, hong2019rules, phan2020covernet, kawasaki2021multimodal] and methods built on graph neural networks [casas2019spatially, gao2020vectornet, zhao2020tnt].

We establish a simple yet efficient motion prediction baseline based purely on Convolutional Neural Networks (CNNs). Our model takes a raster image centered around a target agent as input and directly predicts a set of possible trajectories along with their confidences. The raster image is obtained by rasterizing the scene and the history of all the agents. We evaluate our model on the 2021 Waymo Open Dataset Motion Prediction Challenge [waymoOpenMotion2021], where it achieves very competitive performance: it ranks 1st by minimum average displacement error and 3rd by mAP score. We open-source our code and hope that our baseline will provide a reference for future research.

## 2 Method

We assume that object tracks are provided by some perception system [qi2021offboard, yang2021auto4d] and focus only on the motion prediction. Our task is to predict the trajectory of an agent for the next seconds in the future. In this section, we first describe how we rasterize the data and produce multi-channel images. After that, we describe the architecture of our model and the loss function used for training.

#### Rasterization

To generate training images from raw data, we rasterize historical trajectories of the agents along with the corresponding map providing a context for the road environment. To standardise the input, we rotate and shift the frame in such a way that the target agent at the time of prediction is always located at a fixed position on the raster image and its velocity is aligned with the X-axis.
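The standardisation step can be illustrated with a minimal NumPy sketch. It assumes 2D world coordinates in meters; the function names, the `anchor_px` position and the `pixels_per_meter` scale are hypothetical and chosen only for illustration, and image-axis flips are ignored.

```python
import numpy as np

def to_agent_frame(points, agent_xy, agent_velocity):
    """Map world-frame (N, 2) points into a frame where the target agent
    sits at the origin and its velocity points along the +X axis."""
    heading = np.arctan2(agent_velocity[1], agent_velocity[0])
    cos_h, sin_h = np.cos(heading), np.sin(heading)
    # Rotation by -heading, so that the velocity direction maps onto +X.
    rot = np.array([[cos_h, sin_h],
                    [-sin_h, cos_h]])
    shifted = np.asarray(points, dtype=float) - np.asarray(agent_xy, dtype=float)
    return shifted @ rot.T

def to_pixels(points_agent, anchor_px, pixels_per_meter):
    """Place agent-frame points on the raster so the agent lands at anchor_px.
    (Hypothetical convention: no axis flip, uniform scale.)"""
    return np.asarray(points_agent) * pixels_per_meter + np.asarray(anchor_px)
```

With this convention, a point directly ahead of the agent always ends up on the positive X-axis of the agent frame, regardless of the agent's heading in world coordinates.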

#### Model

The future is ambiguous, so we aim to produce different hypotheses (proposals) for the future trajectory, which will be evaluated against the ground truth trajectory. We formulate our model as image-based regression. Our model consists of a CNN backbone pretrained on ImageNet [imagenet] with one fully-connected layer attached on top (see Fig. 2). The model takes a multi-channel raster image as input and predicts trajectories along with the corresponding confidence values, which are normalized using the softmax operator so that they sum to one.

#### Loss function

The straightforward solution would be to use a Mean Squared Error (MSE) loss. However, this loss does not allow probabilistic modelling of multiple hypotheses, and it showed poor performance in our preliminary experiments. Instead, we propose to model possible future trajectories as a mixture of Gaussian distributions. In this case, our network outputs the means of the Gaussians, while we fix the covariance of every Gaussian in the mixture to be equal to the identity matrix.
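To make the network output concrete, here is a minimal sketch of how a flat output vector of the fully-connected layer could be interpreted as mixture means and confidences. The memory layout (all coordinates first, then one logit per mode) and the names `decode_head`, `num_modes` and `num_steps` are assumptions made for illustration, not a description of the released code.

```python
import numpy as np

def decode_head(flat_output, num_modes, num_steps):
    """Split a flat vector into (num_modes, num_steps, 2) trajectory means
    and num_modes confidences that sum to one."""
    flat_output = np.asarray(flat_output, dtype=float)
    n_coords = num_modes * num_steps * 2
    if flat_output.shape != (n_coords + num_modes,):
        raise ValueError("unexpected output size")
    # Assumed layout: coordinates first, confidence logits last.
    means = flat_output[:n_coords].reshape(num_modes, num_steps, 2)
    logits = flat_output[n_coords:]
    # Numerically stable softmax over the mode logits.
    exp = np.exp(logits - logits.max())
    confidences = exp / exp.sum()
    return means, confidences
```

Under this layout, the fully-connected layer needs `num_modes * (2 * num_steps + 1)` output units.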

For the loss, we then use the *negative log-likelihood (NLL)* of this mixture of Gaussians, defined by the predicted proposals, given the ground truth coordinates. In other words, given a ground truth trajectory and the predicted trajectory hypotheses, we compute the negative log probability of the ground truth trajectory under the predicted mixture of Gaussians, whose means are equal to the predicted trajectories and whose covariance is the identity matrix. Each mixture component is thus the probability density function of a multivariate Gaussian distribution, weighted by the corresponding confidence. Because the covariance is the identity, each density factorizes into a product of 1-dimensional Gaussians, and the loss reduces to the negative logarithm of a sum of exponentials.

The proposed loss function does not explicitly penalize the model for producing very close trajectories. However, empirically we did not observe mode collapse, because concentrating all the probability mass in one mode is a higher-risk strategy that leads to higher loss values in case of a misprediction. Therefore, optimizing the proposed loss yields sufficient multimodality.
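The loss described above can be sketched in a few lines of NumPy. In notation chosen here for illustration, with ground truth points $y_t$, predicted points $\hat{y}^k_t$ of hypothesis $k$ and confidences $c_k$, the NLL equals $-\log \sum_k \exp\big(\log c_k - \tfrac{1}{2}\sum_t \lVert y_t - \hat{y}^k_t \rVert^2\big)$ up to an additive constant that comes from the Gaussian normalization and does not affect the gradients. The function name and argument shapes below are assumptions.

```python
import numpy as np

def mixture_nll(gt, pred, logits):
    """NLL of a ground truth trajectory under a Gaussian mixture with
    identity covariance, with the constant normalization term dropped.

    gt:     (T, 2) ground truth coordinates
    pred:   (K, T, 2) predicted trajectories (the mixture means)
    logits: (K,) unnormalized confidences
    """
    gt = np.asarray(gt, dtype=float)
    pred = np.asarray(pred, dtype=float)
    logits = np.asarray(logits, dtype=float)
    # Log-softmax gives the log of the mixture coefficients.
    log_c = logits - np.logaddexp.reduce(logits)
    # Squared error of every hypothesis, summed over time steps and x/y.
    sq_err = np.sum((pred - gt[None]) ** 2, axis=(1, 2))
    # Log-sum-exp over the hypotheses, computed stably.
    return -np.logaddexp.reduce(log_c - 0.5 * sq_err)
```

The log-sum-exp form avoids underflow when all hypotheses are far from the ground truth, which is typical at the start of training.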

#### Inference

We set the number of components in the mixture equal to the desired number of predicted hypotheses. For example, during evaluation on the Waymo Open Motion Dataset [waymoOpenMotion2021] we are allowed to provide only a limited number of hypotheses of the future trajectory for a target agent, so we use that limit as the number of components. Since we model the space of possible solutions with a probability distribution, it is beneficial to produce the most diverse set of hypotheses from this distribution. One way to achieve this is to simply take the means of the components of the predicted mixture of Gaussians as the final hypotheses for evaluation, with the mixture coefficients as their confidences. While we admit that this might not be the optimal solution, we leave the exploration of other ways to sample trajectories from the predicted distribution for future work.

## 3 Experiments

### 3.1 Dataset

We evaluate our approach on the Waymo Open Motion Dataset [waymo, waymoOpenMotion2021] by submitting our predictions to the Waymo motion prediction challenge [waymo_leaderboard_motion_21]. This dataset contains object trajectories and corresponding 3D maps for segments. Each segment is a seconds recording of an object trajectory at Hz and map data for the area covered by the segment. A single sample comprises second of history and seconds of future data obtained by breaking the segments into 9-second windows with 5-second overlap. Every such sample contains up to agents marked as "valid", for which the model needs to predict their positions for seconds into the future.

#### Rasterisation details

We create a preprocessing pipeline for the Waymo Open Motion Dataset [waymoOpenMotion2021] that converts the raw data in TFRecord format into multi-channel raster images for each target object. For every agent we have second of history, which is provided as snapshots taken at Hz, and a snapshot at the time of prediction (current). So, in total we have snapshots for every dynamic object. We use the raster size , where the first three channels are the RGB map (road lines, crosswalks, traffic lights, etc.), and every history snapshot is represented by two extra channels: (1) the mask representing the location of the target agent, and (2) the mask representing all other agents nearby (see Fig. 3). To eliminate the redundant degrees of freedom, we shift and rotate the local coordinate system so that the center of the target agent is located at pixel coordinate and its velocity is aligned with the X-axis of the image.

One of the major bottlenecks in the data pipeline is the speed of image rasterisation. To make training faster, we cache the rasterised images to disk as compressed npz files and, during training, simply load them from disk instead of performing costly online rasterisation. This results in a significant speedup, enabling us to read more than a hundred images per second using a single process.
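The caching step can be sketched with NumPy's compressed archive format. The array names (`raster`, `future_xy`), the `uint8` raster dtype and the toy shapes are assumptions for illustration; only the use of compressed npz files comes from the text.

```python
import os
import tempfile
import numpy as np

def cache_raster(path, raster, future_xy):
    """Write one training sample to a compressed .npz archive."""
    np.savez_compressed(path, raster=raster, future_xy=future_xy)

def load_raster(path):
    """Read a cached sample back; arrays are decompressed on access."""
    with np.load(path) as data:
        return data["raster"], data["future_xy"]

# Round trip on a toy sample written to a temporary directory.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.npz")
toy_raster = np.zeros((8, 8, 5), dtype=np.uint8)
toy_raster[2, 3, 0] = 255
toy_future = np.ones((4, 2), dtype=np.float32)
cache_raster(sample_path, toy_raster, toy_future)
loaded_raster, loaded_future = load_raster(sample_path)
```

Raster masks are mostly constant, so they tend to compress well, which keeps the on-disk cache small relative to the uncompressed arrays.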

We create raster images only for the agents with the flag equal to (meaning that they are "valid"). Therefore, in total, we obtain training, validation and test images.

Table 1: Results on the test and validation sets of the Waymo Open Motion Dataset.

| Split | Method | mAP | Min ADE | Min FDE | Miss Rate | Overlap Rate |
|---|---|---|---|---|---|---|
| Test | Waymo LSTM baseline [waymo_leaderboard_motion_21] | 0.1756 | 1.0065 | 2.3553 | 0.3750 | 0.1898 |
| Test | ReCoAt (2nd place) [ReCoAt] | 0.2711 | 0.7703 | 1.6668 | 0.2437 | 0.1642 |
| Test | DenseTNT (1st place) [denseTNT2021] | 0.3281 | 1.0387 | 1.5514 | 0.1573 | 0.1779 |
| Test | MotionCNN-Xception71 (Ours) | 0.2136 | 0.7400 | 1.4936 | 0.2091 | 0.1560 |
| Val | MotionCNN-ResNet18 (Ours) | 0.1920 | 0.8154 | 1.6396 | 0.2552 | 0.1605 |
| Val | MotionCNN-Xception71 (Ours) | 0.2123 | 0.7383 | 1.4957 | 0.2072 | 0.1576 |

Table 2: Results of MotionCNN-Xception71 per object type.

| Split | Object Type | mAP | Min ADE | Min FDE | Miss Rate | Overlap Rate |
|---|---|---|---|---|---|---|
| Test | Vehicle | 0.2357 | 0.8946 | 1.8175 | 0.2138 | 0.0886 |
| Test | Pedestrian | 0.2175 | 0.4449 | 0.9131 | 0.1276 | 0.2725 |
| Test | Cyclist | 0.1875 | 0.8803 | 1.7501 | 0.2860 | 0.1071 |
| Test | Avg | 0.2136 | 0.7400 | 1.4936 | 0.2091 | 0.1560 |
| Val | Vehicle | 0.2371 | 0.8919 | 1.8154 | 0.2128 | 0.0877 |
| Val | Pedestrian | 0.2092 | 0.4387 | 0.9010 | 0.1254 | 0.2684 |
| Val | Cyclist | 0.1905 | 0.8843 | 1.7707 | 0.2835 | 0.1168 |
| Val | Avg | 0.2123 | 0.7383 | 1.4957 | 0.2072 | 0.1576 |

### 3.2 Metrics

Following the evaluation protocol in [waymoOpenMotion2021], we predict hypotheses for every target agent, but only trajectory points subsampled at Hz (which results in the subset of 2-dimensional coordinates from the predicted points) are used for computing test and validation metrics. Average Displacement Error (ADE) and Final Displacement Error (FDE) are commonly used evaluation metrics: ADE is the Euclidean distance between the ground truth trajectory and a predicted one averaged over all time steps, while FDE is the same distance at the final time step only. To evaluate multiple hypotheses we use minADE and minFDE, i.e. the minimum ADE and the minimum FDE over all predicted hypotheses.
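The displacement metrics can be sketched as follows. This is a simplified illustration that assumes every time step of the ground truth is valid; the official Waymo evaluation additionally handles validity masks and subsampling, which are omitted here. The function name and array shapes are assumptions.

```python
import numpy as np

def min_ade_fde(gt, preds):
    """Compute (minADE, minFDE) for one agent.

    gt:    (T, 2) ground truth trajectory
    preds: (K, T, 2) predicted hypotheses
    """
    gt = np.asarray(gt, dtype=float)
    preds = np.asarray(preds, dtype=float)
    # Euclidean distance of every hypothesis to the ground truth, per time step.
    dists = np.linalg.norm(preds - gt[None], axis=-1)  # shape (K, T)
    ade = dists.mean(axis=1)   # average over time steps
    fde = dists[:, -1]         # final time step only
    return float(ade.min()), float(fde.min())
```

Note that the two minima are taken independently, so minADE and minFDE may be attained by different hypotheses.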

Additionally, following [waymoOpenMotion2021], we use a few other metrics such as Miss Rate (MR) and mean average precision (mAP). For a detailed explanation of these metrics, we refer the reader to [waymoOpenMotion2021].

### 3.3 Implementation details

Our implementation is partially based on the winning solution in the Lyft Motion Prediction Challenge [kaggle_lyft_challenge2020] by Sanakoyeu et al. [sanakoyeu2021lyft]. We use Xception71 [chollet2017xception] (up to the global average pooling) pretrained on ImageNet as a backbone. The output of our model is trajectories, each containing 2-dimensional coordinates. We train our model using AdamW [loshchilov2018decoupled] optimizing for iterations, thus using early stopping as a regularization. We use a learning rate of , weight decay and a batch size of . We also use a cosine annealing scheduler with warm restarts [loshchilov2017sgdr] every iterations, and with , . Training our model with the Xception71 [chollet2017xception] backbone took around days on a single NVIDIA V100 GPU with GB VRAM.
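For readers unfamiliar with the schedule, cosine annealing with warm restarts can be sketched as below. The sketch assumes a fixed restart period; the function name and the example values of `period`, `lr_max` and `lr_min` are hypothetical and are not the settings used in our experiments.

```python
import math

def sgdr_lr(step, period, lr_max, lr_min=0.0):
    """Learning rate at a given step for cosine annealing with warm
    restarts, assuming every cycle has the same length `period`."""
    # Position inside the current cycle; resets to 0 at every restart.
    t_cur = step % period
    cosine = math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)

# Hypothetical example: peak rate 1e-3, restart every 100 iterations.
schedule = [sgdr_lr(step, period=100, lr_max=1e-3) for step in range(300)]
```

Within each cycle the rate decays smoothly from `lr_max` towards `lr_min` and then jumps back to `lr_max` at the restart.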

### 3.4 Results

Results from the final leaderboard of the Waymo Open Dataset Motion Prediction Challenge [waymo_leaderboard_motion_21] are presented in Tab. 1. Despite the simplicity of the proposed approach, we secured 3rd place according to the mAP metric. Moreover, our model outperforms the other competing methods according to the Min ADE, Min FDE, and Overlap Rate metrics. Note that, in contrast to the methods of [ReCoAt, denseTNT2021], our simple model achieves these results without any use of advanced deep learning techniques or complex architectures.

To test a more lightweight architecture, we also trained our model using ResNet18 [resnet] as the backbone and evaluated it on the validation set (see Tab. 1). This architecture is 3x faster to train than the one with the Xception71 backbone, but it does not reach the same performance, which suggests that a sufficiently deep model is necessary for attaining good results. In Fig. 4 we show plots with train and validation loss values during training.

In Tab. 2 we also provide more detailed evaluation results for different object types separately.

## 4 Conclusion

We presented a simple yet strong baseline, MotionCNN, which is based on CNNs and produces a distribution of hypothetical trajectories for a target agent. The proposed model is straightforward to implement and easy to train. It utilizes a bird's-eye-view rasterized scene representation, which we cache as multi-channel images for faster training. We evaluated our approach on the Waymo Motion Prediction Challenge [waymo_leaderboard_motion_21], where it ranked 3rd despite being simpler than the competing methods. We hope that our work will become a solid reference point for future advancements in motion prediction.