
Learning Future Object Prediction with a Spatiotemporal Detection Transformer

We explore future object prediction – a challenging problem where all objects visible in a future video frame are to be predicted. We propose to tackle this problem end-to-end by training a detection transformer to directly output future objects. In order to make accurate predictions about the future, it is necessary to capture the dynamics in the scene, both of other objects and of the ego-camera. We extend existing detection transformers in two ways to capture the scene dynamics. First, we experiment with three different mechanisms that enable the model to spatiotemporally process multiple frames. Second, we feed ego-motion information to the model via cross-attention. We show that both of these cues substantially improve future object prediction performance. Our final approach learns to capture the dynamics and make predictions on par with an oracle for 100 ms prediction horizons, and outperform baselines for longer prediction horizons.





1 Introduction

Autonomous robots, such as self-driving vehicles, need to make predictions about the future in order to plan safely. As a robot executes an action, its surroundings change and this change needs to be estimated in order to gauge the value of the action. Several representations for forecasting have been presented in the literature, such as predicting future RGB-values [31, 37, 40, 22]; instance segmentations [30]; bird's-eye-view instance segmentations [19]; and trajectories of seen objects [41]. The latter three tasks involve making predictions about dynamic objects, such as humans, vehicles, or animals. These tasks require the ability to estimate the dynamics of the scene – a challenging problem in itself – and use that for extrapolation [49, 30, 41]. Furthermore, the tasks are inherently ambiguous and methods need to present multiple hypotheses.

One of the core computer vision problems is the detection of objects in images. It is challenging to pose this problem in a completely learning-based manner, and methods long tackled the problem indirectly [39, 38, 23], i.e., they resorted to surrogate losses and relied on post-processing steps that were not modelled during offline training to make predictions. Carion et al. [3] were the first to propose a learning formulation and neural network architecture that enabled end-to-end learning of this task. Their neural network predicted a set of object hypotheses and a differentiable objective function computed the similarity to the annotations. Several works have since improved these designs for object detection [52, 33, 9, 26, 12] and adapted them for other computer vision tasks, such as multi-object tracking [32]; 3D object detection [34]; biomedical cell instance segmentation [36]; and visual object tracking [50].

In this work we investigate making predictions about the future. The state of the surroundings in the future is represented in the same way as in object detection, by a set of confidence, class, and bounding box triplets. In order to tackle this problem, it is natural to divide it into several steps: (i) detecting objects in a sequence of frames; (ii) forming tracks and extrapolating trajectories; and (iii) anticipating new objects that might appear. None of these steps are explicit in our approach. Instead, we propose to train a neural network for this problem in an end-to-end manner. To this end, we adopt the formulation proposed by Carion et al. [3]. The neural network architectures proposed in their and in follow-up works make use of a key mechanism: cross-attention. This is the mechanism that enables an instance-level representation, originally referred to as object queries [3], to gather information from the deep feature maps extracted from the image (see Greff et al. [14] for a discourse on different methods of representation). This mechanism, due to its flexibility, provides a natural way to incorporate two additional sources of information, both crucial for enabling an accurate understanding of the dynamics of the scene. Temporal information, in the form of past images, enables the neural network to better anticipate the motion of dynamic agents. Similarly, the agent’s own motion can significantly affect the contents of future images. We learn how to take this into account by explicitly incorporating information about the ego motion, such as velocity and rotation rate, into the network.

Figure 1: We explore the problem of future object prediction, where we aim to predict all objects visible in a future frame. We design a novel method that takes a sequence of past frames as input and predicts the objects in a future frame. The method is only provided ground truth for the single future frame. Here, we show one predicted object with corresponding attention maps for the input images. Our approach accurately predicts the future pedestrian by attending to him in prior frames, in an emergent form of video tracking.
Figure 2: An input image (top left) and future object predictions, overlaid on a future, unseen image (top right), as predicted by our method. A detection transformer [3, 33] is trained end-to-end to make such predictions. The neural network is adapted to process videos in Section 3.2 and ego-motion in Section 3.3.

Contributions Our main contributions are:

  1. We investigate the problem of future object prediction based on the publicly available nuImages and nuScenes [2] datasets. Our code will be made publicly available to facilitate future work on future object prediction.

  2. We propose to directly learn the task with transformers. Our experiments demonstrate the effectiveness of such an approach.

  3. We show that attention enables the model to process spatiotemporal information and make better predictions about the future. The model is able to learn this without annotated video, relying only on a single annotated future frame.

  4. In a detailed ablation study, we investigate different spatiotemporal mechanisms. We find that sequentially cross-attending to prior frames leads to the best performance. This novel mechanism has the advantage of also having linear complexity in the number of frames processed by the network.

  5. We find that the sequential cross-attention mechanism can also be used to directly integrate ego-motion information into the network. In our experiments, we show that ego-motion information provides substantial benefits.

  6. We demonstrate that the future prediction of a large network can outperform lightweight object detectors that observe the future frame. This indicates a trade-off that should be considered in real-time systems.

2 Related Work

Future Prediction Prediction of future video frames in terms of RGB-images has been previously studied with temporal convolution [22], adversarial learning [31, 25], recurrent neural networks [37, 40, 44], parallel multi-dimensional units [1], variational recurrent neural networks [5], atoms (from system identification) [27], and spatial transformers [28]. This problem is not only interesting in its own right, but also enables unsupervised learning of video representations [25]. The works of Villegas et al. [45] and Wu et al. [49] demonstrated that such video prediction may benefit from explicit modelling of dynamic objects.

More closely related to our work is future prediction in terms of more high-level concepts, such as trajectories of dynamic objects. This task has previously been tackled using the detect-track-predict paradigm [6, 18, 42, 41]. Hu et al. [19] demonstrated the benefits of instead learning this task end-to-end. This was achieved via a differentiable warping [20] of feature maps to a reference frame with the help of ego motion. The warped feature maps were then processed with spatiotemporal 3D-convolutions. Perhaps most closely related to our work is the work of Luc et al. [30], who proposed to predict future instance segmentations in an end-to-end fashion. To this end, MaskRCNN [15] was modified with a convolutional feature forecasting module. Li et al. [24] noted that the latency of object detectors creates a need for future prediction. The focus of their work was to analyze this aspect and they found that classical tracking-based approaches outperform the end-to-end trained neural network proposed by Luc et al.

Object Detection with Transformers The work of Carion et al. [3], DETR, showed how object detection can be learnt end-to-end. They proposed to let a representation of the desired object detection output – a set of object slots – cross-attend a representation of the input – a feature map extracted from the image. A concurrent work [7] showed how the cross-attention mechanism could be used to move between different output representations used in prior works, e.g., anchor-boxes of FasterRCNN [39], corners of CornerNet [23], and center points of FCOS [43] or CenterNet [51]. Since then, several works have improved the effectiveness of transformer-based detectors. Dai et al. [9] proposed a pre-training stage. Wang et al. [47] proposed a multi-scale variant. Zhu et al. [52] replaced the attention layers with multi-head, multi-feature-level deformable convolution. Dai et al. [8] investigated another form of multi-feature-level deformable convolution. Gao et al. [12] improved the locality of the cross-attention heads via a Gaussian-like weighting. Meng et al. [33] proposed to change how the transformer makes use of its positional encodings. Amongst the most important effects of these works is the reduction in training time compared to DETR. The work of Meng et al., for instance, reports a reduction of an entire order of magnitude.

Processing Spatiotemporal Data with Transformers A number of works have extended the use of transformers into the 3-dimensional video domain. Girdhar et al. [13] let region-of-interest-pooled feature vectors cross-attend a video feature map, extracted by I3D [4]. The spatiotemporal structure, i.e., where in the image and at what point in time, is captured in the positional encodings. Jaegle et al. [21] proposed Perceiver, a general neural network architecture able to process data with different structure. Similar to the work of Girdhar et al., the spatiotemporal structure is captured via positional encodings and a cross-attention mechanism lets an output representation gather information from the input data. In their work, Perceiver is used to process 2D feature maps extracted from images, 3D feature maps extracted from videos, and point clouds. Wang et al. [48] use a similar mechanism for video instance segmentation (VIS). Duke et al. [11] experiment with spatiotemporal self-attention for video object segmentation (VOS), and propose to use local attention to reduce computational cost in long videos. In contrast to action recognition or point cloud classification, both VOS and VIS require output that is dense both spatially and in time. Meinhardt et al. [32] use a DETR-style transformer for detection and tracking. This is achieved via a recurrent connection, where the output representation in one frame is reused in the next.

In this work, we extend DETR to spatiotemporal data. We experiment with several different mechanisms able to process spatiotemporal data, as illustrated in Figure 3: joint cross-attention [13, 21, 48], recurrent cross-attention [32], and a sequential cross-attention mechanism where previous points in time are processed with different transformer layers. Furthermore, we investigate the addition of ego-motion information to the model. Instead of using ego-motion and geometry to warp feature maps as in the work of Hu et al. [19], we let a series of cross-attention layers learn how to best make use of this information.

3 Method

We aim to design and train a neural network able to make predictions about the future states of objects in the image plane. First, we formalize this task and introduce a single-frame model – based on DETR [3] – that can be trained directly for this task. Next, we extend this model to be able to process spatiotemporal data, videos, and capture the dynamics of the scene. Last, we add additional ego-motion information to our proposed neural network. Such information is often readily available from on-board sensors and we hypothesize that this is a powerful cue for the neural network to utilize.

Figure 3: Illustration of the single-frame transformer encoder (top left) and three different spatiotemporal transformers: joint attention (top right), sequential cross-attention (bottom left), and recurrent transformer (bottom right).
Figure 4: Illustration of the single-frame transformer decoder (top left) and three different spatiotemporal transformers: joint attention (top right), sequential cross-attention (bottom left), and recurrent transformer (bottom right).

3.1 Learning Future Object Prediction

In standard object detection, we are provided an image at time $t$, $I_t$, and tasked with finding the set of objects visible in that image, $\{y_t^i\}_i$, where $i$ enumerates the objects. Each object $y_t^i$ is described by a vector of class probabilities $p_t^i$ and a bounding box in the image plane $b_t^i$, so that $y_t^i = (p_t^i, b_t^i)$. In this work, we explore the problem of future object prediction. This task is similar to object detection, but instead of detecting objects in the current frame at time $t$, the aim is to predict objects visible at a future point in time. By design, we remain close to the original object detection task. One advantage of this is that we can employ methods for object detection, a well-studied problem. These methods are typically also able to model uncertainties and multiple hypotheses by reporting multiple detections with lower confidence. This uncertainty modelling may be necessary when making predictions about the future.

For future object prediction, we aim to predict a set of objects visible at time $t+1$,

$$\{\hat{y}_{t+1}^{i}\}_i = f(I_t, I_{t-1}, \dots, e_t, e_{t-1}, \dots), \tag{1}$$

based on images $I$ and ego-motion information $e$ at times $t, t-1, \dots$. As a starting point, we shall consider a method taking only a single image as input. In Section 3.2, the method is extended to take multiple images as input, and in Section 3.3, we extend it to also exploit ego-motion. For a single-frame approach, we formally have

$$\{\hat{y}_{t+1}^{i}\}_i = f(I_t). \tag{2}$$
Even though a single image does not directly capture any dynamics, the network is free to learn strong priors based on, e.g., the type of object, its rotation, and its surroundings. Consider for instance the pose of a human, or whether a vehicle is oncoming or drives in the same lane as the ego-vehicle.

In this work, we adopt the transformer-based end-to-end object detection method DETR [3]. In DETR, images are first passed through a convolutional neural network (CNN),

$$x_t = \mathrm{CNN}(I_t), \tag{3}$$

providing deep feature maps $x_t \in \mathbb{R}^{D \times H \times W}$. Here, $D$ is the feature dimension and the spatial resolution, $H \times W$, is typically downsampled relative to the original image size. Next, a transformer encoder further processes the deep feature maps,

$$z_t = \mathrm{Enc}(x_t). \tag{4}$$

The transformer has the capacity to add global context and model long-range relationships, providing a rich representation of the image, $z_t$. Finally, a transformer decoder is applied. The decoder has a learnt object slot representation $q \in \mathbb{R}^{N \times D}$, corresponding to $N$ possible objects, that is refined by successively cross-attending the deep image features,

$$o_t = \mathrm{Dec}(q, z_t). \tag{5}$$

Each slot $o_t^i$ represents a potential object expected to be visible at time $t+1$ and predicted at time $t$. The object classification scores and bounding box parameters are predicted using two small heads,

$$\hat{p}_{t+1}^{i} = h_{\mathrm{cls}}(o_t^{i}), \qquad \hat{b}_{t+1}^{i} = h_{\mathrm{box}}(o_t^{i}). \tag{6}$$

The model structure, corresponding to (3)-(6), is illustrated in Figure 2.

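In DETR, the neural network is trained end-to-end for the object detection task by first matching the predicted objects, $\{\hat{y}^i\}_i$, to annotated objects, $\{y^j\}_j$, and then, based on the matching, computing a loss. For future object prediction, we adopt the same loss, but computed against the annotated objects at time $t+1$, $\{y_{t+1}^j\}_j$, where $j$ enumerates the annotated objects. To compute the matching, a matching score is computed between each pair of predicted and annotated objects,

$$C_{ij} = \mathcal{L}^{\mathrm{match}}_{\mathrm{cls}}(\hat{p}_{t+1}^{i}, p_{t+1}^{j}) + \mathcal{L}^{\mathrm{match}}_{\mathrm{box}}(\hat{b}_{t+1}^{i}, b_{t+1}^{j}). \tag{7}$$

Essentially, the class and bounding box of each predicted object $\hat{y}_{t+1}^{i}$ are compared to those of each annotated object $y_{t+1}^{j}$. The matching scores are fed into the Hungarian algorithm, which then computes the matching. Each predicted object is assigned an index $\sigma(i)$, corresponding to a matched annotated object or to background. The objective function is then computed as

$$\mathcal{L} = \sum_{i} \Big[ \mathcal{L}_{\mathrm{cls}}(\hat{p}_{t+1}^{i}, p_{t+1}^{\sigma(i)}) + \mathbb{1}[\sigma(i) \neq \varnothing] \, \mathcal{L}_{\mathrm{box}}(\hat{b}_{t+1}^{i}, b_{t+1}^{\sigma(i)}) \Big]. \tag{8}$$

Here, $\sigma(i) = \varnothing$ corresponds to no object. We use the same matching cost functions $\mathcal{L}^{\mathrm{match}}_{\mathrm{cls}}$ and $\mathcal{L}^{\mathrm{match}}_{\mathrm{box}}$, as well as the objective loss functions $\mathcal{L}_{\mathrm{cls}}$ and $\mathcal{L}_{\mathrm{box}}$, as in Conditional DETR [33].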
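The matching and loss computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class and box terms are toy stand-ins (negative class probability, negative log-likelihood, and one minus IoU) rather than the Conditional DETR costs, the optimal assignment is found by brute force instead of the Hungarian algorithm, and all function names are hypothetical.

```python
import math
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matching_score(pred, gt):
    """Toy stand-in for the matching cost: a class term plus a box term."""
    probs, box = pred
    label, gt_box = gt
    return -probs[label] + (1.0 - iou(box, gt_box))


def match(preds, gts):
    """Return sigma, where sigma[i] is the index of the annotation matched to
    prediction i, or None for background. Assumes len(preds) >= len(gts).
    Brute force over assignments; the Hungarian algorithm finds the same
    optimum efficiently."""
    best, best_cost = (), math.inf
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(matching_score(preds[i], gts[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    sigma = [None] * len(preds)
    for j, i in enumerate(best):
        sigma[i] = j
    return sigma


def set_loss(preds, gts, sigma, no_object):
    """Class loss for every slot; box loss only for matched slots."""
    total = 0.0
    for (probs, box), j in zip(preds, sigma):
        if j is None:
            total -= math.log(probs[no_object])
        else:
            label, gt_box = gts[j]
            total += -math.log(probs[label]) + (1.0 - iou(box, gt_box))
    return total


# Hypothetical toy setup. Classes: 0 = car, 1 = pedestrian, 2 = no object.
preds = [([0.1, 0.1, 0.8], (0, 0, 1, 1)),
         ([0.7, 0.1, 0.2], (10, 10, 20, 20)),
         ([0.1, 0.8, 0.1], (30, 30, 34, 40))]
gts = [(1, (30, 30, 34, 40)), (0, (11, 10, 21, 20))]
sigma = match(preds, gts)
loss = set_loss(preds, gts, sigma, no_object=2)
```

In this toy example the first slot is left unmatched and is therefore trained towards the no-object class, while the other two slots are assigned to the pedestrian and car annotations. For future object prediction, the annotations would simply be those of the future frame.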

3.2 Capturing Dynamics with Spatiotemporal Mechanisms

Making predictions about the future requires information about the dynamics. A classical pipeline would typically detect objects in multiple frames and then create tracks from the detections in order to estimate their dynamics. Based on the estimated dynamics, predictions can be made about the future. Given only a single image, it may be possible to make good guesses about the dynamics. For instance, cars typically move along their lane approximately at the speed limit. For an animal, the pose of its body can convey the direction of movement and some information about its speed. However, using a single image to guess the dynamics cannot be done with precision and we can easily find examples where the guess is completely incorrect. For instance, a vehicle that is moving very slowly or a vehicle that is reversing. We would therefore like to capture spatiotemporal information. To this end, we extend DETR to process sequences of images.

To isolate the effect of the spatiotemporal mechanism, we keep the architecture as close to the single-frame version as possible. When the model is provided with a single image as input, it is identical to the single-frame Conditional DETR described earlier. The first question to answer is at which stage of the network information from different frames should be merged. There are three natural locations: the backbone, the encoder, and the decoder. We always assume that CNN features are extracted separately from each image and do not consider earlier temporal fusion. This is both for computational reasons and because we believe that transformers are better suited for the task of temporal fusion. This still leaves us with a number of options. For instance, we can merge the image features immediately after the CNN encoder, so that the encoder output $z_t$ is computed from the CNN features $x$ of all frames, $z_t = \mathrm{Enc}(x_t, x_{t-1}, \dots)$. Alternatively, we can keep the transformer encoder non-temporal, and instead let the transformer decoder directly make use of the encoded features from multiple frames, so that the decoder output is $o_t = \mathrm{Dec}(q, z_t, z_{t-1}, \dots)$ for object slots $q$. Naturally, we do not have to restrict ourselves to a single fusion location, and several of these approaches can be used simultaneously.

Joint Attention The next question is how we should modify the transformer encoder and decoder to make use of the additional inputs. Since transformers operate on unordered sets of tokens, one simple approach is to concatenate the different sets of tokens into a single, larger set of tokens. This means that the encoder processes the concatenated CNN features, $z_t = \mathrm{Enc}([x_t; x_{t-1}; \dots])$, and the decoder attends to the concatenated encoder outputs, $o_t = \mathrm{Dec}(q, [z_t; z_{t-1}; \dots])$. These alternatives are visualized in the top-right part of Figures 3 and 4 respectively. However, pure concatenation removes all information about which tokens belong to which frame. To resolve this issue, we add a sinusoidal temporal positional encoding, analogous to the spatial positional encoding that is already used in DETR. Instead of encoding the absolute pixel position, we encode the temporal offset between frames (in seconds). This type of fusion puts minimal restrictions on the mechanism and should in principle be the most flexible approach. However, this approach hinges on good interactions with the temporal encoding. It is also computationally expensive when applied in the encoder, since the number of token interactions scales quadratically with the number of frames.
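As a rough illustration of the temporal positional encoding described above, the following sketch encodes a temporal offset in seconds with sinusoids, adds it to every token of the corresponding frame, and concatenates the tokens of all frames into one set. The function names, the interleaved sine and cosine layout, and the temperature of 10000 are illustrative assumptions, not details taken from this work.

```python
import math


def temporal_encoding(offset_s, dim, temperature=10000.0):
    """Sinusoidal encoding of a temporal offset given in seconds.
    dim must be even; the temperature is an assumed default."""
    enc = []
    for k in range(dim // 2):
        freq = 1.0 / (temperature ** (2 * k / dim))
        enc.append(math.sin(offset_s * freq))
        enc.append(math.cos(offset_s * freq))
    return enc


def concat_frames(frames, offsets_s):
    """frames: one list of tokens per frame, each token a list of floats.
    Adds the frame's temporal encoding to each of its tokens and returns
    the tokens of all frames as a single set."""
    tokens = []
    for frame, offset in zip(frames, offsets_s):
        enc = temporal_encoding(offset, dim=len(frame[0]))
        for tok in frame:
            tokens.append([v + e for v, e in zip(tok, enc)])
    return tokens


# Two frames with three 4-dimensional tokens each: the current frame
# (offset 0 s) and a frame captured 0.5 s earlier.
current = [[0.0] * 4 for _ in range(3)]
previous = [[0.0] * 4 for _ in range(3)]
tokens = concat_frames([current, previous], offsets_s=[0.0, -0.5])
```

After concatenation, tokens that were identical across frames differ only through their temporal encoding, which is what allows attention layers to tell the frames apart.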

Sequential Cross-Attention A potentially more efficient approach is to use sequential cross-attention. Instead of attending to all tokens from all frames at once, we add additional cross-attention layers that attend to the information from previous frames. In the case of the decoder, we simply duplicate the existing cross-attention step for each temporal frame, so that a single decoder layer consists of one self-attention followed by several cross-attentions, see bottom left in Figure 4. We adopt a similar approach for the encoder, and introduce separate cross-attentions after the self-attention, see bottom left in Figure 3. This has two advantages. First, it removes the quadratic scaling of computational resources. Second, it removes the need for explicit temporal positional encodings, since the temporal dimension is directly baked into the network architecture. To the best of our knowledge, this is a novel adaptation of the popular attention mechanism. Closest in spirit is perhaps axial attention [17, 46], where each axis (height, width and time) is processed separately.
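The structure of a decoder layer with sequential cross-attention can be sketched as follows. This is a simplified illustration under several assumptions: attention is single-head with identity projections and a residual connection, whereas the actual layers have separate learnt weights per cross-attention, and the order in which frames are visited is our choice for the example. The two counting functions reflect one reading of the complexity argument for the encoder: joint attention lets all tokens from all frames interact, while the sequential variant lets the current frame attend to one frame at a time.

```python
import math


def attend(queries, keys_values):
    """Single-head scaled dot-product attention with identity projections
    and a residual connection (a simplification of a transformer layer)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        norm = sum(weights)
        ctx = [sum(w * kv[j] for w, kv in zip(weights, keys_values)) / norm
               for j in range(d)]
        out.append([qi + ci for qi, ci in zip(q, ctx)])
    return out


def sequential_decoder_layer(slots, frame_features):
    """One self-attention over the object slots, followed by one
    cross-attention per frame, visiting the frames in the given order."""
    slots = attend(slots, slots)
    for feats in frame_features:
        slots = attend(slots, feats)
    return slots


def encoder_interactions_joint(num_frames, tokens_per_frame):
    """Token pairs in joint self-attention over all frames."""
    return (num_frames * tokens_per_frame) ** 2


def encoder_interactions_sequential(num_frames, tokens_per_frame):
    """Token pairs when the current frame self-attends and then
    cross-attends to each previous frame separately."""
    return num_frames * tokens_per_frame ** 2


# Two slots attending to three frames of four 2-dimensional tokens each.
slots = [[1.0, 0.0], [0.0, 1.0]]
frames = [[[0.1 * f, 0.2 * k] for k in range(4)] for f in range(3)]
refined = sequential_decoder_layer(slots, frames)
```

With three frames of 100 tokens each, the joint count is 90000 token pairs while the sequential count is 30000, and the gap grows linearly with every additional frame.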

Recurrent Transformer We also experiment with a recurrent solution, where the final object embeddings of the previous frame are used as input to the transformer decoder, together with the new features. In principle, this allows the network to learn an explicit tracking solution, where detections in one frame are used as the starting point in the next frame. With this approach, the decoder and encoder can be expressed as $o_t = \mathrm{Dec}(q, z_t, o_{t-1})$ and $z_t = \mathrm{Enc}(x_t, z_{t-1})$, where $x_t$ are the CNN features, $z_t$ the encoder output, $q$ the object slots, and $o_t$ the decoder output at time $t$. The bottom right parts of Figures 3 and 4 show the recurrent solution, where an additional cross-attention mechanism processes the output of the encoder or decoder, respectively, from the previous time-step.

3.3 Incorporating Ego-motion Information

A major challenge in future object prediction is that the motion of dynamic objects is highly correlated with the ego-motion of the camera. Autonomous robots often move and even objects that are stationary in the three-dimensional sense could move drastically in the image plane. It is possible to rely on multiple images to estimate and compensate for the ego-motion using visual odometry (VO) and geometry. In principle, the neural network could learn VO and use it as an internal representation. However, many systems are equipped with additional sensors that capture ego-motion information. We hypothesize that such information could constitute a powerful cue for the network to use.

Encoding While the exact sensor setups and ways of extracting ego motion can vary between different robots, we believe it is reasonable to assume that one has access to at least some subset of position, velocity, acceleration, rotation and rotation rate. Some of this information may be three-dimensional, but in the case of autonomous vehicles it could be projected to the road plane. For maximum flexibility, and generalization to slightly different setups, we make minimal assumptions on which ego-motion information is available. Our only processing is to convert information into a local coordinate system. For example, global position or rotation is converted into a transformation relative to the last frame. We concatenate all available information and pass it through a 2-layer MLP to create an encoded ego-motion vector

$$e_t = \mathrm{MLP}(m_t^{1} \oplus m_t^{2} \oplus \dots \oplus m_t^{K}),$$

where $m_t^{1}, \dots, m_t^{K}$ are the available ego-motion quantities at time $t$ and $\oplus$ denotes vector concatenation.
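The encoding step can be illustrated with a small sketch. It assumes poses given as (x, y, yaw) in a global road-plane frame, a ReLU activation in the MLP, and hand-picked toy weights; the function names and these choices are hypothetical and only meant to make the conversion to a local coordinate system and the concatenation concrete.

```python
import math


def relative_pose_2d(ref_pose, pose):
    """Express pose in the coordinate frame of ref_pose.
    Both poses are (x, y, yaw) in a global road-plane frame."""
    rx, ry, ryaw = ref_pose
    x, y, yaw = pose
    dx, dy = x - rx, y - ry
    cos_r, sin_r = math.cos(ryaw), math.sin(ryaw)
    local_x = cos_r * dx + sin_r * dy
    local_y = -sin_r * dx + cos_r * dy
    # Wrap the yaw difference to the interval [-pi, pi).
    dyaw = (yaw - ryaw + math.pi) % (2 * math.pi) - math.pi
    return [local_x, local_y, dyaw]


def mlp2(x, w1, b1, w2, b2):
    """Two-layer MLP; the ReLU activation is an assumption."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


def encode_ego_motion(quantities, w1, b1, w2, b2):
    """Concatenate all available ego-motion quantities and encode them."""
    flat = [v for q in quantities for v in q]  # vector concatenation
    return mlp2(flat, w1, b1, w2, b2)


# A vehicle heading along the global y-axis that has moved 2 m forward.
rel = relative_pose_2d((1.0, 1.0, math.pi / 2), (1.0, 3.0, math.pi / 2))

# Toy weights: identity hidden layer and a summing output layer.
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 1.0, 1.0]]
b2 = [0.5]
encoded = encode_ego_motion([[1.0, 2.0], [3.0]], w1, b1, w2, b2)
```

In the relative frame, the 2 m displacement appears along the local x-axis regardless of the global heading, which is the property that makes the encoding independent of where in the world the sample was recorded.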

Fusion Mechanism The question of where to incorporate the ego-motion information into the network is largely similar to the previously studied question of where to include information from past images. Since we strive to put minimal assumptions on what information is available and how it should be used, we simply use sequential cross-attention to incorporate the encoded ego-motion vector $e_t$. This naturally allows for joint fusion with past images, where the ego-motion vector is attended alongside the image features either in the encoder, $z_t = \mathrm{Enc}(x_t, x_{t-1}, \dots, e_t)$, or in the decoder, $o_t = \mathrm{Dec}(q, z_t, z_{t-1}, \dots, e_t)$. We also experiment with a third approach where the ego-motion information is added to the CNN features $x_t$, much like the positional encodings used in attention. That is, $x_t \leftarrow x_t + e_t$.

4 Experiments

We implement the different approaches in PyTorch [35] and experiment on the NuScenes and NuImages datasets [2]. The code will be made publicly available to facilitate future research on future object prediction or on other tasks with the models used in this work. First, we introduce the DETR-based approach, trained for future object prediction, together with different spatiotemporal extensions. Second, we introduce ego-motion information to the model. Third, we compare the performance to a naïve approach, a tracking-based approach, the mechanism proposed by Luc et al. [30], and an oracle. We also provide a comparison for different prediction horizons. Last, we compare the performance to a lightweight oracle.

4.1 Experimental Setup

NuImages and NuScenes Although we need video datasets, there is no need for the entire video to be annotated. Both training and evaluation are possible with a single annotated frame per video. To this end, we experiment with NuScenes [2] and the subsequent dataset NuImages. NuImages contains samples of 13 images collected at 2 Hz; ego-motion information for each image; and object detection annotations for the 7th image. To simplify the problem we only consider front camera images, leaving us with 13066 training samples and 3215 validation samples. NuScenes is primarily intended for 3D object detection using the full sensor suite, and provides 3D annotations at 2 Hz. However, projected 2D bounding boxes are also available. NuScenes is inconsistent in the camera sampling frequency, since the 20 Hz camera is resampled to 12 Hz. During training we can be flexible with the sampling frequency, leaving us with a training set of 26k front camera samples. However, during evaluation, we discard samples which do not match the desired sampling frequency exactly. This leaves 5630 samples at 2 Hz, 5759 samples at 4 Hz, 2913 samples at 10 Hz and 2884 samples at 20 Hz. NuImages and NuScenes have two major advantages for future object prediction – advantages that are natural for datasets aimed towards autonomous robots. First, each dataset is captured with a single sensor suite, avoiding for instance issues with different cameras. Second, ego-motion information is included, which we hypothesize is a powerful cue for future object prediction.

Implementation Details We adopt ConditionalDETR [33] for our experiments due to its fast training. We use a ResNet50 [16] backbone, pre-trained on ImageNet, and the same encoder and decoder hyperparameters as in ConditionalDETR. We train the approaches for future object prediction using the same set-prediction loss function [33]. We optimize with AdamW [29], using a batch size of 16 together with weight decay. We use separate base learning rates for the backbone and for the encoder, decoder, and linear heads. The learning rates are warmed up during the first 10% of the epochs and decayed in steps when 60% and 90% of the epochs have been processed. During the first 60% of the epochs, we train with half resolution and double batch size. We train for 400 epochs on NuImages and 160 epochs on the relatively larger NuScenes.
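The shape of the training schedule described above can be written down compactly. The sketch below is an assumption-laden illustration: the warm-up is taken to be linear, the decay factor of 0.1 per step is a placeholder rather than the value used in this work, and the function names are hypothetical. Only the 10%, 60% and 90% breakpoints, the batch size of 16, and the half-resolution phase come from the text.

```python
def lr_multiplier(epoch, total_epochs, decay=0.1,
                  warmup_pct=10, milestone_pcts=(60, 90)):
    """Factor applied to a base learning rate at a 0-indexed epoch.
    Linear warm-up and the decay value are assumptions."""
    if epoch * 100 < warmup_pct * total_epochs:
        return (epoch + 1) * 100 / (warmup_pct * total_epochs)
    factor = 1.0
    for pct in milestone_pcts:
        if epoch * 100 >= pct * total_epochs:
            factor *= decay
    return factor


def train_config(epoch, total_epochs, base_batch_size=16, switch_pct=60):
    """Resolution scale and batch size: half resolution and double batch
    size during the first 60% of the epochs, full resolution afterwards."""
    if epoch * 100 < switch_pct * total_epochs:
        return 0.5, 2 * base_batch_size
    return 1.0, base_batch_size


# Schedule for the 400-epoch NuImages setting at a few selected epochs.
schedule = {e: (lr_multiplier(e, 400), train_config(e, 400))
            for e in (0, 40, 240, 360)}
```

The same multiplier would be applied to both base learning rates, so the backbone and the transformer keep their relative learning-rate ratio throughout training.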

Evaluation We aim to predict objects present in a future frame. Thus, the detectors are fed images and expected to produce detections for one time-step into the future. Performance is measured in terms of Average Precision (AP). We report both average precision at an IoU threshold of 0.5 (AP50) and average precision averaged over a range of IoU thresholds (AP). We report both an average score over all classes and scores for the car and pedestrian classes. We also report results for small, medium, and large objects, with the size cutoffs defined relative to the original image resolution.
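To make the metric concrete, the following sketch computes average precision for one class on one image. It is a simplified illustration, not the evaluation code used in this work: it uses greedy matching in order of confidence and all-point interpolation of the precision-recall curve, and the default threshold range of 0.5 to 0.95 in steps of 0.05 is an assumption borrowed from the COCO convention. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, iou_thr=0.5):
    """preds: list of (confidence, box); gts: list of boxes."""
    if not gts:
        return 0.0
    matched, hits = set(), []
    for _, box in sorted(preds, key=lambda p: -p[0]):
        best_j, best_iou = None, iou_thr
        for j, g in enumerate(gts):
            overlap = iou(box, g)
            if j not in matched and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            matched.add(best_j)
        hits.append(best_j is not None)
    # Precision and recall after each prediction, in confidence order.
    tp, precisions, recalls = 0, [], []
    for k, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / k)
        recalls.append(tp / len(gts))
    # Make precision non-increasing in recall, then integrate.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_over_thresholds(preds, gts,
                         thresholds=tuple(0.5 + 0.05 * k for k in range(10))):
    """AP averaged over several IoU thresholds (assumed COCO-style range)."""
    return sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 31))]
ap50 = average_precision(preds, gts, iou_thr=0.5)
ap = mean_over_thresholds(preds, gts)
```

In the example, the third prediction overlaps its target with an IoU of about 0.91, so it counts as correct at the 0.5 threshold but not at 0.95. This illustrates why AP is consistently lower than AP50 for future prediction, where box localization is inherently uncertain.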

4.2 Quantitative Results

Spatiotemporal Ablation In Section 3, we introduced three mechanisms that each enable the neural network to process video data. Each mechanism can be applied either in the encoder or the decoder. We compare the different alternatives and report the results in Table 1. Compared to the single-frame version, trained for future prediction but with just a single frame, all spatiotemporal variants improve performance substantially. This indicates that each of the three different mechanisms learns to capture and make use of the dynamics in the scene. The best results are achieved by sequential cross-attention placed in the decoder, obtaining 23.9 class-averaged AP50. The joint attention mechanism placed in the decoder achieves competitive performance, 23.5 class-averaged AP50. The recurrent transformer provides inferior performance. Interestingly, joint and sequential attention work best in the decoder, while the recurrent mechanism obtains the best results in the encoder. Based on these results, we adopt the sequential cross-attention mechanism and place it in the decoder.

| Method | AP50 Mean | AP50 Car | AP50 Pedestrian | AP50 Small | AP50 Medium | AP50 Large | AP Mean | AP Car | AP Pedestrian |
|---|---|---|---|---|---|---|---|---|---|
| Singleframe | 13.5 | 22.5 | 6.2 | 3.0 | 12.9 | 30.8 | 3.8 | 6.7 | 1.5 |
| Joint Attention Encoder | 22.9 | 34.8 | 10.5 | 3.8 | 23.1 | 49.2 | 7.4 | 11.8 | 2.7 |
| Joint Attention Decoder | 23.5 | 35.4 | 11.0 | 4.2 | 23.7 | 52.0 | 7.9 | 11.9 | 2.8 |
| Sequential CA Encoder | 21.9 | 35.0 | 9.7 | 4.7 | 21.9 | 48.1 | 7.1 | 11.9 | 2.5 |
| Sequential CA Decoder | 23.9 | 36.6 | 12.0 | 4.1 | 25.1 | 52.1 | 8.1 | 12.1 | 3.1 |
| Recurrent Tr. Encoder | 21.6 | 34.9 | 9.5 | 3.7 | 21.9 | 44.2 | 7.3 | 11.8 | 2.4 |
| Recurrent Tr. Decoder | 21.2 | 32.6 | 9.2 | 3.6 | 22.1 | 42.4 | 6.6 | 10.8 | 2.3 |

Table 1: Future object prediction performance for different spatiotemporal mechanisms on the NuImages validation set.

| Method | AP50 Mean | AP50 Car | AP50 Pedestrian | AP50 Small | AP50 Medium | AP50 Large | AP Mean | AP Car | AP Pedestrian |
|---|---|---|---|---|---|---|---|---|---|
| No ego-motion | 23.9 | 36.6 | 12.0 | 4.1 | 25.1 | 52.1 | 8.1 | 12.1 | 3.1 |
| Add ego-motion to features | 28.9 | 42.3 | 14.7 | 6.8 | 30.5 | 55.6 | 10.1 | 14.9 | 4.0 |
| Attend ego-motion in encoder | 28.0 | 43.2 | 15.1 | 4.7 | 29.9 | 59.3 | 9.4 | 15.5 | 3.4 |
| Attend ego-motion in decoder | 28.3 | 42.4 | 13.7 | 4.7 | 29.8 | 59.1 | 10.0 | 14.9 | 3.6 |

Table 2: Future object prediction performance on the NuImages validation set with and without ego-motion.

Ego-motion Ablation Next, we introduce ego-motion information to our spatiotemporal neural network. In Section 3.3, we described how to encode the ego-motion information and incorporate it into the network using sequential cross-attention. As an alternative, we also interpret the encoded ego-motion vector as a learnable positional encoding, and simply add it to the CNN features, as is typically done with positional encodings [3]. We evaluate the effectiveness of these approaches in Table 2. We can clearly see that ego-motion is a powerful cue, and all approaches strongly outperform the method without ego-motion. Interestingly, even the simple mechanism of interpreting it as a positional encoding works very well. We choose to prioritize performance on the Car and Pedestrian categories, and therefore adopt sequential cross-attention in the encoder in further experiments.

Quantitative Comparison We compare our single-frame and spatiotemporal approach to a naïve baseline, a tracking-based baseline, and an adaptation of prior art. The naïve baseline corresponds to the scenario where no future prediction is done. We run standard object detection on the most recent observed image ($I_t$) and treat the detections as the future prediction. The tracking-based baseline follows the baseline used in FIERY [19]: the centerpoint distance between pairs of detections in two different frames is used as cost and then the Hungarian algorithm is used to minimize the total cost. Matched objects are extrapolated linearly into the future frame. While it was not possible to directly compare against prior work, we found F2F [30] to be the most applicable. We keep our DETR-like approach but replace the transformer encoder with the F2F module, which takes features from multiple frames and learns to forecast a future feature map. Finally, the oracle acts as a best-case reference, providing the performance of a single-frame detector applied directly to the future frame.
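The tracking-based baseline can be sketched as follows. This is an illustration under stated assumptions rather than the baseline implementation itself: the minimum-cost assignment is found by brute force instead of the Hungarian algorithm, unmatched detections are simply dropped, each box coordinate is extrapolated independently, and the function names are hypothetical.

```python
import math
from itertools import permutations


def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def match_by_center(prev_boxes, curr_boxes):
    """Assignment minimizing the total center distance, as a list of
    (prev_index, curr_index) pairs. Brute force stand-in for the
    Hungarian algorithm, suitable only for a handful of boxes."""
    swap = len(prev_boxes) > len(curr_boxes)
    small, large = (curr_boxes, prev_boxes) if swap else (prev_boxes, curr_boxes)
    best, best_cost = (), math.inf
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(math.dist(center(small[i]), center(large[j]))
                   for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    pairs = list(enumerate(best))
    return [(j, i) for i, j in pairs] if swap else pairs


def extrapolate(prev_box, curr_box, steps=1.0):
    """Linear extrapolation of every box coordinate."""
    return tuple(c + steps * (c - p) for p, c in zip(prev_box, curr_box))


def predict_future(prev_boxes, curr_boxes, steps=1.0):
    """Future boxes for all matched detections."""
    return [extrapolate(prev_boxes[i], curr_boxes[j], steps)
            for i, j in match_by_center(prev_boxes, curr_boxes)]


prev_boxes = [(0, 0, 10, 10), (100, 100, 110, 110)]
curr_boxes = [(102, 100, 112, 110), (5, 0, 15, 10)]
pairs = match_by_center(prev_boxes, curr_boxes)
future = predict_future(prev_boxes, curr_boxes)
```

Here the first object moved 5 pixels to the right between the two frames and is therefore predicted another 5 pixels further right. The heuristic has no notion of ego-motion or of objects entering the scene, which is consistent with it struggling over longer prediction horizons.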

The results are shown in Table 3. The naïve approach fares very poorly, indicating large changes in the scene. The tracking solution improves the performance across the board, but still struggles. The simple matching and extrapolation heuristics are likely insufficient over the long prediction horizon. The F2F adaptation proves much stronger, in several cases more than twice as good as the tracking solution. Interestingly, despite making use of multiple frames, it is consistently beaten by our Singleframe + Ego solution. This clearly demonstrates the advantage of incorporating ego-motion. Finally, our spatiotemporal version with ego-motion outperforms the other approaches for all but the smallest objects. However, we note that there is still a substantial gap to the oracle, likely due to the stochastic nature of the future.

| Method | AP50 Mean | AP50 Car | AP50 Pedestrian | AP50 Small | AP50 Medium | AP50 Large | AP Mean | AP Car | AP Pedestrian |
|---|---|---|---|---|---|---|---|---|---|
| Oracle | 67.5 | 87.2 | 73.5 | 42.0 | 69.5 | 84.7 | 38.4 | 56.2 | 35.7 |
| Naïve | 9.8 | 14.5 | 2.5 | 3.6 | 9.4 | 18.8 | 3.0 | 4.2 | 0.6 |
| Tracking | 14.8 | 16.9 | 3.6 | 3.3 | 14.8 | 25.9 | 4.5 | 5.0 | 0.8 |
| F2F adaptation [30] | 21.6 | 35.6 | 10.8 | 3.3 | 22.0 | 42.0 | 7.2 | 12.2 | 2.7 |
| Singleframe (Ours) | 13.5 | 22.5 | 6.2 | 3.0 | 12.9 | 30.8 | 3.8 | 6.7 | 1.5 |
| Spatiotemporal (Ours) | 23.9 | 36.6 | 12.0 | 4.1 | 25.1 | 52.1 | 8.1 | 12.1 | 3.1 |
| Singleframe + Ego (Ours) | 25.6 | 38.1 | 12.8 | 5.1 | 26.9 | 54.8 | 8.9 | 13.2 | 3.4 |
| Spatiotemporal + Ego (Ours) | 28.0 | 43.2 | 14.7 | 4.7 | 29.9 | 59.3 | 9.4 | 15.5 | 3.8 |

Table 3: Performance for future object prediction on the nuImages validation set in terms of AP50 and AP (higher is better). The prediction horizon is 500 ms.
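
To make the AP50 numbers easier to interpret, the sketch below shows the matching criterion that AP50 is built on, assuming the usual COCO-style definition in which a predicted box counts as correct only if its intersection-over-union (IoU) with a ground-truth box is at least 0.5. The box format (x1, y1, x2, y2) and the function names are illustrative assumptions. The example also hints at why a stale detection, as in the naïve baseline, is penalized most for small objects: the same displacement in pixels removes a larger fraction of the overlap.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def matches_at_50(pred_box, gt_box):
    """True if the prediction would count as a hit under an IoU >= 0.5 rule."""
    return iou(pred_box, gt_box) >= 0.5


# A 10-pixel-wide object that moved 5 pixels: the stale box no longer matches.
small_stale, small_future = (0, 0, 10, 10), (5, 0, 15, 10)
# A 100-pixel-wide object that moved the same 5 pixels: still a clear match.
large_stale, large_future = (0, 0, 100, 100), (5, 0, 105, 100)

small_hit = matches_at_50(small_stale, small_future)  # False, IoU is 1/3
large_hit = matches_at_50(large_stale, large_future)  # True, IoU is about 0.90
```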

| Method | 50 ms Car | 50 ms Pedestrian | 100 ms Car | 100 ms Pedestrian | 250 ms Car | 250 ms Pedestrian | 500 ms Car | 500 ms Pedestrian |
|---|---|---|---|---|---|---|---|---|
| Oracle | 71.4 | 49.9 | 69.7 | 46.5 | 72.1 | 49.9 | 72.3 | 49.8 |
| Naïve | 70.4 | 41.7 | 64.6 | 25.6 | 47.0 | 12.1 | 27.4 | 4.9 |
| Tracking | 70.9 | 43.6 | 66.1 | 36.6 | 50.6 | 23.6 | 30.9 | 10.0 |
| Singleframe + Ego (Ours) | 71.3 | 47.7 | 69.6 | 44.1 | 62.9 | 33.3 | 50.9 | 17.8 |
| Spatiotemporal + Ego (Ours) | 72.4 | 48.3 | 70.1 | 44.2 | 65.6 | 35.8 | 54.9 | 21.5 |

Table 4: Performance for future object prediction on the nuScenes validation set in terms of AP50 (higher is better) for multiple prediction horizons.

Part of the motivation behind future prediction is to compensate for the latency induced by the object detector. For the 500 ms horizon, which is long compared to typical network latencies, we have seen that the zero-latency oracle performs much better. Table 4 compares the performance of our future predictions with the oracle for a number of prediction horizons. As expected, the gap shrinks for shorter horizons. More surprisingly, we find a threshold, near 100 ms, where the future predictions are roughly on par with the oracle. In the case of cars, we even outperform the oracle, likely thanks to the addition of ego-motion and spatiotemporal cues. These results suggest an additional element in the traditional latency-accuracy tradeoff: the future prediction horizon.
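
The shrinking gap can be read directly off Table 4. The short sketch below recomputes it for the Spatiotemporal + Ego model; the values are copied from the table, while the variable and function names are only illustrative. The gap is defined here as oracle AP50 minus predicted AP50, so a negative value means the future prediction scores higher than the oracle.

```python
# AP50 values copied from Table 4, keyed by horizon in ms: (Car, Pedestrian).
ORACLE = {50: (71.4, 49.9), 100: (69.7, 46.5), 250: (72.1, 49.9), 500: (72.3, 49.8)}
SPATIOTEMPORAL_EGO = {50: (72.4, 48.3), 100: (70.1, 44.2), 250: (65.6, 35.8), 500: (54.9, 21.5)}


def gap_to_oracle(method, oracle):
    """Per-horizon (Car, Pedestrian) gap, computed as oracle minus method."""
    return {
        horizon: tuple(round(o - m, 1) for m, o in zip(method[horizon], oracle[horizon]))
        for horizon in oracle
    }


gaps = gap_to_oracle(SPATIOTEMPORAL_EGO, ORACLE)
# gaps[50]  == (-1.0, 1.6)
# gaps[100] == (-0.4, 2.3)
# gaps[250] == (6.5, 14.1)
# gaps[500] == (17.4, 28.3)
```

The Car gap is slightly negative at 50 ms and 100 ms, while the Pedestrian gap stays positive at every horizon and grows fastest.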

Figure 5: Visualization of two future predictions with corresponding attention maps. The top row shows an object that moves swiftly across the image. The bottom row (zoomed) shows a low-visibility pedestrian that is standing still.
Figure 6: Three challenging sequences where our approach predicts the future. The left and center images are two consecutive frames, and the right-most image is the third, future frame. Based on the two left-most frames, our approach predicts the objects in the third, future frame.

4.3 Qualitative Results

In Figure 5, we visualize the attention maps for two sequences. In each sequence, we show a single predicted future object. For the predicted object, we visualize the attention maps from the object slot to different regions in the two input images. For another example, see Figure 1. In all scenarios, our approach manages to find and attend to the same object in multiple frames. Although only a single annotated frame is used during training, a rudimentary form of tracking seems to emerge. In Figure 6, we show qualitative results of our spatiotemporal approach on three challenging sequences. These three sequences exhibit ego-motion, moving objects, crowded scenes, and adverse weather conditions. Even in such challenging conditions, the approach manages to accurately predict future object states.

5 Conclusion

We have explored the task of future object prediction. A transformer-based object detector was trained for this task and adapted to process videos in order to capture the dynamics in the scene. The cross-attention mechanism provides a straightforward means to process videos, and several different mechanisms were investigated. All provided substantial performance improvements over the baseline, with sequential cross-attention leading to the best performance. Furthermore, the addition of ego-motion information was investigated. Ego-motion was successfully fed to the neural network either in a fashion similar to positional encodings or via additional cross-attention mechanisms. This yielded major performance improvements for both the single-frame model and the spatiotemporal model. We hope that this work can serve as a basis for future work on future object prediction.

Acknowledgements This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC), partially funded by the Swedish Research Council through grant agreement no. 2018-05973.


  • [1] W. Byeon, Q. Wang, R. K. Srivastava, and P. Koumoutsakos (2018) Contextvp: fully context-aware video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 753–769. Cited by: §2.
  • [2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2020) Nuscenes: a multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 11621–11631. Cited by: item 1, §4.1, §4.
  • [3] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: Figure 2, §1, §1, §2, §3.1, §3, §4.2.
  • [4] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2.
  • [5] L. Castrejon, N. Ballas, and A. Courville (2019) Improved conditional vrnns for video prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7608–7617. Cited by: §2.
  • [6] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov (2020) MultiPath: multiple probabilistic anchor trajectory hypotheses for behavior prediction. In Conference on Robot Learning, pp. 86–99. Cited by: §2.
  • [7] C. Chi, F. Wei, and H. Hu (2020) Relationnet++: bridging visual representations for object detection via transformer decoder. Advances in Neural Information Processing Systems 33, pp. 13564–13574. Cited by: §2.
  • [8] X. Dai, Y. Chen, J. Yang, P. Zhang, L. Yuan, and L. Zhang (2021) Dynamic detr: end-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2988–2997. Cited by: §2.
  • [9] Z. Dai, B. Cai, Y. Lin, and J. Chen (2021) Up-detr: unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1601–1610. Cited by: §1, §2.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.1.
  • [11] B. Duke, A. Ahmed, C. Wolf, P. Aarabi, and G. W. Taylor (2021) Sstvos: sparse spatiotemporal transformers for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5912–5921. Cited by: §2.
  • [12] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li (2021) Fast convergence of detr with spatially modulated co-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3621–3630. Cited by: §1, §2.
  • [13] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman (2019) Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 244–253. Cited by: §2, §2.
  • [14] K. Greff, S. Van Steenkiste, and J. Schmidhuber (2020) On the binding problem in artificial neural networks. arXiv preprint arXiv:2012.05208. Cited by: footnote 1.
  • [15] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §2.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [17] J. Ho, N. Kalchbrenner, D. Weissenborn, and T. Salimans (2019) Axial attention in multidimensional transformers. External Links: 1912.12180 Cited by: §3.2.
  • [18] J. Hong, B. Sapp, and J. Philbin (2019) Rules of the road: predicting driving behavior with a convolutional model of semantic interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8454–8462. Cited by: §2.
  • [19] A. Hu, Z. Murez, N. Mohan, S. Dudas, J. Hawke, V. Badrinarayanan, R. Cipolla, and A. Kendall (2021) FIERY: future instance prediction in bird’s-eye view from surround monocular cameras. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 15273–15282. Cited by: §1, §2, §2, §4.2.
  • [20] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. Advances in neural information processing systems 28. Cited by: §2.
  • [21] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira (2021) Perceiver: general perception with iterative attention. In International Conference on Machine Learning, pp. 4651–4664. Cited by: §2, §2.
  • [22] N. Kalchbrenner, A. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu (2017) Video pixel networks. In International Conference on Machine Learning, pp. 1771–1779. Cited by: §1, §2.
  • [23] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV), pp. 734–750. Cited by: §1, §2.
  • [24] M. Li, Y. Wang, and D. Ramanan (2020) Towards streaming perception. External Links: 2005.10420 Cited by: §2.
  • [25] X. Liang, L. Lee, W. Dai, and E. P. Xing (2017) Dual motion gan for future-flow embedded video prediction. In proceedings of the IEEE international conference on computer vision, pp. 1744–1752. Cited by: §2.
  • [26] F. Liu, H. Wei, W. Zhao, G. Li, J. Peng, and Z. Li (2021) WB-detr: transformer-based detector without backbone. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2979–2987. Cited by: §1.
  • [27] W. Liu, A. Sharma, O. Camps, and M. Sznaier (2018) Dyan: a dynamical atoms-based network for video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 170–185. Cited by: §2.
  • [28] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala (2017) Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4463–4471. Cited by: §2.
  • [29] I. Loshchilov and F. Hutter (2018) Fixing weight decay regularization in adam. Cited by: §4.1.
  • [30] P. Luc, C. Couprie, Y. Lecun, and J. Verbeek (2018) Predicting future instance segmentation by forecasting convolutional features. In Proceedings of the european conference on computer vision (ECCV), pp. 584–599. Cited by: §1, §2, §4.2, Table 3, §4.
  • [31] M. Mathieu, C. Couprie, and Y. LeCun (2016) Deep multi-scale video prediction beyond mean square error. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.). Cited by: §1, §2.
  • [32] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer (2021) Trackformer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702. Cited by: §1, §2, §2.
  • [33] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang (2021) Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3651–3660. Cited by: Figure 2, §1, §2, §3.1, §4.1.
  • [34] I. Misra, R. Girdhar, and A. Joulin (2021) An end-to-end transformer model for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2906–2917. Cited by: §1.
  • [35] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32. Cited by: §4.
  • [36] T. Prangemeier, C. Reich, and H. Koeppl (2020) Attention-based transformers for instance segmentation of cells in microstructures. In 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 700–707. Cited by: §1.
  • [37] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra (2014) Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604. Cited by: §1, §2.
  • [38] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §1.
  • [39] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28. Cited by: §1, §2.
  • [40] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §1, §2.
  • [41] O. Styles, V. Sanchez, and T. Guha (2020) Multiple object forecasting: predicting future object locations in diverse environments. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 690–699. Cited by: §1, §2.
  • [42] C. Tang and R. R. Salakhutdinov (2019) Multiple futures prediction. Advances in Neural Information Processing Systems 32. Cited by: §2.
  • [43] Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9627–9636. Cited by: §2.
  • [44] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017) Decomposing motion and content for natural video sequence prediction. In 5th International Conference on Learning Representations, ICLR 2017. Cited by: §2.
  • [45] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee (2017) Learning to generate long-term future via hierarchical prediction. In international conference on machine learning, pp. 3560–3569. Cited by: §2.
  • [46] H. Wang, Y. Zhu, B. Green, H. Adam, A. Yuille, and L. Chen (2020) Axial-deeplab: stand-alone axial-attention for panoptic segmentation. External Links: 2003.07853 Cited by: §3.2.
  • [47] T. Wang, L. Yuan, Y. Chen, J. Feng, and S. Yan (2021) PnP-detr: towards efficient visual analysis with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4661–4670. Cited by: §2.
  • [48] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia (2021) End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8741–8750. Cited by: §2, §2.
  • [49] Y. Wu, R. Gao, J. Park, and Q. Chen (2020) Future video synthesis with object motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5539–5548. Cited by: §1, §2.
  • [50] B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu (2021) Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10448–10457. Cited by: §1.
  • [51] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2.
  • [52] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. In International Conference on Learning Representations. Cited by: §1, §2.