End-to-End Video Object Detection with Spatial-Temporal Transformers

05/23/2021
by Lu He, et al.

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance comparable to previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the VOD pipeline, effectively removing the need for many hand-crafted feature aggregation components, e.g., optical flow, recurrent neural networks, and relation networks. Besides, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of three components: a Temporal Deformable Transformer Encoder (TDTE) to encode the multi-frame spatial details, a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder to obtain the current-frame detection results. These designs boost the strong Deformable DETR baseline by a significant margin (3...) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We hope our TransVOD can provide a new perspective for video object detection. Code will be made publicly available at https://github.com/SJTU-LuHe/TransVOD.
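
To make the idea of fusing object queries across frames more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that each frame yields a list of query vectors and that the current frame's queries attend to the reference frames' queries through ordinary scaled dot-product attention with a residual connection. The function names `softmax` and `fuse_queries` are hypothetical, and the sketch omits learned projections, multiple heads, and the deformable attention used in the actual model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_queries(current_queries, reference_queries):
    """Fuse current-frame queries with queries pooled from reference frames.

    Hypothetical helper: each current query attends over all reference
    queries (scaled dot-product attention), and the attended context is
    added back to the query as a residual. No learned weights are used.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in current_queries:
        # Similarity between this query and every reference query.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in reference_queries
        ]
        weights = softmax(scores)
        # Weighted average of the reference queries.
        context = [
            sum(w * key[d] for w, key in zip(weights, reference_queries))
            for d in range(dim)
        ]
        # Residual connection keeps the current-frame information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Two current-frame queries, three queries pooled from neighboring frames.
current = [[1.0, 0.0], [0.0, 1.0]]
references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_queries(current, references)
```

Because every query produces exactly one fused vector, the output keeps a fixed number of slots per frame, which is the property that lets DETR-style detectors skip NMS-like post-processing.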


