TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers

by Qianyu Zhou, et al.

Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing comparably to previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the VOD pipeline, removing the need for many hand-crafted feature aggregation components such as optical flow models and relation networks. Moreover, thanks to the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current frame detection results. These designs boost the strong baseline Deformable DETR by a significant margin (3 dataset. Then, we present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0 accuracy trade-off with 83.7 V100 GPU device. Code and models will be available for further research.
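To make the idea of fusing object queries across frames more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the fusion can be approximated by single-head scaled dot-product cross-attention with identity projections and a residual connection. The actual TQE is not specified in this abstract and may differ substantially (for example, learned projections, multiple heads, or feed-forward layers). The names `softmax` and `fuse_queries` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_queries(current_queries, reference_queries):
    """Fuse current-frame object queries with reference-frame queries.

    Assumption (not from the paper): plain single-head cross-attention
    with identity projections, followed by a residual connection.
    Each query is a list of floats; all queries share one dimension.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in current_queries:
        # Similarity between this query and every reference-frame query.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in reference_queries
        ]
        weights = softmax(scores)
        # Attention-weighted average of the reference-frame queries.
        context = [
            sum(w * key[d] for w, key in zip(weights, reference_queries))
            for d in range(dim)
        ]
        # Residual connection keeps the current-frame information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused
```

In this sketch, `current_queries` would hold the object queries of the frame being detected and `reference_queries` the queries pooled from neighbouring frames. The output has the same shape as `current_queries`, so it could be passed to a decoder in place of the unfused queries.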




End-to-End Video Object Detection with Spatial-Temporal Transformers

Recently, DETR and Deformable DETR have been proposed to eliminate the n...

Deformable DETR: Deformable Transformers for End-to-End Object Detection

DETR has been recently proposed to eliminate the need for many hand-desi...

Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection

The Detection Transformer (DETR) has revolutionized the design of CNN-ba...

FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors

Video object detection needs to solve feature degradation situations tha...

Video Monitoring Queries

Recent advances in video processing utilizing deep learning primitives a...

Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images

This paper takes an important step in bridging the performance gap betwe...

Task Specific Attention is one more thing you need for object detection

Various models have been proposed to solve the object detection problem....
