MODETR: Moving Object Detection with Transformers

by Eslam Mohamed, et al.

Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline. MOD is usually handled via two-stream convolutional architectures that incorporate both appearance and motion cues, without considering the inter-relations between the spatial and motion features. In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams. We propose MODETR, a Moving Object DEtection TRansformer network, comprising multi-stream transformer encoders for both spatial and motion modalities, and an object transformer decoder that produces the moving objects' bounding boxes using set predictions. The whole architecture is trained end-to-end using a bipartite loss. Several methods of incorporating motion cues with the Transformer model are explored, including two-stream RGB and Optical Flow (OF) methods, and multi-stream architectures that take advantage of sequence information. To incorporate the temporal information, we propose a new Temporal Positional Encoding (TPE) approach to extend the Spatial Positional Encoding (SPE) in DETR. We explore two architectural choices for that, balancing between speed and time. To evaluate our network, we perform the MOD task on the KITTI MOD [6] data set. Results show significant 5 the Transformer network for MOD over the state-of-the-art methods. Moreover, the proposed TPE encoding provides 10
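As a rough illustration of the set-prediction idea mentioned in the abstract, the sketch below matches predicted boxes to ground-truth boxes one-to-one so that the total matching cost is minimal. It is not the authors' implementation: the L1 box cost, the function names `l1_box_cost` and `match_predictions`, and the brute-force search over assignments are all assumptions made for readability. DETR-style models typically solve this assignment with the Hungarian algorithm and add a classification term to the cost.

```python
from itertools import permutations


def l1_box_cost(pred, target):
    """Assumed matching cost: L1 distance between two (cx, cy, w, h) boxes."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(pred_boxes, gt_boxes):
    """Find the one-to-one assignment of predictions to ground-truth boxes
    with the lowest total cost.

    Returns (assignment, total_cost), where assignment[g] is the index of the
    prediction matched to ground-truth box g. Brute force is used here only
    for clarity; it is practical for a handful of boxes at most.
    """
    if len(gt_boxes) > len(pred_boxes):
        raise ValueError("need at least as many predictions as ground-truth boxes")

    best_assignment, best_cost = (), float("inf")
    # Try every way of picking a distinct prediction for each ground-truth box.
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            l1_box_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm)
        )
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Hypothetical example: three predicted boxes, two ground-truth moving objects.
preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.9, 0.9, 0.3, 0.3)]
gts = [(0.1, 0.1, 0.12, 0.1), (0.88, 0.9, 0.3, 0.3)]
assignment, total_cost = match_predictions(preds, gts)
```

In this example the first prediction is left unmatched; in a DETR-style setup such predictions would be trained toward a "no object" class, while the box loss is computed only on the matched pairs.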


