Spatial-Temporal Memory Networks for Video Object Detection

12/18/2017
by Fanyi Xiao, et al.

We introduce Spatial-Temporal Memory Networks (STMN) for video object detection. At its core, we propose a novel Spatial-Temporal Memory module (STMM) as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables the integration of ImageNet pre-trained backbone CNN weights for both the feature stack as well as the prediction head, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. We compare our method to state-of-the-art detectors on ImageNet VID, and conduct ablative studies to dissect the contribution of our different design choices. We obtain state-of-the-art results with the VGG backbone, and competitive results with the ResNet backbone. To our knowledge, this is the first video object detector that is equipped with an explicit memory mechanism to model long-term temporal dynamics.
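The two ideas in the abstract — a recurrent memory update (STMM) and a similarity-based alignment of that memory across frames (MatchTrans) — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the neighborhood size, the softmax weighting, the gateless ReLU update, and the function names `match_trans_align` / `stmm_step` are all illustrative assumptions; the real STMM operates on backbone CNN feature maps with learned convolutional gates.

```python
import numpy as np

def match_trans_align(feat_prev, feat_cur, mem_prev, k=1):
    """MatchTrans-style alignment (sketch, not the paper's exact formulation).

    For each spatial location of the current frame, compare its feature
    vector against a (2k+1)x(2k+1) neighborhood in the previous frame and
    propagate the previous memory as a similarity-weighted (softmax) sum,
    so the memory follows object motion. All tensors are (C, H, W).
    """
    C, H, W = feat_cur.shape
    aligned = np.zeros_like(mem_prev)
    for y in range(H):
        for x in range(W):
            sims, mems = [], []
            for yy in range(max(0, y - k), min(H, y + k + 1)):
                for xx in range(max(0, x - k), min(W, x + k + 1)):
                    sims.append(feat_cur[:, y, x] @ feat_prev[:, yy, xx])
                    mems.append(mem_prev[:, yy, xx])
            w = np.exp(np.array(sims) - np.max(sims))  # stable softmax
            w /= w.sum()
            aligned[:, y, x] = (np.stack(mems, axis=0) * w[:, None]).sum(axis=0)
    return aligned

def stmm_step(feat_cur, mem_aligned, Wf, Wm):
    """One gateless recurrent memory update (sketch).

    Mixes current-frame features with the aligned memory via 1x1
    channel-mixing matrices, then applies ReLU; the ReLU (rather than
    tanh) keeps activations in the same range as a pre-trained CNN.
    """
    mix = (np.einsum('oc,chw->ohw', Wf, feat_cur)
           + np.einsum('oc,chw->ohw', Wm, mem_aligned))
    return np.maximum(mix, 0.0)
```

At inference time one would loop over frames, calling `match_trans_align` to warp the memory toward the current frame and then `stmm_step` to fold in the new observations before running the detection head on the updated memory.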


