Spatio-temporal Tubelet Feature Aggregation and Object Linking in Videos

04/01/2020
by   Daniel Cores, et al.

This paper addresses how to exploit the spatio-temporal information available in videos to improve object detection precision. We propose a two-stage object detector called FANet, based on short-term spatio-temporal feature aggregation to produce a first detection set and on long-term object linking to refine these detections. First, we generate a set of short tubelet proposals, each containing the object in N consecutive frames. Then, we aggregate RoI-pooled deep features along the tubelet using a temporal pooling operator that summarizes the information in a fixed-size output, independent of the number of input frames. On top of that, we define a double-head implementation: one head is fed with the spatio-temporally aggregated information for spatio-temporal object classification, and the other with spatial information extracted from the current frame for object localization and spatial classification. We also specialize the architecture of each head branch so that it performs better on its task given its input data. Finally, a long-term linking method builds long tubes from the previously calculated short tubelets to overcome detection errors. We have evaluated our model on the widely used ImageNet VID dataset, achieving 80.9, the new state-of-the-art result for single models. Also, on the challenging small object detection dataset USC-GRAD-STDdb, our proposal outperforms the single-frame baseline by 5.4
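To make the fixed-size property of the temporal pooling step concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which pooling operator FANet uses, so element-wise max (with mean as an alternative) is an assumption here, and the function name `temporal_pool` and the toy feature vectors are hypothetical. Each frame's RoI-pooled feature is represented as a flat list of floats.

```python
from typing import List, Sequence


def temporal_pool(frame_features: Sequence[Sequence[float]],
                  mode: str = "max") -> List[float]:
    """Collapse per-frame RoI feature vectors into one vector of the same size.

    Assumption: element-wise max or mean over the time axis. The source only
    states that the output size is independent of the number of frames.
    """
    if not frame_features:
        raise ValueError("a tubelet needs at least one frame")
    dim = len(frame_features[0])
    if any(len(f) != dim for f in frame_features):
        raise ValueError("all frames must share the same feature size")

    # zip(*...) groups the i-th feature of every frame together.
    columns = list(zip(*frame_features))
    if mode == "max":
        return [max(col) for col in columns]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode!r}")


# Toy tubelets with 2 and 4 frames, each frame holding a 3-dimensional feature.
short_tubelet = [[0.1, 0.9, 0.3], [0.4, 0.2, 0.8]]
long_tubelet = short_tubelet + [[0.7, 0.5, 0.6], [0.0, 1.0, 0.2]]

pooled_short = temporal_pool(short_tubelet)
pooled_long = temporal_pool(long_tubelet)
print(len(pooled_short), len(pooled_long))  # both equal the per-frame feature size
```

Because the reduction runs only over the time axis, a classification head placed after this step can keep a fixed input size regardless of how many frames N the tubelet spans.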

