LocFormer: Enabling Transformers to Perform Temporal Moment Localization on Long Untrimmed Videos With a Feature Sampling Approach

12/19/2021
by Cristian Rodriguez Opazo, et al.

We propose LocFormer, a Transformer-based model for video grounding that operates at a constant memory footprint regardless of the video length, i.e., the number of frames. LocFormer is designed for tasks that require processing the entire long video, and at its core lie two main contributions. First, our model incorporates a new sampling technique that splits the input feature sequence into a fixed number of sections and selects a single feature per section via a stochastic approach, yielding a feature sample set that is representative of the video content for the task at hand while keeping the memory footprint constant. Second, we propose a modular design that separates functionality, enabling us to learn an inductive bias by supervising the self-attention heads while effectively leveraging pre-trained text and video encoders. We evaluate our proposals on relevant benchmark datasets for video grounding, showing that LocFormer not only achieves excellent results, including state-of-the-art performance on YouCookII, but also that our sampling technique is more effective than competing counterparts and consistently improves the performance of prior work, by up to 3.13% in mean temporal IoU, ultimately setting a new state of the art on Charades-STA.
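The sampling idea lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration of section-based stochastic feature sampling as described above: the feature sequence is divided into a fixed number of contiguous sections and one feature is drawn per section, so the output size is constant no matter how long the video is. The function name `sample_sections`, the uniform-random draw during training, and the midpoint rule at inference are our assumptions for illustration; the paper's exact selection rule may differ.

```python
import torch


def sample_sections(features: torch.Tensor, num_sections: int,
                    training: bool = True) -> torch.Tensor:
    """Pick one feature per contiguous section of a (T, d) sequence.

    Assumes T >= num_sections. During training the index within each
    section is drawn uniformly at random (a hypothetical stand-in for
    the paper's stochastic approach); at inference the section midpoint
    is used, making the output deterministic.
    """
    T = features.size(0)
    indices = []
    for k in range(num_sections):
        start = (k * T) // num_sections       # inclusive section start
        end = ((k + 1) * T) // num_sections   # exclusive section end
        if training:
            indices.append(torch.randint(start, end, (1,)).item())
        else:
            indices.append((start + end - 1) // 2)
    return features[torch.tensor(indices)]    # shape: (num_sections, d)


# Example: a long video encoded as 1200 clip features of dimension 512
# collapses to a fixed 128 x 512 sample set, independent of video length,
# which is what keeps the downstream Transformer's memory footprint constant.
feats = torch.randn(1200, 512)
sampled = sample_sections(feats, num_sections=128)
assert sampled.shape == (128, 512)
```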


