TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

11/23/2020
by Humam Alwassel, et al.

Understanding videos is a challenging problem in computer vision. In particular, the large memory footprint of an untrimmed video makes most tasks infeasible to train end-to-end without dropping part of the input data. To cope with the memory limitations of commodity GPUs, current video localization models encode videos in an offline fashion. Even though these encoders are learned, they are typically trained for action classification at the frame or clip level. Since it is difficult to finetune these encoders for other video tasks, they may be sub-optimal for temporal localization tasks. In this work, we propose a novel, supervised pretraining paradigm for clip-level video representations that not only trains to classify activities, but also considers background clips and global video information to gain temporal sensitivity. Extensive experiments show that features extracted by clip-level encoders trained with our novel pretraining task are more discriminative for several temporal localization tasks. Specifically, we show that using our newly trained features with state-of-the-art methods significantly improves performance on three tasks: Temporal Action Localization (+1.72 and +4.4 average mAP on ActivityNet and THUMOS14, respectively), Action Proposal Generation, and Dense Video Captioning (+0.31 METEOR on ActivityNet Captions). We believe video feature encoding is an important building block for many video algorithms, and extracting meaningful features should be of paramount importance in the effort to build more accurate models.
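To make the pretraining idea concrete, below is a minimal PyTorch sketch of a clip encoder trained with two objectives: an action-classification head on foreground clips and a temporal-region head that sees each clip feature together with a global video feature. All names (TSPHead, action_head, region_head, tsp_loss) and details (a two-way foreground/background head, a max-pooled global video feature, summed cross-entropy losses) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSPHead(nn.Module):
    """Sketch of a temporally-sensitive pretraining setup (assumed design).

    `encoder` maps a batch of clips to (B, feat_dim) features. The global
    video feature (GVF) is assumed here to be a max-pool over the clip
    features of the whole untrimmed video.
    """
    def __init__(self, encoder, feat_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        self.action_head = nn.Linear(feat_dim, num_classes)  # which action?
        self.region_head = nn.Linear(2 * feat_dim, 2)        # inside vs. outside an action

    def forward(self, clips, video_clips):
        f = self.encoder(clips)                              # (B, D) per-clip features
        with torch.no_grad():                                # GVF treated as context, not backpropagated
            gvf = self.encoder(video_clips).max(dim=0).values  # (D,) global video feature
        gvf = gvf.expand(f.size(0), -1)                      # broadcast GVF to every clip
        action_logits = self.action_head(f)
        region_logits = self.region_head(torch.cat([f, gvf], dim=1))
        return action_logits, region_logits

def tsp_loss(action_logits, region_logits, action_labels, is_foreground):
    # Temporal-region loss on all clips: foreground vs. background.
    region_loss = F.cross_entropy(region_logits, is_foreground.long())
    # Action labels only exist for foreground clips, so mask the batch.
    if is_foreground.any():
        action_loss = F.cross_entropy(action_logits[is_foreground],
                                      action_labels[is_foreground])
    else:
        action_loss = action_logits.new_zeros(())
    return action_loss + region_loss
```

The key departure from standard clip-classification pretraining is that background clips contribute a supervised signal (via the region head) instead of being discarded, and the global video feature gives each clip context about the rest of the video; both choices are what push the learned features toward temporal sensitivity.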


Related research

03/25/2022 · Unsupervised Pre-training for Temporal Action Localization Tasks
Unsupervised video representation learning has made remarkable achieveme...

07/18/2018 · Video Time: Properties, Encoders and Evaluation
Time-aware encoding of frame sequences in a video is a fundamental probl...

01/03/2023 · Ego-Only: Egocentric Action Detection without Exocentric Pretraining
We present Ego-Only, the first training pipeline that enables state-of-t...

04/26/2022 · Contrastive Language-Action Pre-training for Temporal Localization
Long-form video understanding requires designing approaches that are abl...

02/14/2019 · Exploring Frame Segmentation Networks for Temporal Action Localization
Temporal action localization is an important task of computer vision. Th...

06/16/2021 · Temporal Convolution Networks with Positional Encoding for Evoked Expression Estimation
This paper presents an approach for Evoked Expressions from Videos (EEV)...
