
TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

by Humam Alwassel, et al.

Understanding untrimmed videos is a core challenge in computer vision. In particular, the large memory footprint of an untrimmed video makes most tasks infeasible to train end-to-end without dropping part of the input data. To cope with the memory limitations of commodity GPUs, current video localization models encode videos offline. Although these encoders are learned, they are typically trained for action classification at the frame or clip level. Because it is difficult to finetune these encoders for other video tasks, they may be sub-optimal for temporal localization. In this work, we propose a novel supervised pretraining paradigm for clip-level video representations that not only trains to classify activities, but also considers background clips and global video information to gain temporal sensitivity. Extensive experiments show that features extracted by clip-level encoders trained with our novel pretraining task are more discriminative for several temporal localization tasks. Specifically, we show that using our newly trained features with state-of-the-art methods significantly improves performance on three tasks: Temporal Action Localization (+1.72 on ActivityNet), Action Proposal Generation (+4.4 on ActivityNet), and Dense Video Captioning (+0.31 on ActivityNet Captions). We believe video feature encoding is an important building block for many video algorithms, and extracting meaningful features should be of paramount importance in the effort to build more accurate models.
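The abstract's key idea, training a clip encoder on both an action label and a foreground/background signal informed by a global video summary, can be illustrated with a minimal numpy sketch. All dimensions, weight matrices, labels, and the max-pooled global feature below are toy placeholders chosen for illustration, not the paper's actual architecture or configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
D = 16   # clip feature dimension
C = 5    # number of action classes
N = 8    # clips sampled from one untrimmed video

def softmax_xent(logits, label):
    """Cross-entropy of one example against an integer class label."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label])

# Stand-in for clip-level encoder outputs (e.g. from a 3D CNN backbone).
clip_feats = rng.normal(size=(N, D))

# Global video feature: max-pool clip features over time, a cheap summary
# of the whole untrimmed video that gives each clip temporal context.
gvf = clip_feats.max(axis=0)

# Two linear heads with random placeholder weights: an action classifier,
# and a binary "is this clip inside an action?" head that also sees the
# global video feature.
W_action = rng.normal(size=(D, C))
W_region = rng.normal(size=(2 * D, 2))

action_labels = rng.integers(0, C, size=N)   # dummy foreground labels
is_foreground = rng.integers(0, 2, size=N)   # 1 = clip overlaps an action

loss = 0.0
for i in range(N):
    # Every clip, background or not, trains the temporal-region head.
    region_in = np.concatenate([clip_feats[i], gvf])
    loss += softmax_xent(region_in @ W_region, is_foreground[i])
    if is_foreground[i]:
        # Only foreground clips carry an action label, so background
        # clips are excluded from the classification term.
        loss += softmax_xent(clip_feats[i] @ W_action, action_labels[i])

loss /= N
print(loss)
```

The point of the sketch is the objective's shape: background clips still contribute supervision through the region head, and the concatenated global feature lets that head reason about a clip relative to the rest of the video, which is what makes the learned features temporally sensitive.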



