Contrastive Language-Action Pre-training for Temporal Localization

04/26/2022
by   Mengmeng Xu, et al.

Long-form video understanding requires designing approaches that can temporally localize activities or language. End-to-end training for such tasks is limited by compute-device memory constraints and the lack of large-scale temporal annotations. These limitations can be addressed by pre-training on large datasets of temporally trimmed videos supervised by class annotations. Once the video encoder is pre-trained, it is common practice to freeze it during fine-tuning. As a result, the video encoder does not learn temporal boundaries or unseen classes, causing a domain gap with respect to the downstream tasks. Moreover, using temporally trimmed videos prevents capturing the relations between different action categories and the background context in a video clip, which limits generalization capacity. To address these limitations, we propose a novel post-pre-training approach that leverages language without freezing the video encoder. We introduce a masked contrastive learning loss to capture visio-linguistic relations between activities, background video clips, and language in the form of captions. Our experiments show that the proposed approach improves the state-of-the-art on temporal action localization, few-shot temporal action localization, and video-language grounding tasks.
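To make the core idea concrete, the sketch below shows a minimal masked InfoNCE-style contrastive loss between clip and caption embeddings. This is an illustration only, not the paper's exact formulation: the function name, the masking convention (a boolean matrix marking which clip-caption pairs participate in the loss, e.g. to exclude background clips from the negatives), and the temperature value are all assumptions.

```python
import numpy as np

def masked_contrastive_loss(video_emb, text_emb, mask, temperature=0.07):
    """Illustrative masked InfoNCE loss (not the paper's exact objective).

    video_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    mask: (N, N) boolean array; True where a clip-caption pair may act as
    a candidate in the softmax (the diagonal positives must be True).
    """
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (N, N) similarity matrix
    logits = np.where(mask, logits, -1e9)   # masked pairs contribute ~0
    # Softmax cross-entropy with the diagonal (matched pairs) as positives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When captions are embedded close to their matching clips, the diagonal dominates each row's softmax and the loss approaches zero; masking lets clips without meaningful captions (e.g. pure background) be excluded rather than pushed apart as false negatives.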

Related research

11/21/2020 · Boundary-sensitive Pre-training for Temporal Localization in Videos
07/21/2022 · LocVTP: Video-Text Pre-training for Temporal Localization
09/02/2022 · Temporal Contrastive Learning with Curriculum
04/19/2023 · EC^2: Emergent Communication for Embodied Control
11/25/2022 · Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
03/28/2021 · Low-Fidelity End-to-End Video Encoder Pre-training for Temporal Action Localization
11/23/2020 · TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks
