Learning Spatiotemporal Features via Video and Text Pair Discrimination

01/16/2020
by Tianhao Li, et al.

Current video representations rely heavily on learning from manually annotated video datasets. However, acquiring a large-scale, well-labeled video dataset is expensive and time-consuming. We observe that videos are naturally accompanied by abundant text information, such as YouTube titles and movie scripts. In this paper, we leverage this visual-textual connection to learn effective spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a clip and its associated text, and adopt the noise-contrastive estimation (NCE) technique to tackle the computational issues imposed by the huge number of pair-instance classes. Specifically, we investigate the CPD framework with two sources of video-text pairs, and design a practical curriculum learning strategy to train the CPD. Without further fine-tuning, the learned models obtain competitive results for action classification on the Kinetics dataset under the common linear classification protocol. Moreover, our visual model provides a very effective initialization for fine-tuning on downstream task datasets. Experimental results demonstrate that our weakly-supervised pre-training yields a remarkable performance gain for action recognition on the UCF101 and HMDB51 datasets, compared with state-of-the-art self-supervised training methods. In addition, our CPD model sets a new state of the art for zero-shot action recognition on UCF101 by directly utilizing the learned visual-textual embedding.
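To make the pair-discrimination objective concrete, the sketch below implements a symmetric InfoNCE-style contrastive loss between clip and text embeddings. It is an illustrative approximation under stated assumptions, not the authors' implementation: the paper formulates CPD as pair-instance discrimination trained with noise-contrastive estimation over a large set of pair classes, whereas this sketch uses in-batch negatives; the function name cpd_loss and the temperature value are also assumptions.

```python
# Minimal sketch of cross-modal pair discrimination (CPD) using an
# InfoNCE-style loss with in-batch negatives. Encoder architectures,
# the temperature, and the negative-sampling scheme are assumptions,
# not the paper's exact NCE formulation.
import torch
import torch.nn.functional as F

def cpd_loss(clip_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of (clip, text) pairs.

    clip_emb, text_emb: (N, D) embeddings from the video and text
    encoders. The i-th clip and i-th text form the positive pair;
    the other N-1 texts (clips) in the batch act as negatives.
    """
    clip_emb = F.normalize(clip_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = clip_emb @ text_emb.t() / temperature   # (N, N) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)      # clip -> text
    loss_t2v = F.cross_entropy(logits.t(), targets)  # text -> clip
    return 0.5 * (loss_v2t + loss_t2v)
```

The same embedding space supports the zero-shot use described in the abstract: encode each class name with the text encoder and assign a clip to the class whose text embedding has the highest cosine similarity, e.g. scores = clip_emb @ class_text_emb.t() followed by an argmax over classes.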


Related research:

03/09/2017 · UntrimmedNets for Weakly Supervised Action Recognition and Detection
Current action recognition methods heavily rely on trimmed videos for mo...

06/21/2022 · Bi-Calibration Networks for Weakly-Supervised Video Representation Learning
The leverage of large volumes of web videos paired with the searched que...

06/30/2018 · Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization
There is a natural correlation between the visual and auditive elements ...

07/16/2022 · LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training
Generating representations of video data is of key importance in advanci...

01/13/2022 · BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions
Pre-training a model to learn transferable video-text representation for...

07/29/2020 · Learning Video Representations from Textual Web Supervision
Videos found on the Internet are paired with pieces of text, such as tit...

01/11/2021 · Learning from Weakly-labeled Web Videos via Exploring Sub-Concepts
Learning visual knowledge from massive weakly-labeled web videos has att...
