Large-scale weakly-supervised pre-training for video action recognition

05/02/2019
by   Deepti Ghadiyaram, et al.

Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders progress toward advanced video architectures. This paper presents an in-depth study of using large volumes of web videos to pre-train video models for the task of action recognition. Our primary empirical finding is that pre-training at very large scale (over 65 million videos), despite relying on noisy social-media videos and hashtags, substantially improves the state-of-the-art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos than in short videos; since action labels are provided at the video level, how should one choose video clips for best performance, given a fixed budget on the number or total minutes of videos?
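To make the first question concrete, here is a minimal, hypothetical sketch of deriving a verb-object label space from noisy hashtags. The hashtags, the verb lexicon, and the parsing rule are all invented for illustration and are not the paper's actual pipeline.

```python
# Hypothetical sketch (not the paper's code): building a verb-object
# pre-training label space from noisy social-media hashtags.
from collections import Counter

# Toy hashtag stream from weakly-labeled videos (invented examples).
hashtags = ["#playguitar", "#playpiano", "#ridebike",
            "#playguitar", "#cutvegetables"]

# Assumed tiny verb lexicon; a real system would need far more care
# (morphology, multi-word objects, filtering non-action tags, etc.).
VERBS = ("play", "ride", "cut")

def to_verb_object(tag):
    """Split a hashtag into a (verb, object) pair, or None if no verb matches."""
    word = tag.lstrip("#")
    for v in VERBS:
        if word.startswith(v):
            return (v, word[len(v):])
    return None

# Count each (verb, object) pair to form the candidate label space.
label_counts = Counter(p for t in hashtags if (p := to_verb_object(t)) is not None)
print(label_counts.most_common())
```

In practice one would then threshold `label_counts` to keep only frequent, reliable verb-object pairs as pre-training classes.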

Related research

- 07/21/2020 — Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos: Despite the recent advances in video classification, progress in spatio-...
- 05/22/2017 — Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset: The paucity of videos in current action classification datasets (UCF-101...
- 01/11/2021 — Learning from Weakly-labeled Web Videos via Exploring Sub-Concepts: Learning visual knowledge from massive weakly-labeled web videos has att...
- 11/05/2018 — Leveraging Random Label Memorization for Unsupervised Pre-Training: We present a novel approach to leverage large unlabeled datasets by pre-...
- 05/04/2020 — Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video: In this paper, we tackle the problem of egocentric action anticipation, ...
- 04/16/2021 — Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos: We introduce an approach for pre-training egocentric video models using ...
- 11/27/2019 — AdapNet: Adaptability Decomposing Encoder-Decoder Network for Weakly Supervised Action Recognition and Localization: The point process is a solid framework to model sequential data, such as...
