Learning Video Representations from Textual Web Supervision

07/29/2020
by Jonathan C. Stroud, et al.

Videos on the Internet are paired with pieces of text, such as titles and descriptions. This text typically describes the most important content in the video, such as the objects in the scene and the actions being performed. Based on this observation, we propose to use such text as a source of supervision for learning video representations. To accomplish this, we propose a data collection process and use it to collect 70M video clips shared publicly on the Internet, and we then train a model to pair each video with its associated text. We fine-tune the model on several downstream action recognition tasks, including Kinetics, HMDB-51, and UCF-101. We find that this approach is an effective method of pretraining video representations. Specifically, it leads to improvements over from-scratch training on all benchmarks, outperforms many methods for self-supervised and webly-supervised video representation learning, and achieves an improvement of 2.2
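
To make the idea of training a model to pair each video with its associated text more concrete, here is a minimal sketch of one common way such a pairing objective can be written: a symmetric contrastive loss over a batch of video and text embeddings. This is an assumption for illustration only; the abstract does not specify the loss, the encoders, or the temperature, and the function name `pairing_loss` and the toy embeddings are hypothetical.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of similarity scores.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def pairing_loss(video_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss (an illustrative assumption, not the paper's
    stated objective). Item i of video_emb is assumed to match item i of text_emb."""
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    n = len(v)
    # sim[i][j]: scaled cosine similarity between video i and text j.
    sim = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t] for vi in v]
    sim_t = [list(col) for col in zip(*sim)]
    # Video-to-text: each video should score its own text highest.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video: each text should score its own video highest.
    t2v = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (v2t + t2v)

# Toy 2-D embeddings standing in for encoder outputs (hypothetical values).
video = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = pairing_loss(video, matched_text)
loss_swapped = pairing_loss(video, swapped_text)
```

In this toy setup the loss is lower when each video is aligned with its own text than when the texts are swapped, which is the behavior a pairing objective of this kind is meant to reward.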

research
12/13/2019

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Annotating videos is cumbersome, expensive and not scalable. Yet, many s...
research
11/14/2020

ActBERT: Learning Global-Local Video-Text Representations

In this paper, we introduce ActBERT for self-supervised learning of join...
research
12/11/2021

Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity

Recent self-supervised video representation learning methods have found ...
research
11/25/2019

Oops! Predicting Unintentional Action in Video

From just a short glance at a video, we can often tell whether a person'...
research
01/16/2020

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Current video representations heavily rely on learning from manually ann...
research
06/21/2022

Bi-Calibration Networks for Weakly-Supervised Video Representation Learning

The leverage of large volumes of web videos paired with the searched que...
research
01/11/2022

Boosting Video Representation Learning with Multi-Faceted Integration

Video content is multifaceted, consisting of objects, scenes, interactio...
