Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube

04/29/2020
by Jack Hessel et al.

Pretraining from unlabelled web videos has quickly become the de facto means of achieving high performance on many video understanding tasks. Features are learned via prediction of grounded relationships between visual content and automatic speech recognition (ASR) tokens. However, prior pretraining work has been limited to instructional videos, a domain that, a priori, we expect to be relatively "easy": speakers in instructional videos will often reference the literal objects/actions being depicted. Because instructional videos make up only a fraction of the web's diverse video content, we ask: can similar models be trained on broader corpora? And, if so, what types of videos are "grounded" and what types are not? We examine the diverse YouTube8M corpus, first verifying that it contains many non-instructional videos via crowd labeling. We pretrain a representative model on YouTube8M and study its success and failure cases. We find that visual-textual grounding is indeed possible across previously unexplored video categories, and that pretraining on a more diverse set still results in representations that generalize to both non-instructional and instructional domains.
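The abstract names the pretraining signal (predicting grounded relationships between clips and their co-occurring ASR tokens) without spelling out an objective. As a minimal sketch only, assuming a symmetric contrastive (InfoNCE-style) alignment loss, which is one common way such video-text grounding models are trained; the function name, feature dimension, and temperature value here are illustrative, not taken from the paper:

```python
# Hypothetical sketch (not the authors' code): a symmetric InfoNCE loss that
# pulls each clip embedding toward the embedding of its co-occurring ASR
# segment and pushes it away from the other (clip, segment) pairs in the batch.
import torch
import torch.nn.functional as F

def video_asr_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """video_emb, text_emb: (batch, dim) features for matched
    (clip, ASR segment) pairs; other batch pairs act as negatives."""
    video_emb = F.normalize(video_emb, dim=-1)  # L2-normalise both modalities
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = matches
    # Symmetric cross-entropy: video-to-text and text-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: random features for 8 matched pairs.
v = torch.randn(8, 512)
t = torch.randn(8, 512)
print(video_asr_contrastive_loss(v, t).item())
```

Each clip in a batch is paired with its own ASR segment, and every other pairing serves as an in-batch negative, so no external negative sampling is needed.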


Related research

09/11/2023 · SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus
Multi-Modal automatic speech recognition (ASR) techniques aim to leverag...

01/03/2023 · Ego-Only: Egocentric Action Detection without Exocentric Pretraining
We present Ego-Only, the first training pipeline that enables state-of-t...

07/10/2020 · AViD Dataset: Anonymized Videos from Diverse Countries
We introduce a new public video dataset for action recognition: Anonymiz...

04/20/2023 · Movie Box Office Prediction With Self-Supervised and Visually Grounded Pretraining
Investments in movie production are associated with a high level of risk...

12/01/2017 · Visual Features for Context-Aware Speech Recognition
Automatic transcriptions of consumer-generated multi-media content such ...

10/07/2019 · A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions
Instructional videos get high-traffic on video sharing platforms, and pr...

07/09/2019 · Transfer Learning from Audio-Visual Grounding to Speech Recognition
Transfer learning aims to reduce the amount of data required to excel at...
