HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

06/07/2019
by Antoine Miech, et al.

Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and are therefore difficult to obtain at large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are threefold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast and scalable, and it does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 and CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models will be publicly available at: www.di.ens.fr/willow/research/howto100m/.
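To make the text-to-video retrieval setting concrete, the following is a minimal sketch of how a joint text-video embedding is typically used at query time: captions and clips are mapped to vectors in a shared space, and clips are ranked by cosine similarity to the query. It is not the authors' model or code. The function names (`cosine`, `rank_videos`, `recall_at_k`) and the toy three-dimensional vectors are hypothetical and stand in for the outputs of learned text and video encoders, and pairing caption i with clip i is an assumption made for the toy evaluation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_videos(text_vec, video_vecs):
    """Return video indices sorted from most to least similar to the text query."""
    scores = [cosine(text_vec, v) for v in video_vecs]
    return sorted(range(len(video_vecs)), key=lambda i: scores[i], reverse=True)

def recall_at_k(text_vecs, video_vecs, k):
    """Fraction of queries whose paired video (same index) is ranked in the top k."""
    hits = sum(
        1 for i, t in enumerate(text_vecs)
        if i in rank_videos(t, video_vecs)[:k]
    )
    return hits / len(text_vecs)

# Hypothetical toy embeddings: caption i is assumed to describe clip i.
video_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_vecs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.6, 0.0, 0.5]]

print(rank_videos(text_vecs[2], video_vecs))
print(recall_at_k(text_vecs, video_vecs, k=1))
print(recall_at_k(text_vecs, video_vecs, k=2))
```

In this toy data the third caption sits slightly closer to the first clip than to its own, so it counts as a miss at k=1 and a hit at k=2. Recall at k is a common way to report retrieval quality on benchmarks of this kind.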


