AVLnet: Learning Audio-Visual Language Representations from Instructional Videos

06/16/2020
by   Andrew Rouditchenko, et al.
14

Current methods for learning visually grounded language from videos often rely on time-consuming and expensive data collection, such as human annotated textual summaries or machine generated automatic speech recognition transcripts. In this work, we introduce Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. We circumvent the need for annotation and instead learn audio-visual language representations directly from randomly segmented video clips and their raw audio waveforms. We train AVLnet on publicly available instructional videos and evaluate our model on video clip and language retrieval tasks on three video datasets. Our proposed model outperforms several state-of-the-art text-video baselines by up to 11.8 video clip retrieval task, despite operating on the raw audio instead of manually annotated text captions. Further, we show AVLnet is capable of integrating textual information, increasing its modularity and improving performance by up to 20.3 perform analysis of AVLnet's learned representations, showing our model has learned to relate visual objects with salient words and natural sounds.

READ FULL TEXT

page 6

page 8

page 15

page 16

page 17

research
06/07/2019

HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Learning text-video embeddings usually requires a dataset of video clips...
research
12/27/2021

Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech

Learning to understand grounded language, which connects natural languag...
research
06/13/2019

Grounding Object Detections With Transcriptions

A vast amount of audio-visual data is available on the Internet thanks t...
research
11/08/2021

Cascaded Multilingual Audio-Visual Learning from Videos

In this paper, we explore self-supervised audio-visual models that learn...
research
04/10/2018

Imagine This! Scripts to Compositions to Videos

Imagining a scene described in natural language with realistic layout an...
research
06/08/2022

Words are all you need? Capturing human sensory similarity with textual descriptors

Recent advances in multimodal training use textual descriptions to signi...
research
06/04/2020

CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning

More than half of the 7,000 languages in the world are in imminent dange...

Please sign up or login with your details

Forgot password? Click here to reset