Cascaded Multilingual Audio-Visual Learning from Videos

11/08/2021
by   Andrew Rouditchenko, et al.

In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were trained and evaluated only on English videos. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show a nearly 10x improvement in retrieval performance compared to training on the Japanese videos alone. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.
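The retrieval performance mentioned above is typically measured by ranking video candidates against each audio query in a shared embedding space. As a rough illustration (not the authors' code), the sketch below computes Recall@K from paired audio and video embeddings, assuming cosine similarity over L2-normalized vectors; the array shapes and function name are our own.

```python
import numpy as np

def recall_at_k(audio_emb, video_emb, k=10):
    """Fraction of audio queries whose paired video ranks in the top-k.

    audio_emb, video_emb: (N, D) arrays; row i of each is a matched pair.
    Rows are L2-normalized so the dot product equals cosine similarity.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = a @ v.T                    # (N, N) cross-modal similarity matrix
    diag = np.diag(sim)              # score of each query's true match
    # rank of the true match = number of candidates scoring at least as high
    ranks = (sim >= diag[:, None]).sum(axis=1)   # 1 is the best possible rank
    return float((ranks <= k).mean())

# sanity check: identical embeddings give perfect retrieval
rng = np.random.default_rng(0)
emb = rng.standard_normal((100, 16))
print(recall_at_k(emb, emb, k=1))
```

Under this metric, the paper's reported gain means the English-pretrained model applied to Japanese data ranks the correct video roughly an order of magnitude more often than a model trained from scratch on the Japanese videos.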
