Look, Listen and Learn

05/23/2017
by   Relja Arandjelović, et al.
0

We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos? There is a valuable, but so far untapped, source of information contained in the video itself -- the correspondence between the visual and the audio streams, and we introduce a novel "Audio-Visual Correspondence" learning task that makes use of this. Training visual and audio networks from scratch, without any additional supervision other than the raw unconstrained videos themselves, is shown to successfully solve this task, and, more interestingly, result in good visual and audio representations. These features set the new state-of-the-art on two sound classification benchmarks, and perform on par with the state-of-the-art self-supervised approaches on ImageNet classification. We also demonstrate that the network is able to localize objects in both modalities, as well as perform fine-grained recognition tasks.

READ FULL TEXT

page 3

page 6

page 8

page 9

page 10

page 11

page 14

research
06/29/2020

Self-Supervised MultiModal Versatile Networks

Videos are a rich source of multi-modal supervision. In this work, we le...
research
09/19/2019

Look, Read and Enrich. Learning from Scientific Figures and their Captions

Compared to natural images, understanding scientific figures is particul...
research
09/14/2020

Themes Informed Audio-visual Correspondence Learning

The applications of short-term user-generated video (UGV), such as Snapc...
research
05/12/2023

Hear to Segment: Unmixing the Audio to Guide the Semantic Segmentation

In this paper, we focus on a recently proposed novel task called Audio-V...
research
12/15/2022

MAViL: Masked Audio-Video Learners

We present Masked Audio-Video Learners (MAViL) to train audio-visual rep...
research
12/22/2021

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Audiovisual scenes are pervasive in our daily life. It is commonplace fo...
research
12/18/2020

Temporal Bilinear Encoding Network of Audio-Visual Features at Low Sampling Rates

Current deep learning based video classification architectures are typic...

Please sign up or login with your details

Forgot password? Click here to reset