Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning

01/26/2021
by Sangho Lee, et al.

Large-scale datasets are the cornerstone of self-supervised representation learning. Existing algorithms extract learning signals by making certain assumptions about the data, e.g., spatio-temporal continuity and multimodal correspondence. Unfortunately, finding a large amount of data that satisfies such assumptions is not always straightforward. This forces the community to rely on datasets that require laborious annotation and/or manual filtering. In this paper, we describe a subset optimization approach for automatic dataset curation. Focusing on audio-visual representation learning, we pose the problem as finding a subset that maximizes the mutual information between the audio and visual channels in videos. We demonstrate that our approach finds videos with high audio-visual correspondence, and show that self-supervised models trained on our data, although it is constructed automatically, achieve downstream performance similar to that of models trained on existing video datasets of similar scale. The most significant benefit of our approach is scalability. We release the largest video dataset for audio-visual research collected automatically using our approach.
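
To make the subset-selection objective concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how mutual information is estimated or how the subset is optimized. The sketch assumes each video has already been assigned a discrete audio cluster label and a discrete visual cluster label (a hypothetical preprocessing step), and it uses a simple greedy search as a stand-in optimizer. The names `mutual_information` and `greedy_subset` are illustrative only.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) between the audio and visual
    labels in a list of (audio_label, visual_label) pairs."""
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                  # counts of (audio, visual) pairs
    audio = Counter(a for a, _ in pairs)    # marginal counts of audio labels
    visual = Counter(v for _, v in pairs)   # marginal counts of visual labels
    mi = 0.0
    for (a, v), c in joint.items():
        # p(a, v) * log( p(a, v) / (p(a) * p(v)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (audio[a] * visual[v]))
    return mi


def greedy_subset(labels, k):
    """Greedily pick k video indices so that the selected label pairs have
    high empirical mutual information. A heuristic, assumed for illustration."""
    chosen = []
    remaining = set(range(len(labels)))
    while len(chosen) < k and remaining:
        current = [labels[i] for i in chosen]
        # Add the video that yields the highest MI for the enlarged subset.
        best = max(
            sorted(remaining),
            key=lambda i: mutual_information(current + [labels[i]]),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Toy data: videos 0, 1, 4, 5 have matching audio/visual clusters;
# videos 2 and 3 are mismatched.
labels = [(0, 0), (1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]
subset = greedy_subset(labels, 4)
subset_mi = mutual_information([labels[i] for i in subset])
full_mi = mutual_information(labels)
```

On this toy input the greedy search keeps the four videos whose audio and visual labels agree, and the selected subset has higher empirical mutual information than the full collection. Re-estimating the mutual information for every candidate at every step is expensive, so curating millions of videos would need a far more efficient optimizer than this.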
