Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows

06/16/2021
by   Mahdi M. Kalayeh, et al.

The abundance and ease of utilizing sound, along with the fact that auditory cues reveal so much about what happens in a scene, make the audio-visual space an intuitive choice for self-supervised representation learning. However, the current literature suggests that training on uncurated data yields considerably poorer representations than training on curated alternatives collected in a supervised manner, and that the gap narrows only when the volume of data increases significantly. Furthermore, the quality of learned representations is known to be heavily influenced by the size and taxonomy of the curated datasets used for self-supervised training. This raises the question of whether we are celebrating too early on catching up with supervised learning, when our self-supervised efforts still rely almost exclusively on curated data. In this paper, we study the efficacy of learning from movies and TV shows as forms of uncurated data for audio-visual self-supervised learning. We demonstrate that a simple model based on contrastive learning, trained on a collection of movies and TV shows, not only dramatically outperforms more complex methods trained on uncurated datasets that are orders of magnitude larger, but also performs very competitively with the state of the art that learns from large-scale curated data. We identify that audio-visual patterns, such as the appearance of the main character or prominent scenes and mise-en-scène, which recur throughout the duration of a movie, lead to an overabundance of easy negative instances in the contrastive learning formulation. Capitalizing on this observation, we propose a hierarchical sampling policy which, despite its simplicity, effectively improves performance, particularly when learning from TV shows, which naturally exhibit less semantic diversity.
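The hierarchical sampling policy described above can be sketched as a two-stage draw: first sample a small set of titles, then sample several clips from each, so that a contrastive batch contains same-title negatives (which share characters and mise-en-scène, and are therefore harder) alongside cross-title ones. The sketch below is a minimal, hypothetical illustration of that idea; the function name, parameters, and data layout (`clips_by_title` as a dict mapping a title to its clip identifiers) are assumptions, not the paper's actual implementation.

```python
import random

def hierarchical_batch(clips_by_title, num_titles, clips_per_title, rng=None):
    """Build a contrastive batch with a two-stage (hierarchical) draw.

    Stage 1: sample `num_titles` distinct movies/shows.
    Stage 2: sample up to `clips_per_title` clips from each sampled title.

    Because several clips come from the same title, the in-batch negatives
    include same-title pairs, counteracting the overabundance of easy
    cross-title negatives.
    """
    rng = rng or random.Random()
    titles = rng.sample(sorted(clips_by_title), num_titles)
    batch = []
    for title in titles:
        pool = clips_by_title[title]
        k = min(clips_per_title, len(pool))
        batch.extend((title, clip) for clip in rng.sample(pool, k))
    return batch

# Hypothetical usage: 5 titles with 10 clips each, batch of 2 titles x 3 clips.
data = {f"title_{i}": list(range(10)) for i in range(5)}
batch = hierarchical_batch(data, num_titles=2, clips_per_title=3,
                           rng=random.Random(0))
```

In a real training loop, each sampled clip would be decoded into an audio-video pair, and the contrastive loss would treat all other clips in the batch as negatives; the sampling stage alone controls how many of those negatives are same-title.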

Related Research

- On Negative Sampling for Audio-Visual Contrastive Learning from Movies (04/29/2022)
- CLAR: Contrastive Learning of Auditory Representations (10/19/2020)
- Contrastive Learning of General-Purpose Audio Representations (10/21/2020)
- Shot Contrastive Self-Supervised Learning for Scene Boundary Detection (04/28/2021)
- Data-Efficient Contrastive Self-supervised Learning: Easy Examples Contribute the Most (02/18/2023)
- Exploiting Transformation Invariance and Equivariance for Self-supervised Sound Localisation (06/26/2022)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning (01/26/2021)
