Multimodal Self-Supervised Learning of General Audio Representations

04/26/2021
by Luyu Wang, et al.

We present a multimodal framework to learn general audio representations from videos. Existing contrastive audio representation learning methods mainly use the audio modality alone during training. In this work, we show that the additional information contained in video can be used to greatly improve the learned features. First, we demonstrate that our contrastive framework does not require high-resolution images to learn good audio features. This allows us to scale up the training batch size while keeping the computational load incurred by the additional video modality at a reasonable level. Second, we use augmentations that mix together different samples. We show that this makes the proxy task harder, which leads to substantial performance improvements when increasing the batch size. As a result, our audio model achieves a state-of-the-art 42.4 mAP on the AudioSet classification downstream task, closing the gap between supervised and self-supervised methods trained on the same dataset. Moreover, we show that our method is advantageous on a broad range of non-semantic audio tasks, including speaker identification, keyword spotting, language identification, and musical instrument classification.
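
The two ingredients named in the abstract, a contrastive objective between audio and video and an augmentation that mixes samples, can be illustrated with a small sketch. The abstract does not give the exact loss or mixing rule, so everything here is an assumption: the loss is a generic symmetric InfoNCE-style objective over paired embeddings, the mixing is a mixup-style convex combination, and the function names, the temperature of 0.1, and the Beta parameter of 0.4 are illustrative choices rather than the authors' settings.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy(logits, target):
    """Negative log-softmax of logits[target], computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]


def contrastive_loss(audio_emb, video_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed form, not the paper's exact loss).

    Row i of audio_emb and row i of video_emb come from the same clip and
    form the positive pair; every other pairing in the batch is a negative.
    """
    a = [l2_normalize(v) for v in audio_emb]
    v = [l2_normalize(x) for x in video_emb]
    n = len(a)
    # Temperature-scaled cosine similarity between every audio/video pair.
    sim = [[sum(p * q for p, q in zip(a[i], v[j])) / temperature
            for j in range(n)] for i in range(n)]
    audio_to_video = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    video_to_audio = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (audio_to_video + video_to_audio)


def mix_samples(batch, alpha=0.4, rng=random):
    """Mixup-style augmentation (assumed form): blend each sample with a
    randomly chosen sample from the same batch. The original sample keeps
    the larger weight so it remains the dominant component."""
    mixed = []
    for x in batch:
        other = rng.choice(batch)
        lam = rng.betavariate(alpha, alpha)
        lam = max(lam, 1.0 - lam)
        mixed.append([lam * p + (1.0 - lam) * q for p, q in zip(x, other)])
    return mixed


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    video_aligned = [[1.0, 0.0], [0.0, 1.0]]
    video_swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", contrastive_loss(audio, video_aligned))
    print("swapped loss:", contrastive_loss(audio, video_swapped))
    print("mixed batch:", mix_samples(audio, rng=random.Random(0)))
```

In this sketch a larger batch supplies more negatives per positive pair, which is one way to read the abstract's point that a harder proxy task pays off as the batch size grows.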
