Learning music audio representations via weak language supervision

12/08/2021
by   Ilaria Manco, et al.

Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. To address this question, we design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track. After pre-training, we transfer the audio backbone of the model to a set of music audio classification and regression tasks. We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies and show that our pre-training method consistently achieves comparable or higher scores on all tasks and datasets considered. Our experiments also confirm that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
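
The transfer step described in the abstract (reusing the pre-trained audio backbone for downstream classification) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify the backbone architecture or the downstream training protocol, so `frozen_backbone` is a hypothetical stand-in (a fixed random projection with mean pooling), the data are synthetic, and the downstream classifier is a simple softmax probe trained on frozen embeddings. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def frozen_backbone(audio, weights):
    """Hypothetical stand-in for a pre-trained audio backbone.

    audio:   (n_clips, n_frames, n_features) frame-level features
    weights: (n_features, embed_dim) fixed parameters, never updated here
    Returns one embedding per clip by mean-pooling over time.
    """
    hidden = np.maximum(audio @ weights, 0.0)  # linear projection + ReLU
    return hidden.mean(axis=1)                 # (n_clips, embed_dim)


def train_linear_probe(embeddings, labels, n_classes, lr=0.5, steps=300):
    """Fit a softmax classifier on frozen embeddings with gradient descent."""
    n, d = embeddings.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = embeddings @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # d(loss)/d(logits)
        W -= lr * embeddings.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(embeddings, W, b):
    return (embeddings @ W + b).argmax(axis=1)


# Synthetic "clips": each class has its own mean feature vector plus noise.
n_clips, n_frames, n_features, embed_dim, n_classes = 60, 20, 8, 16, 3
labels = rng.integers(0, n_classes, size=n_clips)
class_means = rng.normal(size=(n_classes, n_features))
audio = class_means[labels][:, None, :] + 0.1 * rng.normal(
    size=(n_clips, n_frames, n_features)
)

# The backbone stays fixed; only the probe's W and b are trained.
backbone_weights = rng.normal(size=(n_features, embed_dim))
embeddings = frozen_backbone(audio, backbone_weights)
W, b = train_linear_probe(embeddings, labels, n_classes)
accuracy = (predict(embeddings, W, b) == labels).mean()
print(f"training accuracy of the probe: {accuracy:.2f}")
```

Keeping the backbone fixed means that any difference in downstream scores reflects the quality of the learned representation rather than the capacity of the task-specific classifier, which is the kind of comparison the abstract describes between training strategies for the same audio backbone.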

research
09/21/2023

Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems

Linking sheet music images to audio recordings remains a key problem for...
research
04/24/2021

MusCaps: Generating Captions for Music Audio

Content-based music information retrieval has seen rapid progress with t...
research
10/07/2022

Supervised and Unsupervised Learning of Audio Representations for Music Understanding

In this work, we provide a broad comparative analysis of strategies for ...
research
02/22/2020

DECIBEL: Improving Audio Chord Estimation for Popular Music by Alignment and Integration of Crowd-Sourced Symbolic Representations

Automatic Chord Estimation (ACE) is a fundamental task in Music Informat...
research
02/12/2018

One Deep Music Representation to Rule Them All? : A comparative analysis of different representation learning strategies

Inspired by the success of deploying deep learning in the fields of Comp...
research
05/31/2023

Learning Music Sequence Representation from Text Supervision

Music representation learning is notoriously difficult for its complex h...
research
07/12/2021

Codified audio language modeling learns useful representations for music information retrieval

We demonstrate that language models pre-trained on codified (discretely-...