Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

10/27/2020
by Xavier Favory, et al.

Self-supervised audio representation learning offers an attractive alternative for obtaining generic audio embeddings that can be employed in various downstream tasks. Published approaches that consider both audio and the words/tags associated with audio do not employ text processing models capable of generalizing to tags unseen during training. In this work we propose a method for learning audio representations using an audio autoencoder (AAE), a general word embeddings model (WEM), and a multi-head self-attention (MHA) mechanism. MHA attends to the output of the WEM, providing a contextualized representation of the tags associated with the audio, and we align the output of MHA with the output of the encoder of AAE using a contrastive loss. We jointly optimize AAE and MHA, and we evaluate the audio representations (i.e. the output of the encoder of AAE) by using them in three different downstream tasks, namely sound, music genre, and music instrument classification. Our results show that employing self-attention with multiple heads in the tag-based network can induce better learned audio representations.
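The alignment step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single attention head with identity projections, mean pooling over tags, and an InfoNCE-style contrastive loss with a hypothetical `temperature` parameter, and it omits the autoencoder reconstruction term entirely. The function names and toy vectors are made up for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    # Subtract the maximum for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_pool(tag_vectors):
    """Contextualize the word embeddings of one tag set, then mean-pool.

    Assumption: one head, identity query/key/value projections.
    A trained MHA layer would learn these projections and use several heads.
    """
    d = len(tag_vectors[0])
    contextual = []
    for q in tag_vectors:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tag_vectors])
        contextual.append(
            [sum(w * v[i] for w, v in zip(weights, tag_vectors)) for i in range(d)]
        )
    n = len(contextual)
    return [sum(c[i] for c in contextual) / n for i in range(d)]

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]

def contrastive_loss(audio_embs, tag_embs, temperature=0.1):
    """InfoNCE-style loss: the i-th audio clip should match the i-th tag set.

    Assumption: the paper's exact contrastive loss may differ from this form.
    """
    audio = [normalize(a) for a in audio_embs]
    tags = [normalize(t) for t in tag_embs]
    loss = 0.0
    for i, a in enumerate(audio):
        probs = softmax([dot(a, t) / temperature for t in tags])
        loss += -math.log(probs[i])
    return loss / len(audio)

# Toy data: two audio embeddings and the word vectors of their tags.
audio_embs = [[1.0, 0.0], [0.0, 1.0]]
tag_sets = [
    [[0.9, 0.1], [0.8, 0.0]],  # tags of clip 0
    [[0.1, 0.9], [0.0, 1.0]],  # tags of clip 1
]
tag_embs = [self_attention_pool(ts) for ts in tag_sets]

aligned = contrastive_loss(audio_embs, tag_embs)
mismatched = contrastive_loss(audio_embs, tag_embs[::-1])
```

In this toy setup the loss for correctly paired clips and tag sets (`aligned`) is lower than the loss when the pairing is swapped (`mismatched`), which is the signal a joint optimization of the audio encoder and the attention layer would exploit.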


research
09/19/2018

Audio Based Disambiguation Of Music Genre Tags

In this paper, we propose to infer music genre embeddings from audio dat...
research
06/15/2020

COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

Audio representation learning based on deep neural networks (DNNs) emerg...
research
04/01/2021

Enriched Music Representations with Multiple Cross-modal Contrastive Learning

Modeling various aspects that make a music piece unique is a challenging...
research
05/24/2019

Self-supervised audio representation learning for mobile devices

We explore self-supervised models that can be potentially deployed on mo...
research
09/14/2023

EnCodecMAE: Leveraging neural codecs for universal audio representation learning

The goal of universal audio representation learning is to obtain foundat...
research
05/31/2023

Learning Music Sequence Representation from Text Supervision

Music representation learning is notoriously difficult for its complex h...
research
11/23/2021

Towards Learning Universal Audio Representations

The ability to learn universal audio representations that can solve dive...
