COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations

06/15/2020
by Xavier Favory, et al.

Audio representation learning based on deep neural networks (DNNs) has emerged as an alternative to hand-crafted features. To achieve high performance, DNNs often need a large amount of annotated data, which can be difficult and costly to obtain. In this paper, we propose a method for learning audio representations by aligning the learned latent representations of audio and associated tags. The alignment is done by maximizing the agreement between the latent representations of audio and tags using a contrastive loss. The result is an audio embedding model that reflects both acoustic and semantic characteristics of sounds. We evaluate the quality of our embedding model by measuring its performance as a feature extractor on three different tasks (sound event recognition, music genre classification, and musical instrument classification), and we investigate what type of characteristics the model captures. Our results are promising, sometimes on par with the state of the art in the considered tasks, and the embeddings produced with our method are well correlated with some acoustic descriptors.
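
To make the alignment idea concrete, the following is a minimal sketch of a cross-modal contrastive loss, not the authors' implementation. It assumes an InfoNCE-style formulation with cosine similarity and a temperature parameter; the function name `contrastive_alignment_loss`, the default `temperature=0.1`, and the symmetric averaging of both directions are illustrative choices that the abstract does not specify. Row i of each input is assumed to come from the same sound, so the matching audio/tag pairs sit on the diagonal of the similarity matrix.

```python
import numpy as np


def contrastive_alignment_loss(audio_z, tag_z, temperature=0.1):
    """Illustrative InfoNCE-style loss between audio and tag embeddings.

    audio_z, tag_z: arrays of shape (batch, dim). Row i of each array is
    assumed to describe the same sound, so (i, i) pairs are the positives
    and every other pair in the batch acts as a negative.
    """
    # L2-normalise so that dot products are cosine similarities.
    a = audio_z / np.linalg.norm(audio_z, axis=1, keepdims=True)
    t = tag_z / np.linalg.norm(tag_z, axis=1, keepdims=True)

    # (batch, batch) matrix of scaled similarities.
    logits = a @ t.T / temperature

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row; the target class
        # for row i is column i (the matching embedding).
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the audio-to-tag and tag-to-audio directions (an assumption).
    return 0.5 * (diagonal_cross_entropy(logits)
                  + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 16))
    print("aligned: ", contrastive_alignment_loss(z, z))
    print("shuffled:", contrastive_alignment_loss(z, np.roll(z, 1, axis=0)))
```

Minimizing this quantity pulls each audio embedding toward the embedding of its own tags and pushes it away from the tags of other sounds in the batch, which is one common way to "maximize agreement" between two latent spaces. In the small demo, the loss for correctly paired rows is lower than the loss for rows paired with the wrong partner.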

research
10/27/2020

Learning Contextual Tag Embeddings for Cross-Modal Alignment of Audio and Tags

Self-supervised audio representation learning offers an attractive alter...
research
04/15/2019

Are Nearby Neighbors Relatives?: Diagnosing Deep Music Embedding Spaces

Deep neural networks have frequently been used to directly learn represe...
research
03/22/2020

Audio Impairment Recognition Using a Correlation-Based Feature Representation

Audio impairment recognition is based on finding noise in audio files an...
research
07/27/2017

Learning Audio Sequence Representations for Acoustic Event Classification

Acoustic Event Classification (AEC) has become a significant task for ma...
research
02/09/2018

Predicting Audio Advertisement Quality

Online audio advertising is a particular form of advertising used abunda...
research
07/12/2016

City-Identification of Flickr Videos Using Semantic Acoustic Features

City-identification of videos aims to determine the likelihood of a vide...
research
07/03/2019

A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features

The explainability of Convolutional Neural Networks (CNNs) is a particul...
