Cross-modal variational inference for bijective signal-symbol translation

Extracting symbolic information from signals is an active field of research that enables numerous applications, especially in the Music Information Retrieval domain. This complex task, which is related to other topics such as pitch extraction and instrument recognition, has given rise to numerous approaches, mostly based on advanced signal processing algorithms. However, these techniques are often non-generic: they allow the extraction of specific physical properties of the signal (pitch, octave), but not of arbitrary vocabularies or more general annotations. In addition, these techniques are one-sided: they can extract symbolic data from an audio signal, but cannot perform the reverse process of symbol-to-signal generation. In this paper, we propose a bijective approach to signal/symbol translation by turning this problem into a density estimation task over the signal and symbolic domains, both considered as related random variables. We estimate this joint distribution with two different variational auto-encoders, one for each domain, whose inner representations are forced to match through an additive constraint. This lets both models learn and generate separately while also allowing signal-to-symbol and symbol-to-signal inference. In this article, we test our models on pitch, octave and dynamics symbols, which constitute a fundamental step towards music transcription and label-constrained audio generation. In addition to its versatility, this system is rather lightweight during training and generation, and it allows several interesting creative uses that we outline at the end of the article.
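
As an illustration of the coupling described above, the following is a minimal sketch of how a training objective for two variational auto-encoders with matched latent representations might be assembled. It is not the authors' implementation. The function names (`gaussian_kl`, `joint_loss`), the diagonal Gaussian posteriors, the standard normal prior, the symmetric KL used as the matching term, and the `match_weight` parameter are all assumptions made for this example; the abstract only states that the constraint is additive.

```python
import numpy as np


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the latent axis."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    per_dim = 0.5 * (logvar_p - logvar_q
                     + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return per_dim.sum(axis=-1)


def joint_loss(nll_signal, nll_symbol, q_signal, q_symbol, match_weight=1.0):
    """Hypothetical joint objective for a signal VAE and a symbol VAE.

    nll_signal, nll_symbol: per-example reconstruction losses, shape (batch,).
    q_signal, q_symbol: (mean, log-variance) of each encoder's posterior,
        each of shape (batch, latent_dim).
    match_weight: assumed weight of the latent matching term.
    """
    mu_x, logvar_x = q_signal
    mu_y, logvar_y = q_symbol
    zeros = np.zeros_like(mu_x)  # mean and log-variance of a standard normal prior

    # Negative ELBO of each auto-encoder, computed independently.
    neg_elbo_signal = nll_signal + gaussian_kl(mu_x, logvar_x, zeros, zeros)
    neg_elbo_symbol = nll_symbol + gaussian_kl(mu_y, logvar_y, zeros, zeros)

    # Additive constraint pulling the two latent posteriors together.
    # A symmetric KL is an assumption of this sketch.
    match = (gaussian_kl(mu_x, logvar_x, mu_y, logvar_y)
             + gaussian_kl(mu_y, logvar_y, mu_x, logvar_x))

    return float(np.mean(neg_elbo_signal + neg_elbo_symbol + match_weight * match))
```

Under this sketch, each model keeps its own encoder and decoder, so either one can be trained or sampled on its own. Cross-modal inference would then amount to encoding with one model and decoding with the other, which only makes sense once the matching term has aligned the two latent spaces.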
