MusCaps: Generating Captions for Music Audio

04/24/2021
by Ilaria Manco, et al.

Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically rely on classification models, as in auto-tagging or genre and mood classification. In this work, we propose to address music description via audio captioning, defined as the task of generating a natural language description of music audio content in a human-like manner. To this end, we present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention. Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs through a multimodal encoder, and leverages pre-training on audio data to obtain representations that effectively capture and summarise musical features in the input. Evaluation of the generated captions through automatic metrics shows that our method outperforms a baseline designed for non-music audio captioning. Through an ablation study, we find that this performance boost can be attributed mainly to pre-training of the audio encoder, while the other design choices (modality fusion, decoding strategy and the use of attention) contribute only marginally. Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding to bridge the semantic gap in music information retrieval.
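To make the idea of temporal attention in an encoder-decoder more concrete, the following is a minimal sketch in plain Python. It is not the MusCaps implementation: the scoring function (a simple dot product), the names `temporal_attention`, `audio_frames` and `decoder_state`, and the toy values are all assumptions chosen for illustration. It only shows the general mechanism, in which the decoder weights the encoded audio frames at each step and summarises them into a single context vector.

```python
import math


def softmax(scores):
    """Normalise raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(audio_frames, decoder_state):
    """Weight each encoded audio frame by its relevance to the decoder state.

    audio_frames: list of T feature vectors (one per time step), each of length D.
    decoder_state: current decoder hidden state, also of length D (assumed here
    so that a dot product can serve as the relevance score).
    Returns the attention weights over time and the resulting context vector.
    """
    # Dot-product score between each frame and the decoder state (an assumption;
    # other scoring functions, such as additive attention, are also common).
    scores = [sum(f * h for f, h in zip(frame, decoder_state))
              for frame in audio_frames]
    weights = softmax(scores)
    dim = len(audio_frames[0])
    # Context vector: weighted sum of the frames along the time axis.
    context = [sum(w * frame[d] for w, frame in zip(weights, audio_frames))
               for d in range(dim)]
    return weights, context


# Toy example: three time steps with two-dimensional features (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [0.5, 0.5]
weights, context = temporal_attention(frames, state)
```

In a captioning model, a context vector of this kind would be recomputed at every decoding step and passed to the recurrent decoder together with the previously generated word.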

research
12/08/2021

Learning music audio representations via weak language supervision

Audio representations for music information retrieval are typically lear...
research
11/24/2020

A Novel Multimodal Music Genre Classifier using Hierarchical Attention and Convolutional Neural Network

Music genre classification is one of the trending topics in regards to t...
research
07/19/2021

Sequence-to-Sequence Piano Transcription with Transformers

Automatic Music Transcription has seen significant progress in recent ye...
research
01/15/2013

Audio Classical Composer Identification by Deep Neural Network

Audio Classical Composer Identification (ACC) is an important problem in...
research
08/22/2023

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

Text-to-music generation (T2M-Gen) faces a major obstacle due to the sca...
research
07/31/2023

LP-MusicCaps: LLM-Based Pseudo Music Captioning

Automatic music captioning, which generates natural language description...
research
09/15/2018

Attention as a Perspective for Learning Tempo-invariant Audio Queries

Current models for audio--sheet music retrieval via multimodal embedding...
