Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

08/22/2023
by Shansong Liu, et al.

Text-to-music generation (T2M-Gen) faces a major obstacle: large-scale, publicly available music datasets with natural language captions are scarce. To address this, we propose the Music Understanding LLaMA (MU-LLaMA), a model capable of answering music-related questions and generating captions for music files. Our model uses audio representations from a pretrained MERT model as its music features. However, obtaining a suitable dataset for training MU-LLaMA remains challenging, as existing publicly accessible audio question answering datasets lack the necessary depth for open-ended music question answering. To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset, designed for answering open-ended music-related questions. Experiments demonstrate that MU-LLaMA, trained on the MusicQA dataset, achieves outstanding performance in both music question answering and music caption generation across various metrics, outperforming current state-of-the-art (SOTA) models in both fields and offering a promising advancement for T2M-Gen research.

research
09/15/2023

MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response

Large Language Models (LLMs) have shown immense potential in multimodal ...
research
07/31/2023

LP-MusicCaps: LLM-Based Pseudo Music Captioning

Automatic music captioning, which generates natural language description...
research
03/07/2020

PathVQA: 30000+ Questions for Medical Visual Question Answering

Is it possible to develop an "AI Pathologist" to pass the board-certifie...
research
04/24/2021

MusCaps: Generating Captions for Music Audio

Content-based music information retrieval has seen rapid progress with t...
research
08/03/2023

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

We present the All-Seeing (AS) project: a large-scale data and model for...
research
06/10/2020

ClarQ: A large-scale and diverse dataset for Clarification Question Generation

Question answering and conversational systems are often baffled and need...
research
09/20/2023

Investigating Personalization Methods in Text to Music Generation

In this work, we investigate the personalization of text-to-music diffus...
