Language-Guided Music Recommendation for Video via Prompt Analogies

06/15/2023
by Daniel McKee, et al.

We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language. A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music. This work addresses this challenge with the following three contributions. First, we propose a text-synthesis approach that relies on an analogy-based prompting procedure to generate natural language music descriptions from a large-scale language model (BLOOM-176B) given pre-trained music tagger outputs and a small number of human text descriptions. Second, we use these synthesized music descriptions to train a new trimodal model, which fuses text and video input representations to query music samples. For training, we introduce a text dropout regularization mechanism, which we show is critical to model performance. Our model design allows the retrieved music audio to agree with both input modalities by matching the visual style depicted in the video and the musical genre, mood, or instrumentation described in the natural language query. Third, to evaluate our approach, we collect a testing dataset for our problem by annotating a subset of 4k clips from the YT8M-MusicVideo dataset with natural language music descriptions, which we make publicly available. We show that our approach can match or exceed the performance of prior methods on video-to-music retrieval while significantly improving retrieval accuracy when using text guidance.
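The analogy-based prompting step in the first contribution can be illustrated with a minimal sketch. It assumes the prompt is laid out as a few (tags, description) pairs followed by a new tag set with the description left open for the language model to complete. The function name `build_analogy_prompt`, the `Tags:`/`Description:` layout, and the example tags and descriptions are all hypothetical; the abstract does not specify the actual prompt format, and no language model is called here.

```python
from typing import List, Sequence, Tuple


def build_analogy_prompt(
    examples: Sequence[Tuple[Sequence[str], str]],
    query_tags: Sequence[str],
) -> str:
    """Build a few-shot prompt: tags -> description pairs, then an open query.

    Hypothetical layout, not the format used by the authors.
    """
    lines: List[str] = []
    # Each human-written example shows how a tag set maps to a description.
    for tags, description in examples:
        lines.append(f"Tags: {', '.join(tags)}")
        lines.append(f"Description: {description}")
        lines.append("")
    # The query repeats the pattern but leaves the description unfinished,
    # so a language model would complete it by analogy.
    lines.append(f"Tags: {', '.join(query_tags)}")
    lines.append("Description:")
    return "\n".join(lines)


# Illustrative (made-up) tagger outputs and human descriptions.
examples = [
    (["rock", "electric guitar", "energetic"],
     "An energetic rock track driven by electric guitar."),
    (["piano", "classical", "calm"],
     "A calm classical piano piece."),
]
prompt = build_analogy_prompt(examples, ["hip hop", "synth", "dark"])
print(prompt)
```

The resulting string would be passed to a text-completion model, and the generated continuation would serve as the synthesized music description for the tagged track.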

research
08/17/2016

Towards Music Captioning: Generating Music Playlist Descriptions

Descriptions are often provided along with recommendations to help users...
research
11/21/2022

Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task

Benefiting from large-scale datasets and pre-trained models, the field o...
research
05/23/2023

When the Music Stops: Tip-of-the-Tongue Retrieval for Music

We present a study of Tip-of-the-tongue (ToT) retrieval for music, where...
research
09/21/2021

Audio Interval Retrieval using Convolutional Neural Networks

Modern streaming services are increasingly labeling videos based on thei...
research
10/02/2022

Music-to-Text Synaesthesia: Generating Descriptive Text from Music Recordings

In this paper, we consider a novel research problem, music-to-text synae...
research
03/21/2023

In-depth analysis of music structure as a self-organized network

Words in a natural language not only transmit information but also evolv...
research
07/07/2023

LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad

Launchpad is a musical instrument that allows users to create and perfor...
