Deep Cross-Modal Correlation Learning for Audio and Lyrics in Music Retrieval

11/24/2017
by Yi Yu, et al.

Little research has focused on cross-modal correlation learning that accounts for the temporal structure of different data modalities such as audio and lyrics. Motivated by the inherently temporal nature of music, we learn the deep sequential correlation between audio and lyrics. In this work, we propose a deep cross-modal correlation learning architecture with two-branch deep neural networks, one for the audio modality and one for the text modality (lyrics). Data from the two modalities are mapped into the same canonical space, where inter-modal canonical correlation analysis (CCA) serves as the objective function for measuring the similarity of temporal structures. To our knowledge, this is the first study to model the correlation between language and music audio through deep architectures that learn the paired temporal correlation of audio and lyrics. Lyrics are represented by a pre-trained Doc2vec model followed by fully-connected layers (a fully-connected deep neural network). On the audio branch, we make two contributions: i) we investigate a pre-trained CNN followed by fully-connected layers for representing music audio, and ii) we propose an end-to-end architecture that trains the convolutional and fully-connected layers jointly to better learn the temporal structure of music audio. In particular, our end-to-end deep architecture has two properties: it performs feature learning and cross-modal correlation learning simultaneously, and it learns a joint representation that takes temporal structure into account. Experimental results on retrieving lyrics from audio and audio from lyrics verify the effectiveness of the proposed deep correlation learning architectures for cross-modal music retrieval.
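The core idea of the correlation step can be illustrated with classical linear CCA. The paper trains deep two-branch networks with an inter-modal CCA objective; the sketch below, under simplified assumptions, shows only the linear CCA computation on already-extracted paired feature vectors (e.g., CNN audio features and Doc2vec lyric vectors). All variable names and dimensions here are hypothetical, not taken from the paper.

```python
import numpy as np

def linear_cca(X, Y, k=2, reg=1e-3):
    """Classical linear CCA between two paired feature views.

    X: (n, dx) features from one modality (e.g., audio features).
    Y: (n, dy) features from the other modality (e.g., lyric vectors).
    Returns projection matrices A (dx, k), B (dy, k) and the top-k
    canonical correlations. A deep CCA model would learn the branch
    networks feeding X and Y; this sketch covers only the linear step.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates.
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    # Whiten each view via Cholesky factors, then take the SVD of the
    # whitened cross-covariance; singular values are the correlations.
    Lx_inv = np.linalg.inv(np.linalg.cholesky(Sxx))
    Ly_inv = np.linalg.inv(np.linalg.cholesky(Syy))
    U, s, Vt = np.linalg.svd(Lx_inv @ Sxy @ Ly_inv.T, full_matrices=False)
    A = Lx_inv.T @ U[:, :k]   # maps X into the shared canonical space
    B = Ly_inv.T @ Vt[:k].T   # maps Y into the shared canonical space
    return A, B, s[:k]

# Toy demo: two views generated from a shared 2-dim latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
audio_feats = z @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(500, 10))
lyric_feats = z @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))
A, B, corrs = linear_cca(audio_feats, lyric_feats, k=2)
```

Once both modalities are projected into the shared space, cross-modal retrieval reduces to nearest-neighbor search (e.g., by cosine similarity) between projected audio and projected lyrics, which is how audio-to-lyrics and lyrics-to-audio queries are served.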


Related research

08/21/2019 · Learning Joint Embedding for Cross-Modal Retrieval
A cross-modal retrieval process is to use a query in one modality to obt...

08/10/2019 · Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
Deep learning has successfully shown excellent performance in learning j...

01/14/2019 · Learning Shared Semantic Space with Correlation Alignment for Cross-modal Event Retrieval
In this paper, we propose to learn shared semantic space with correlatio...

08/24/2022 · Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model
Lyric interpretations can help people understand songs and their lyrics ...

09/21/2023 · Passage Summarization with Recurrent Models for Audio-Sheet Music Retrieval
Many applications of cross-modal music retrieval are related to connecti...

05/18/2020 · End-to-End Lip Synchronisation
The goal of this work is to synchronise audio and video of a talking fac...

03/29/2022 · Iranian Modal Music (Dastgah) detection using deep neural networks
In this work, several deep neural networks are implemented to recognize ...
