Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems

by Luis Carvalho, et al.

Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches toward this task is to learn a cross-modal embedding space via deep neural networks that is able to connect short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content affects the capability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning: as a pre-training step, we expose a network to a large amount of real music data by contrasting randomly augmented views of snippets of both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models are able to retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct more experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We then conclude by arguing for the potential of self-supervised contrastive learning for alleviating the annotated data scarcity in multi-modal music retrieval models.
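The idea of contrasting randomly augmented views can be illustrated with a small sketch. The abstract does not state which loss is used, so the NT-Xent (normalized temperature-scaled cross-entropy) objective below is an assumption chosen for illustration. The function name `nt_xent_loss`, the temperature value, and the random arrays standing in for snippet embeddings are all hypothetical.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.5):
    """Illustrative NT-Xent loss (an assumed choice, not confirmed by the abstract).

    z_a, z_b: arrays of shape (n, d) holding embeddings of two augmented
    views of the same n snippets; row i of z_a and row i of z_b form a
    positive pair, and all other rows act as negatives.
    """
    n = z_a.shape[0]
    z = np.concatenate([z_a, z_b], axis=0)
    # L2-normalize so the dot product is the cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    # Exclude each embedding's similarity with itself.
    np.fill_diagonal(sim, -np.inf)
    # Index of the positive partner for every row.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), pos] - log_norm
    return float(-log_prob.mean())

# Stand-in embeddings: in practice these would come from an encoder
# applied to two random augmentations of each snippet.
rng = np.random.default_rng(0)
view_a = rng.normal(size=(8, 16))
view_b = view_a + 0.01 * rng.normal(size=view_a.shape)
loss = nt_xent_loss(view_a, view_b)
```

The loss is low when the two views of each snippet land close together and far from the other snippets in the batch, which is the property the pre-training step relies on.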




Passage Summarization with Recurrent Models for Audio-Sheet Music Retrieval

Many applications of cross-modal music retrieval are related to connecti...

Cross-Modal Music-Video Recommendation: A Study of Design Choices

In this work, we study music/video cross-modal recommendation, i.e. reco...

Learning music audio representations via weak language supervision

Audio representations for music information retrieval are typically lear...

RECAP: Retrieval Augmented Music Captioner

With the prevalence of stream media platforms serving music search and r...

Self-Supervised Beat Tracking in Musical Signals with Polyphonic Contrastive Learning

Annotating musical beats is a very long and tedious process. In order to ...

Interpreting Song Lyrics with an Audio-Informed Pre-trained Language Model

Lyric interpretations can help people understand songs and their lyrics ...

Contrastive Learning for Cross-modal Artist Retrieval

Music retrieval and recommendation applications often rely on content fe...
