Exploring modality-agnostic representations for music classification

06/02/2021
by   Ho-Hsiang Wu, et al.

Music information is often conveyed or recorded across multiple data modalities including, but not limited to, audio, images, text, and scores. However, music information retrieval research has almost exclusively focused on single-modality recognition, requiring the development of separate models for each modality. Some multi-modal works require multiple coexisting modalities as inputs to the model, constraining their use to the few cases where data from all modalities are available. To the best of our knowledge, no existing model can take inputs from varying modalities, e.g., images or sounds, and classify them into unified music categories. We explore the use of cross-modal retrieval as a pretext task to learn modality-agnostic representations, which can then be used as inputs to classifiers that are independent of modality. We select instrument classification as an example task for our study, as both the visual and audio components provide relevant semantic information. We train music instrument classifiers that can take either images or sounds as input and perform comparably to sound-only or image-only classifiers. Furthermore, we explore the case in which there is limited labeled data for a given modality, and the impact on performance of using labeled data from other modalities. We are able to achieve almost 70% accuracy in a zero-shot setting. We provide a detailed analysis of the experimental results to understand the potential and limitations of the approach, and discuss future steps towards modality-agnostic classifiers.
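The core idea above — using cross-modal retrieval as a pretext task so that both modalities land in one shared embedding space — can be sketched with a minimal, hypothetical example. This is not the authors' implementation: the linear encoders, the 64-dimensional shared space, and the temperature value are all illustrative assumptions; it only shows the shape of an InfoNCE-style retrieval objective over paired audio/image batches, after which a single classifier can consume embeddings from either modality.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Project each row onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical per-modality encoders: simple linear maps into a shared 64-d space.
# (In practice these would be trained audio and image networks.)
W_audio = rng.normal(size=(128, 64))   # 128-d audio features -> shared space
W_image = rng.normal(size=(256, 64))   # 256-d image features -> shared space

def encode_audio(x):
    return l2_normalize(x @ W_audio)

def encode_image(x):
    return l2_normalize(x @ W_image)

# A batch of paired (audio, image) examples for the retrieval pretext task.
audio = rng.normal(size=(8, 128))
image = rng.normal(size=(8, 256))
za, zi = encode_audio(audio), encode_image(image)

# InfoNCE-style cross-modal retrieval loss: each audio clip should retrieve
# its paired image among the batch (row i matches column i).
temperature = 0.07                      # assumed value, for illustration only
logits = (za @ zi.T) / temperature
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -np.diag(log_probs).mean()       # cross-entropy against the diagonal

# Because za and zi live in the same space, a downstream classifier trained
# on these embeddings is modality-agnostic: it can take either as input.
```

Minimizing this loss pulls paired audio and image embeddings together, which is what lets a single instrument classifier operate on the shared space regardless of the input modality.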

