MM-ALT: A Multimodal Automatic Lyric Transcription System

07/13/2022
by Xiangming Gu, et al.

Automatic lyric transcription (ALT) is a nascent field of study attracting increasing interest from both the speech and music information retrieval communities, given its significant application potential. However, ALT with audio data alone is a notoriously difficult task, because instrumental accompaniment and musical constraints degrade both the phonetic cues and the intelligibility of sung lyrics. To tackle this challenge, we propose the MultiModal Automatic Lyric Transcription system (MM-ALT), together with a new dataset, N20EM, which consists of audio recordings, videos of lip movements, and inertial measurement unit (IMU) data from an earbud worn by the performing singer. We first adapt the wav2vec 2.0 framework from automatic speech recognition (ASR) to the ALT task. We then propose a video-based ALT method and an IMU-based voice activity detection (VAD) method. In addition, we put forward the Residual Cross Attention (RCA) mechanism to fuse data from the three modalities (i.e., audio, video, and IMU). Experiments show the effectiveness of our proposed MM-ALT system, especially in terms of noise robustness.
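
The abstract names the Residual Cross Attention (RCA) mechanism but does not give its formulation. As a rough illustration only, the sketch below shows generic scaled dot-product cross-attention with a residual connection, in which one modality's features (here, audio) attend over another's (here, video). The function name, feature dimension, sequence lengths, and random projection matrices are all assumptions made for illustration and are not taken from the paper; the actual RCA design may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def residual_cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Generic cross-attention with a residual connection (hypothetical helper).

    query_feats:   (T_q, d) features of the modality being refined, e.g. audio.
    context_feats: (T_c, d) features of the modality attended over, e.g. video.
    w_q, w_k, w_v: (d, d) projection matrices (random here, learned in practice).
    """
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    # Each query frame scores every context frame; rows are normalized to sum to 1.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    attended = attn @ v
    # Residual connection: the original query features are added back unchanged.
    return query_feats + attended, attn


# Assumed toy shapes: 50 audio frames and 25 video frames, 8-dimensional features.
rng = np.random.default_rng(0)
d = 8
audio = rng.standard_normal((50, d))
video = rng.standard_normal((25, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

fused, attn = residual_cross_attention(audio, video, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

One property of a residual design like this is that if the attended term contributes nothing (for example, a zero value projection), the output falls back to the unmodified query features, so an uninformative second modality need not corrupt the first.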


research
01/05/2022

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Video recordings of speech contain correlated audio and visual informati...
research
04/29/2020

Multiresolution and Multimodal Speech Recognition with Transformers

This paper presents an audio visual automatic speech recognition (AV-ASR...
research
06/21/2021

Attention-based cross-modal fusion for audio-visual voice activity detection in musical video streams

Many previous audio-visual voice-related works focus on speech, ignoring...
research
05/17/2019

The Audio Auditor: Participant-Level Membership Inference in Voice-Based IoT

Voice interfaces and assistants implemented by various services have bec...
research
05/02/2020

MultiQT: Multimodal Learning for Real-Time Question Tracking in Speech

We address a challenging and practical task of labeling questions in spe...
research
05/17/2019

The Audio Auditor: Participant-Level Membership Inference in Internet of Things Voice Services

Voice interfaces and assistants implemented by various services have bec...
research
12/09/2021

X-Vector based voice activity detection for multi-genre broadcast speech-to-text

Voice Activity Detection (VAD) is a fundamental preprocessing step in au...
