End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model

02/18/2019
by Daniel Stoller, et al.

Time-aligned lyrics can enrich the music listening experience by enabling karaoke, text-based song retrieval, intra-song navigation, and other applications. Compared to text-to-speech alignment, lyrics alignment remains highly challenging, despite many attempts to break the problem down by combining numerous sub-modules, including vocal separation and detection. Furthermore, training such systems has required fine-grained annotations to be available in some form. Here, we present a novel system based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio using learnt multi-scale representations of the various signal components. There are no sub-modules whose interdependencies need to be optimized. Our training procedure is designed to work with the weak, line-level annotations available in the real world. With a mean alignment error of 0.35 s on a standard dataset, our system outperforms the state-of-the-art by an order of magnitude.
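To make the step from per-frame character probabilities to time stamps concrete, the following is a minimal sketch of a monotonic forced alignment by dynamic programming. It is an illustration under stated assumptions, not the procedure described in the paper: the abstract does not specify how the alignment is decoded, and the function name `align_characters`, the input layout (one list of log-probabilities per audio frame, indexed by character id), and the absence of a blank or silence symbol are all hypothetical simplifications.

```python
import math


def align_characters(log_probs, target):
    """Return the start frame of each target character.

    log_probs: list of T frames, each a list of per-character log-probabilities
               (hypothetical layout, assumed for this sketch).
    target:    list of N character ids, in lyric order.
    Every character is assigned at least one frame, and the assignment is
    monotonic in time. No blank/silence symbol is modelled here.
    """
    T, N = len(log_probs), len(target)
    if N == 0 or N > T:
        raise ValueError("need 1 <= len(target) <= number of frames")

    NEG = -math.inf
    # score[t][n]: best total log-probability of frames 0..t
    # when frame t is assigned to character n.
    score = [[NEG] * N for _ in range(T)]
    # advanced[t][n]: True if the best path entered character n at frame t.
    advanced = [[False] * N for _ in range(T)]
    score[0][0] = log_probs[0][target[0]]

    for t in range(1, T):
        for n in range(min(t + 1, N)):
            stay = score[t - 1][n]                       # same character continues
            move = score[t - 1][n - 1] if n > 0 else NEG  # next character begins
            advanced[t][n] = move > stay
            score[t][n] = max(stay, move) + log_probs[t][target[n]]

    # Backtrack from the last frame and last character.
    starts = [0] * N
    n = N - 1
    for t in range(T - 1, 0, -1):
        if advanced[t][n]:
            starts[n] = t
            n -= 1
    return starts


# Toy example: two characters (ids 0 and 1), four frames.
hi, lo = math.log(0.9), math.log(0.1)
frames = [[hi, lo], [hi, lo], [lo, hi], [lo, hi]]
print(align_characters(frames, [0, 1]))
```

Start frames can be converted to seconds by multiplying by the duration of one output frame of the model, which depends on its temporal resolution and is not given in the abstract. An alignment error such as the one reported could then be measured as the mean absolute difference between predicted and annotated start times.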

research
07/27/2021

Audio-to-Score Alignment Using Deep Automatic Music Transcription

Audio-to-score alignment (A2SA) is a multimodal task consisting in the a...
research
02/12/2020

AlignNet: A Unifying Approach to Audio-Visual Alignment

We present AlignNet, a model that synchronizes videos with reference aud...
research
11/13/2017

Audio-to-score alignment of piano music using RNN-based automatic music transcription

We propose a framework for audio-to-score alignment on piano performance...
research
06/13/2023

Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

Lyrics alignment gained considerable attention in recent years. State-of...
research
11/05/2018

Manner of Articulation Detection using Connectionist Temporal Classification to Improve Automatic Speech Recognition Performance

Conventionally, the manner of articulations in speech signal are derived...
research
06/05/2020

End-to-End Adversarial Text-to-Speech

Modern text-to-speech synthesis pipelines typically involve multiple pro...
research
07/10/2023

HCLAS-X: Hierarchical and Cascaded Lyrics Alignment System Using Multimodal Cross-Correlation

In this work, we address the challenge of lyrics alignment, which involv...
