Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

06/13/2023
by   Simon Durand, et al.

Lyrics alignment has gained considerable attention in recent years. State-of-the-art systems either reuse established speech recognition toolkits or design end-to-end solutions involving a Connectionist Temporal Classification (CTC) loss. However, both approaches suffer from specific weaknesses: toolkits are known for their complexity, and CTC systems use a loss designed for transcription, which can limit alignment accuracy. In this paper, we instead use a contrastive learning procedure that derives cross-modal embeddings linking the audio and text domains. This way, we obtain a novel system that is simple to train end-to-end, can make use of weakly annotated training data, jointly learns a powerful text model, and is tailored to alignment. The system is not only the first to yield an average absolute error below 0.2 seconds on the standard Jamendo dataset, but it is also robust to other languages, even when trained on English data only. Finally, we release word-level alignments for the JamendoLyrics Multi-Lang dataset.
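
To make the idea of aligning through cross-modal embeddings concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-frame audio embeddings and per-token text embeddings have already been produced by some trained encoders (not shown), scores them with cosine similarity, and decodes a monotonic alignment with dynamic programming. The decoding scheme, the function names (`similarity_matrix`, `monotonic_align`, `average_absolute_error`), and the frame hop value are all illustrative assumptions; the abstract does not specify them.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]

def similarity_matrix(audio_emb, text_emb):
    """Cosine similarity between every audio frame and every text token."""
    audio = [_normalize(a) for a in audio_emb]
    text = [_normalize(t) for t in text_emb]
    return [[_dot(a, t) for t in text] for a in audio]

def monotonic_align(sim):
    """Assign each frame to one token so that token indices never decrease,
    advance by at most one per frame, start at the first token and end at
    the last. Returns the first frame index assigned to each token."""
    n_frames, n_tokens = len(sim), len(sim[0])
    if n_frames < n_tokens:
        raise ValueError("need at least one frame per token")
    neg_inf = float("-inf")
    score = [[neg_inf] * n_tokens for _ in range(n_frames)]
    back = [[0] * n_tokens for _ in range(n_frames)]
    score[0][0] = sim[0][0]
    for t in range(1, n_frames):
        for k in range(min(t + 1, n_tokens)):
            stay = score[t - 1][k]
            move = score[t - 1][k - 1] if k > 0 else neg_inf
            if move > stay:
                score[t][k], back[t][k] = move + sim[t][k], k - 1
            else:
                score[t][k], back[t][k] = stay + sim[t][k], k
    # Backtrack from the last token at the last frame.
    k = n_tokens - 1
    path = [k]
    for t in range(n_frames - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return [path.index(token) for token in range(n_tokens)]

def average_absolute_error(onset_frames, reference_s, hop_s):
    """Mean absolute difference (seconds) between predicted and reference onsets."""
    errors = [abs(f * hop_s - r) for f, r in zip(onset_frames, reference_s)]
    return sum(errors) / len(errors)

# Toy example: 5 frames, 2 tokens, 2-dimensional embeddings (made-up values).
audio_emb = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
text_emb = [[1, 0], [0, 1]]
onsets = monotonic_align(similarity_matrix(audio_emb, text_emb))
aae = average_absolute_error(onsets, [0.0, 1.2], hop_s=0.5)
```

In this toy case the second token is assigned to start at frame 2, and the error is measured against hand-written reference onsets. A real system would learn the embeddings so that matching audio and text score higher than mismatched pairs, which is the role of the contrastive objective.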


research
08/14/2023

Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder

In recent research, slight performance improvement is observed from auto...
research
05/18/2018

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Recent research has shown that word embedding spaces learned from text c...
research
06/15/2022

Text-Aware End-to-end Mispronunciation Detection and Diagnosis

Mispronunciation detection and diagnosis (MDD) technology is a key compo...
research
04/11/2022

Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems

Recent advances in End-to-End (E2E) Spoken Language Understanding (SLU) ...
research
02/03/2022

Improving Lyrics Alignment through Joint Pitch Detection

In recent years, the accuracy of automatic lyrics alignment methods has ...
research
02/18/2019

End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-Character Recognition Model

Time-aligned lyrics can enrich the music listening experience by enablin...
research
03/27/2022

End-to-End Active Speaker Detection

Recent advances in the Active Speaker Detection (ASD) problem build upon...
