DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm

06/25/2019
by   Gabriel Meseguer-Brocal, et al.

The goal of this paper is twofold. First, we introduce DALI, a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics at four levels of granularity. Second, we explain our methodology, in which dataset creation and model learning interact through a teacher-student machine learning paradigm so that each benefits the other. We start with a set of manual annotations of draft time-aligned lyrics and notes made by non-expert users of Karaoke games. This set comes without audio, so we need to find the corresponding audio and adapt the annotations to it. To that end, we retrieve audio candidates from the Web. Each candidate is then turned into a singing-voice probability over time using a teacher, a deep convolutional neural network singing-voice detection (SVD) system trained on cleaned data. By comparing the time-aligned lyrics with the singing-voice probability, we detect matches and update the time alignment of the lyrics accordingly. From this, we obtain new audio sets. These are then used to train new SVD students, which are in turn used to perform the above comparison again. The process can be repeated iteratively. We show that this progressively improves the performance of our SVD system and yields better audio matching and alignment.
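The comparison between the annotated lyrics and the singing-voice probability can be illustrated with a small sketch. It is not the authors' implementation: it assumes that only a global time offset needs correcting, that the lyrics are given as (start, end) segments in seconds, and that the SVD output is a list of per-frame probabilities. The function names, the frame rate, and the acceptance threshold mentioned afterwards are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_s, end_s) of one annotated lyric unit


def annotation_to_frames(segments: Sequence[Segment],
                         n_frames: int, frame_rate: float) -> List[float]:
    """Turn lyric segments into a 0/1 vocal-activity sequence, one value per frame."""
    frames = [0.0] * n_frames
    for start, end in segments:
        first = max(0, int(round(start * frame_rate)))
        last = min(n_frames, int(round(end * frame_rate)))
        for i in range(first, last):
            frames[i] = 1.0
    return frames


def best_offset(annotation: Sequence[float], svd_prob: Sequence[float],
                max_shift: int) -> Tuple[int, float]:
    """Find the frame shift that best lines the annotation up with the SVD output.

    A shift s pairs annotation[i] with svd_prob[i + s]. The score is the
    cross-correlation at that shift divided by the product of the two
    sequence norms, so a perfect match scores close to 1.
    """
    norm = (math.sqrt(sum(a * a for a in annotation))
            * math.sqrt(sum(p * p for p in svd_prob)))
    if norm == 0.0:
        return 0, 0.0
    best_shift, best_score = 0, float("-inf")
    for shift in range(-max_shift, max_shift + 1):
        dot = 0.0
        for i, a in enumerate(annotation):
            j = i + shift
            if a and 0 <= j < len(svd_prob):
                dot += a * svd_prob[j]
        score = dot / norm
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score


def shift_segments(segments: Sequence[Segment], shift: int,
                   frame_rate: float) -> List[Segment]:
    """Apply the estimated offset to every annotated segment."""
    delta = shift / frame_rate
    return [(start + delta, end + delta) for start, end in segments]
```

In such a sketch, a candidate audio track would be kept as a match only if its best score exceeds some chosen threshold, and the annotations would then be moved with `shift_segments`. A fuller version would presumably also need to handle tempo differences between the annotation and the audio, which this example ignores.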

