DeepSinger: Singing Voice Synthesis with Data Mined From the Web

07/09/2020
by Yi Ren, et al.

In this paper, we develop DeepSinger, a multi-lingual multi-singer singing voice synthesis (SVS) system built from scratch using singing training data mined from music websites. The DeepSinger pipeline consists of several steps: data crawling, singing and accompaniment separation, lyrics-to-singing alignment, data filtration, and singing modeling. Specifically, we design a lyrics-to-singing alignment model that automatically extracts the duration of each phoneme in the lyrics, working from the coarse-grained sentence level down to the fine-grained phoneme level. We further design a multi-lingual multi-singer singing model based on a feed-forward Transformer that generates linear spectrograms directly from lyrics, and we synthesize voices using Griffin-Lim. DeepSinger has several advantages over previous SVS systems: 1) to the best of our knowledge, it is the first SVS system that mines its training data directly from music websites, 2) the lyrics-to-singing alignment model removes the need for human alignment labeling and greatly reduces labeling cost, 3) the singing model based on a feed-forward Transformer is simple and efficient, because it removes the complicated acoustic feature modeling of parametric synthesis and uses a reference encoder to capture the timbre of a singer from noisy singing data, and 4) it can synthesize singing voices in multiple languages and for multiple singers. We evaluate DeepSinger on our mined singing dataset, which consists of about 92 hours of data from 89 singers in three languages (Chinese, Cantonese and English). The results demonstrate that, with singing data mined purely from the Web, DeepSinger can synthesize high-quality singing voices in terms of both pitch accuracy and voice naturalness (audio samples: https://speechresearch.github.io/deepsinger/).
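To make the notion of per-phoneme duration concrete, here is a minimal Python sketch of how a frame-level alignment can be collapsed into phoneme durations and expanded back again, as a duration-driven feed-forward model would need. It is an illustration only and not the authors' alignment model: the function names, the toy phoneme labels, and the `hop_length` and `sample_rate` defaults are all assumptions introduced for this example.

```python
from itertools import groupby
from typing import List, Tuple


def durations_from_frame_alignment(frame_phonemes: List[str]) -> List[Tuple[str, int]]:
    """Collapse a per-frame phoneme labelling into (phoneme, n_frames) pairs.

    Note: consecutive identical labels are merged into one run, so two
    adjacent occurrences of the same phoneme are counted as one.
    """
    return [(ph, sum(1 for _ in run)) for ph, run in groupby(frame_phonemes)]


def expand_by_duration(phoneme_durations: List[Tuple[str, int]]) -> List[str]:
    """Repeat each phoneme for its number of frames (inverse of the above)."""
    frames: List[str] = []
    for ph, n_frames in phoneme_durations:
        frames.extend([ph] * n_frames)
    return frames


def frames_to_seconds(n_frames: int, hop_length: int = 256, sample_rate: int = 22050) -> float:
    """Convert a frame count to seconds (hop_length and sample_rate are assumed values)."""
    return n_frames * hop_length / sample_rate


if __name__ == "__main__":
    # Hypothetical frame-level alignment for a short sung syllable sequence.
    alignment = ["n", "n", "i", "i", "i", "h", "ao", "ao"]
    durations = durations_from_frame_alignment(alignment)
    for phoneme, n in durations:
        print(f"{phoneme}: {n} frames, {frames_to_seconds(n):.4f} s")
```

In singing, a single phoneme (typically a vowel) can span many frames, which is why explicit durations matter more here than in ordinary speech synthesis.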


research
05/18/2023

RMSSinger: Realistic-Music-Score based Singing Voice Synthesis

We are interested in a challenging task, Realistic-Music-Score based Sin...
research
10/22/2019

Sequence-to-sequence Singing Synthesis Using the Feed-forward Transformer

We propose a sequence-to-sequence singing synthesizer, which avoids the ...
research
05/15/2020

JDI-T: Jointly trained Duration Informed Transformer for Text-To-Speech without Explicit Alignment

We propose Jointly trained Duration Informed Transformer (JDI-T), a feed...
research
04/08/2022

Karaoker: Alignment-free singing voice synthesis with speech training data

Existing singing voice synthesis models (SVS) are usually trained on sin...
research
03/21/2022

WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses

In this paper, we develop a new multi-singer Chinese neural singing voic...
research
08/07/2020

Peking Opera Synthesis via Duration Informed Attention Network

Peking Opera has been the most dominant form of Chinese performing art s...
research
03/04/2020

AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment

Targeting at both high efficiency and performance, we propose AlignTTS t...
