Singing-Tacotron: Global duration control attention and dynamic filter for End-to-end singing voice synthesis

02/16/2022
by PetsTime, et al.

End-to-end singing voice synthesis (SVS) is attractive because it avoids the need for pre-aligned data. However, the automatically learned alignment between the singing voice and the lyrics is difficult to match to the duration information in the musical score, which leads to model instability or even failure to synthesize voice. To learn accurate alignment information automatically, this paper proposes an end-to-end SVS framework named Singing-Tacotron. The main difference between the proposed framework and Tacotron is that the speech can be significantly controlled by the duration information in the musical score. First, we propose a global duration control attention mechanism for the SVS model, which can control the duration of each phoneme. Second, a duration encoder is proposed to learn a set of global transition tokens from the musical score. These transition tokens help the attention mechanism decide, at each decoding step, whether to move to the next phoneme or stay on the current one. Third, to further improve the model's stability, a dynamic filter is designed to help the model overcome noise interference and pay more attention to local context information. Subjective and objective evaluations verify the effectiveness of the method. Furthermore, the role of the global transition tokens and the effect of duration control are explored. Examples from the experiments can be found at https://hairuo55.github.io/SingingTacotron.
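
The "move to the next phoneme or stay" decision described in the abstract can be illustrated with a generic monotonic alignment recurrence. The sketch below is not the paper's attention mechanism: the function names and the per-phoneme `move_prob` values are hypothetical stand-ins for whatever the transition tokens would supply, and the recurrence is a plain stay-or-move update chosen only to show how such a probability shapes each phoneme's duration.

```python
def step_alignment(prev, move_prob):
    """One decoding step of a stay-or-move monotonic alignment.

    prev[n]      : probability of attending to phoneme n at the previous step
    move_prob[n] : assumed probability of leaving phoneme n at this step
                   (hypothetical; stands in for a transition-token output)
    """
    n_phonemes = len(prev)
    nxt = [0.0] * n_phonemes
    for n in range(n_phonemes):
        is_last = n == n_phonemes - 1
        # Mass that stays on phoneme n; the last phoneme has nowhere to go.
        stay = prev[n] * (1.0 if is_last else 1.0 - move_prob[n])
        # Mass that arrives from the previous phoneme.
        arrive = prev[n - 1] * move_prob[n - 1] if n > 0 else 0.0
        nxt[n] = stay + arrive
    return nxt


def expected_durations(move_prob, n_steps):
    """Expected number of decoder steps spent on each phoneme."""
    # Decoding starts with all attention on the first phoneme.
    align = [1.0] + [0.0] * (len(move_prob) - 1)
    totals = [0.0] * len(move_prob)
    for _ in range(n_steps):
        totals = [t + a for t, a in zip(totals, align)]
        align = step_alignment(align, move_prob)
    return totals


if __name__ == "__main__":
    # A lower move probability keeps attention on a phoneme for longer.
    print(expected_durations([0.25, 0.5, 0.5], n_steps=50))
```

In this toy setting a smaller `move_prob` for a phoneme lengthens its expected duration, which is the kind of control the abstract attributes to the transition tokens; how the paper actually computes and applies them is not specified here.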

research
12/28/2022

Singing Voice Synthesis Based on a Musical Note Position-Aware Attention Mechanism

This paper proposes a novel sequence-to-sequence (seq2seq) model with a ...
research
05/18/2020

MoBoAligner: a Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search

To speed up the inference of neural speech synthesis, non-autoregressive...
research
04/23/2020

ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders

This paper presents ByteSing, a Chinese singing voice synthesis (SVS) sy...
research
06/11/2020

XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System

This paper presents XiaoiceSing, a high-quality singing voice synthesis ...
research
10/09/2021

PAMA-TTS: Progression-Aware Monotonic Attention for Stable Seq2Seq TTS With Accurate Phoneme Duration Control

Sequence expansion between encoder and decoder is a critical challenge i...
research
01/05/2023

Singing voice synthesis based on frame-level sequence-to-sequence models considering vocal timing deviation

This paper proposes singing voice synthesis (SVS) based on frame-level s...
research
05/24/2022

SUSing: SU-net for Singing Voice Synthesis

Singing voice synthesis is a generative task that involves multi-dimensi...
