XiaoiceSing 2: A High-Fidelity Singing Voice Synthesizer Based on Generative Adversarial Network

10/26/2022
by Chunhui Wang, et al.

XiaoiceSing is a singing voice synthesis (SVS) system that aims to generate 48kHz singing voices. However, the mel-spectrogram it generates is over-smoothed in the middle- and high-frequency regions because the model has no dedicated design for the details of those regions. In this paper, we propose XiaoiceSing2, which generates the details of the middle- and high-frequency parts to better construct the full-band mel-spectrogram. Specifically, XiaoiceSing2 adopts a generative adversarial network (GAN) consisting of a FastSpeech-based generator and a multi-band discriminator. We improve the feed-forward Transformer (FFT) block by adding multiple residual convolutional blocks in parallel with the self-attention block to balance local and global features. The multi-band discriminator contains three sub-discriminators responsible for the low-, middle-, and high-frequency parts of the mel-spectrogram, respectively. Each sub-discriminator is composed of several segment discriminators (SD) and detail discriminators (DD) that distinguish the audio from different aspects. Experiments on our internal 48kHz singing voice dataset show that XiaoiceSing2 significantly improves the quality of the singing voice over XiaoiceSing.
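To make the multi-band idea concrete, here is a minimal sketch of how a mel-spectrogram might be split into low-, middle-, and high-frequency parts before each part is handed to its own sub-discriminator. The abstract does not say how the band boundaries are chosen, so the contiguous, near-equal split below is an assumption, and the names `band_edges` and `split_bands` are hypothetical helpers, not part of the paper.

```python
from typing import List, Tuple

# A mel-spectrogram as nested lists: mel[frame][mel_bin]
Mel = List[List[float]]


def band_edges(n_mels: int, n_bands: int = 3) -> List[Tuple[int, int]]:
    """Return (start, stop) bin indices for contiguous, near-equal bands.

    Assumption: equal-width, non-overlapping bands. The paper may use
    different boundaries or overlapping bands.
    """
    if n_bands <= 0 or n_mels < n_bands:
        raise ValueError("need at least one mel bin per band")
    base, extra = divmod(n_mels, n_bands)
    edges, start = [], 0
    for i in range(n_bands):
        # The first `extra` bands absorb one leftover bin each.
        stop = start + base + (1 if i < extra else 0)
        edges.append((start, stop))
        start = stop
    return edges


def split_bands(mel: Mel, n_bands: int = 3) -> List[Mel]:
    """Slice every frame into per-band sub-spectrograms (low to high)."""
    if not mel:
        return [[] for _ in range(n_bands)]
    edges = band_edges(len(mel[0]), n_bands)
    return [[frame[a:b] for frame in mel] for a, b in edges]
```

With a hypothetical 80-bin mel-spectrogram, `split_bands(mel)` returns three sub-spectrograms covering bins 0-26, 27-53, and 54-79; in a GAN setup each would be scored by a separate sub-discriminator.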

research
10/14/2021

SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation

High-fidelity singing voice synthesis is challenging for neural vocoders...
research
10/23/2022

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

Entertainment-oriented singing voice synthesis (SVS) requires a vocoder ...
research
06/27/2022

Avocodo: Generative Adversarial Network for Artifact-free Vocoder

Neural vocoders based on the generative adversarial neural network (GAN)...
research
09/03/2020

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

High-fidelity singing voices usually require higher sampling rate (e.g.,...
research
03/02/2022

PUFA-GAN: A Frequency-Aware Generative Adversarial Network for 3D Point Cloud Upsampling

We propose a generative adversarial network for point cloud upsampling, ...
research
06/15/2021

UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Most neural vocoders employ band-limited mel-spectrograms to generate wa...
research
06/24/2021

GAN-MDF: A Method for Multi-fidelity Data Fusion in Digital Twins

The Internet of Things (IoT) collects real-time data of physical systems...