APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra

05/13/2023
by Yang Ai, et al.

This paper presents APNet, a novel neural vocoder that reconstructs speech waveforms from acoustic features by directly predicting amplitude and phase spectra. The APNet vocoder is composed of an amplitude spectrum predictor (ASP) and a phase spectrum predictor (PSP). The ASP is a residual convolution network that predicts frame-level log amplitude spectra from acoustic features. The PSP also adopts a residual convolution network with acoustic features as input; its output is passed through two parallel linear convolution layers, whose outputs are then integrated by a phase calculation formula to estimate frame-level phase spectra. Finally, the outputs of the ASP and PSP are combined to reconstruct speech waveforms via inverse short-time Fourier transform (ISTFT). All operations of the ASP and PSP are performed at the frame level. We train the ASP and PSP jointly and define multilevel loss functions based on amplitude mean square error, phase anti-wrapping error, short-time spectral inconsistency error, and time-domain reconstruction error. Experimental results show that the proposed APNet vocoder achieves approximately 8x faster inference than HiFi-GAN v1 on a CPU thanks to its all-frame-level operations, while its synthesized speech quality is comparable to that of HiFi-GAN v1 and better than that of several equally efficient models. Ablation experiments also confirm that the proposed parallel phase estimation architecture is essential for phase modeling and that the proposed loss functions help improve synthesized speech quality.
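The parallel phase estimation idea described above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the dimensions, weights, and variable names are hypothetical, and the two learned linear convolution layers are stood in by random linear projections. The key point is that the two parallel outputs are treated as the real and imaginary parts of a phase vector, so an arctan2-style formula yields a phase in (-pi, pi] for every time-frequency bin.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames of D-dim hidden features from the PSP's
# residual convolution backbone, projected to F frequency bins.
T, D, F = 4, 8, 5

hidden = rng.standard_normal((T, D))   # stand-in for PSP backbone output
W_real = rng.standard_normal((D, F))   # stand-in for 1st parallel linear layer
W_imag = rng.standard_normal((D, F))   # stand-in for 2nd parallel linear layer

# Two parallel linear (1x1-convolution-like) projections of the same features.
pseudo_real = hidden @ W_real
pseudo_imag = hidden @ W_imag

# Phase calculation formula: arctan2 recovers an angle in (-pi, pi] per bin,
# resolving the quadrant ambiguity of a plain arctan(imag / real).
phase = np.arctan2(pseudo_imag, pseudo_real)

# Combining with a (stand-in) frame-level log amplitude spectrum from the
# ASP gives the complex spectrum that would be passed to the ISTFT.
log_amp = rng.standard_normal((T, F))
spectrum = np.exp(log_amp) * np.exp(1j * phase)
```

Because the phase is computed per bin from two unconstrained network outputs, the predictor never has to regress a wrapped phase value directly, which is what makes the architecture compatible with the anti-wrapping phase losses mentioned above.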


research
06/23/2019

A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis

This paper presents a neural vocoder named HiNet which reconstructs spee...
research
11/29/2022

Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses

This paper presents a novel speech phase prediction model which predicts...
research
08/17/2023

Long-frame-shift Neural Speech Phase Prediction with Spectral Continuity Enhancement and Interpolation Error Compensation

Speech phase prediction, which is a significant research focus in the fi...
research
04/16/2020

Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders

In our previous work, we have proposed a neural vocoder called HiNet whi...
research
11/08/2020

Denoising-and-Dereverberation Hierarchical Neural Vocoder for Robust Waveform Generation

This paper presents a denoising and dereverberation hierarchical neural ...
research
08/17/2023

Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement

Phase information has a significant impact on speech perceptual quality ...
research
01/26/2019

End-to-End Multi-Task Denoising for joint SDR and PESQ Optimization

Supervised learning based on a deep neural network recently has achieved...
