Speaker-independent raw waveform model for glottal excitation

04/25/2018
by Lauri Juvela, et al.

Recent speech technology research has seen growing interest in using WaveNets as statistical vocoders, i.e., for generating speech waveforms from acoustic features. These models have been shown to improve generated speech quality over classical vocoders in many tasks, such as text-to-speech synthesis and voice conversion. Furthermore, conditioning WaveNets on acoustic features allows a single waveform generator to be shared across multiple speakers without additional speaker codes. However, multi-speaker WaveNet models require large amounts of training data and computation to cover the entire acoustic space. This paper proposes leveraging the source-filter model of speech production to train a speaker-independent waveform generator more effectively with limited resources. We present a multi-speaker 'GlotNet' vocoder, in which a WaveNet generates glottal excitation waveforms that are then used to excite the corresponding vocal tract filter to produce speech. Listening tests show that the proposed model compares favourably with a direct WaveNet vocoder trained with the same model architecture and data.
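
The source-filter idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. A toy pulse train stands in for the WaveNet-generated glottal excitation, and a single hypothetical resonance stands in for the vocal tract filter. The function names, the 16 kHz sample rate, the 100 Hz pitch, and the filter coefficients are all assumptions made for illustration.

```python
import math

def all_pole_filter(excitation, a):
    """Synthesis: apply 1/A(z), with A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p."""
    out = []
    for n, e in enumerate(excitation):
        y = e
        for k in range(1, len(a)):
            if n - k >= 0:
                y -= a[k] * out[n - k]
        out.append(y)
    return out

def inverse_filter(speech, a):
    """Analysis: apply A(z) to speech to recover the excitation signal."""
    residual = []
    for n in range(len(speech)):
        acc = 0.0
        for k in range(len(a)):
            if n - k >= 0:
                acc += a[k] * speech[n - k]
        residual.append(acc)
    return residual

# Assumed toy values, not taken from the paper.
fs = 16000                 # sample rate in Hz
f0 = 100                   # pitch in Hz
period = fs // f0          # samples per pitch period

# Pulse train as a placeholder for a generated glottal excitation.
excitation = [1.0 if n % period == 0 else 0.0 for n in range(800)]

# Hypothetical vocal tract: one stable resonance near 500 Hz (pole radius 0.97).
r, theta = 0.97, 2 * math.pi * 500 / fs
a = [1.0, -2 * r * math.cos(theta), r * r]

speech = all_pole_filter(excitation, a)    # excitation -> speech
recovered = inverse_filter(speech, a)      # speech -> excitation

max_error = max(abs(x - y) for x, y in zip(recovered, excitation))
print(f"max reconstruction error: {max_error:.2e}")
```

In this sketch the waveform model would only need to produce `excitation`, since the filter supplies the resonant structure. That split is one way to read the abstract's claim that the source-filter decomposition makes training with limited resources more effective.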

research
06/03/2019

Problem-Agnostic Speech Embeddings for Multi-Speaker Text-to-Speech with SampleRNN

Text-to-speech (TTS) acoustic models map linguistic features into an aco...
research
06/29/2021

FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis

Methods for modeling and controlling prosody with acoustic features have...
research
04/07/2022

Correcting Misproducted Speech using Spectrogram Inpainting

Learning a new language involves constantly comparing speech productions...
research
08/05/2023

A Systematic Exploration of Joint-training for Singing Voice Synthesis

There has been a growing interest in using end-to-end acoustic models fo...
research
04/08/2022

Karaoker: Alignment-free singing voice synthesis with speech training data

Existing singing voice synthesis models (SVS) are usually trained on sin...
research
11/10/2020

Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis

We explore pretraining strategies including choice of base corpus with t...
research
06/16/2021

Improving the expressiveness of neural vocoding with non-affine Normalizing Flows

This paper proposes a general enhancement to the Normalizing Flows (NF) ...
