A Deep-Bayesian Framework for Adaptive Speech Duration Modification

07/11/2021
by Ravi Shankar, et al.

We propose the first method to adaptively modify the duration of a given speech signal. Our approach uses a Bayesian framework to define a latent attention map that links frames of the input and target utterances. We train a masked convolutional encoder-decoder network to produce this attention map via a stochastic version of the mean absolute error loss function; our model also predicts the length of the target speech signal using the encoder embeddings. The predicted length determines the number of steps for the decoder operation. During inference, we generate the attention map as a proxy for the similarity matrix between the given input speech and an unknown target speech signal. Using this similarity matrix, we compute a warping path that aligns the two signals. Our experiments demonstrate that, on both voice conversion and emotion conversion tasks, this adaptive framework produces results similar to dynamic time warping, which requires a known target signal. We also show that our technique yields generated speech whose quality is on par with that of state-of-the-art vocoders.
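The final inference step, computing a warping path from a similarity matrix, follows the classic dynamic time warping recipe. The sketch below is a generic illustration of that step, not the authors' implementation: it takes any frame-by-frame similarity matrix (in the paper's setting this would be the predicted attention map) and recovers a monotonic alignment path by dynamic programming.

```python
import numpy as np

def warping_path(similarity):
    """Recover a monotonic alignment between two sequences from a
    (T_in, T_out) similarity matrix, where larger values mean more
    similar frames, via standard dynamic time warping."""
    # DTW minimizes accumulated cost, so turn similarity into a
    # non-negative cost (0 for the most similar frame pair).
    cost = similarity.max() - similarity
    n, m = cost.shape

    # Accumulated-cost table: acc[i, j] is the cheapest path cost
    # from (0, 0) to (i, j) using diagonal, up, or left steps.
    acc = np.full((n, m), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            prev = min(
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,  # match
                acc[i - 1, j] if i > 0 else np.inf,                # insertion
                acc[i, j - 1] if j > 0 else np.inf,                # deletion
            )
            acc[i, j] = cost[i, j] + prev

    # Backtrack from the end to recover the optimal path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1, j - 1], (i - 1, j - 1)))
        if i > 0:
            candidates.append((acc[i - 1, j], (i - 1, j)))
        if j > 0:
            candidates.append((acc[i, j - 1], (i, j - 1)))
        _, (i, j) = min(candidates, key=lambda t: t[0])
        path.append((i, j))
    return path[::-1]

# Toy example: two 4-frame "signals" whose frames match on the diagonal.
sim = np.eye(4) + 0.1
path = warping_path(sim)  # -> [(0, 0), (1, 1), (2, 2), (3, 3)]
```

On this toy matrix the recovered path is the main diagonal, as expected when corresponding frames are the most similar; with a real predicted attention map the path would stretch or compress regions of the input to match the predicted target duration.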


Related research

03/29/2019: Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet
We investigated the training of a shared model for both text-to-speech (...

04/09/2019: Crossmodal Voice Conversion
Humans are able to imagine a person's voice from the person's appearance...

01/26/2022: Noise-robust voice conversion with domain adversarial training
Voice conversion has made great progress in the past few years under the...

08/19/2019: Salient Speech Representations Based on Cloned Networks
We define salient features as features that are shared by signals that a...

10/30/2018: Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments
Dynamic time warping (DTW) can be used to compute the similarity between...

07/25/2020: Multi-speaker Emotion Conversion via Latent Variable Regularization and a Chained Encoder-Decoder-Predictor Network
We propose a novel method for emotion conversion in speech based on a ch...

02/16/2020: Speech-to-Singing Conversion in an Encoder-Decoder Framework
In this paper our goal is to convert a set of spoken lines into sung one...
