FluentEditor: Text-based Speech Editing by Considering Acoustic and Prosody Consistency

09/21/2023
by Rui Liu, et al.

Text-based speech editing (TSE) techniques let users edit output audio by modifying the input text transcript instead of the audio itself. Despite much progress in neural network-based TSE, current techniques focus on reducing the difference between the generated speech segment and the reference target in the editing region, while ignoring its local and global fluency with respect to the surrounding context and the original utterance. To maintain speech fluency, we propose a fluent speech editing model, termed FluentEditor, that adds fluency-aware training criteria to TSE training. Specifically, the acoustic consistency constraint aims to make the transition between the edited region and its neighboring acoustic segments smooth and consistent with the ground truth, while the prosody consistency constraint seeks to ensure that the prosody attributes within the edited regions remain consistent with the overall style of the original utterance. Subjective and objective experimental results on VCTK demonstrate that FluentEditor outperforms all advanced baselines in terms of naturalness and fluency. The audio samples and code are available at <https://github.com/Ai-S2-Lab/FluentEditor>.
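
To make the two constraints in the abstract concrete, here is a minimal sketch of how such loss terms could be written. It is an illustration under assumptions, not the authors' formulation. The function names, the use of mean squared error, the frame-difference definition of a "transition", the fixed-size prosody vectors, and the weights `w_ac` and `w_pc` are all hypothetical choices made for this example.

```python
import numpy as np


def boundary_delta(mel, idx):
    """Change between frame idx-1 and frame idx of a (frames, n_mels) array."""
    return mel[idx] - mel[idx - 1]


def acoustic_consistency_loss(pred_mel, true_mel, start, end):
    """Assumed form: MSE between predicted and ground-truth transitions
    at the two boundaries of the edited region, frames [start, end).
    Requires 1 <= start < end < number of frames."""
    left = boundary_delta(pred_mel, start) - boundary_delta(true_mel, start)
    right = boundary_delta(pred_mel, end) - boundary_delta(true_mel, end)
    return float(np.mean(left ** 2) + np.mean(right ** 2))


def prosody_consistency_loss(region_prosody, utterance_prosody):
    """Assumed form: MSE between a prosody vector summarizing the edited
    region and one summarizing the original utterance."""
    diff = np.asarray(region_prosody) - np.asarray(utterance_prosody)
    return float(np.mean(diff ** 2))


def fluency_aware_loss(recon_loss, l_ac, l_pc, w_ac=1.0, w_pc=1.0):
    """Reconstruction loss plus the two weighted consistency terms
    (the weights are hypothetical hyperparameters)."""
    return recon_loss + w_ac * l_ac + w_pc * l_pc


# Toy usage: a 6-frame, 2-bin "mel" where only the edited frames differ.
true_mel = np.zeros((6, 2))
pred_mel = true_mel.copy()
pred_mel[2:4] = 1.0  # edited region is frames 2 and 3
l_ac = acoustic_consistency_loss(pred_mel, true_mel, start=2, end=4)
l_pc = prosody_consistency_loss([1.0, 2.0], [1.0, 4.0])
total = fluency_aware_loss(0.5, l_ac, l_pc, w_ac=0.5, w_pc=0.5)
```

In this toy case the predicted region jumps away from its neighbors at both boundaries while the ground truth does not, so the acoustic term is nonzero. A prediction whose boundary transitions matched the ground truth would give zero for that term.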


research
09/21/2023

Emotion-Aware Prosodic Phrasing for Expressive Text-to-Speech

Prosodic phrasing is crucial to the naturalness and intelligibility of e...
research
07/04/2021

EditSpeech: A Text Based Speech Editing System Using Partial Inference and Bidirectional Fusion

This paper presents the design, implementation and evaluation of a speec...
research
05/23/2023

FluentSpeech: Stutter-Oriented Automatic Speech Editing with Context-Aware Diffusion Models

Stutter removal is an essential scenario in the field of speech editing....
research
10/06/2021

EdiTTS: Score-based Editing for Controllable Text-to-Speech

We present EdiTTS, an off-the-shelf speech editing methodology based on ...
research
06/13/2023

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding

The utilization of discrete speech tokens, divided into semantic tokens ...
research
11/13/2022

OverFlow: Putting flows on top of neural transducers for better TTS

Neural HMMs are a type of neural transducer recently proposed for sequen...
research
07/30/2021

Graph-PIT: Generalized permutation invariant training for continuous separation of arbitrary numbers of speakers

Automatic transcription of meetings requires handling of overlapped spee...
