Machine Speech Chain with One-shot Speaker Adaptation

03/28/2018
by Andros Tjandra, et al.

In previous work, we developed a closed-loop speech chain model based on deep learning, whose architecture enables the automatic speech recognition (ASR) and text-to-speech synthesis (TTS) components to mutually improve their performance by teaching each other with both labeled and unlabeled data. This approach significantly improved performance on a single-speaker speech dataset, but yielded only a slight gain on multi-speaker tasks, and the model still cannot handle unseen speakers. In this paper, we present a new speech chain mechanism that integrates a speaker recognition model inside the loop. We also extend the TTS component to handle unseen speakers through one-shot speaker adaptation: TTS can mimic the voice characteristics of a target speaker given only a single speaker sample, even when synthesizing text that carries no speaker information. Within the speech chain loop, ASR in turn learns the characteristics of arbitrary speakers from the generated speech, resulting in a significant improvement in the recognition rate.
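To make the closed loop concrete, the sketch below shows one way the two unsupervised updates could be wired together in PyTorch: unlabeled speech is transcribed by ASR and reconstructed by TTS conditioned on a one-shot speaker embedding of the same utterance, while unlabeled text is synthesized by TTS for a one-shot speaker sample and transcribed back by ASR. The toy modules (ToyASR, ToyTTS, SpeakerEncoder), dimensions, and losses are illustrative assumptions only; the actual ASR, TTS, and speaker recognition components in the paper are full neural models, and this sketch only shows how the two losses close the loop.

```python
# Minimal sketch of the speech-chain loop with a speaker encoder.
# ToyASR / ToyTTS / SpeakerEncoder are placeholder modules, not the paper's models.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, VOCAB, EMB_DIM = 80, 32, 16           # mel features, token vocabulary, speaker embedding

class ToyASR(nn.Module):                        # speech features -> token logits
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, VOCAB)
    def forward(self, speech):
        return self.proj(speech)

class SpeakerEncoder(nn.Module):                # one-shot speaker sample -> fixed-size embedding
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM, EMB_DIM)
    def forward(self, speech):
        return self.proj(speech).mean(dim=1)    # average over time

class ToyTTS(nn.Module):                        # tokens + speaker embedding -> speech features
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, FEAT_DIM)
        self.proj = nn.Linear(FEAT_DIM + EMB_DIM, FEAT_DIM)
    def forward(self, tokens, spk_emb):
        x = self.embed(tokens)
        spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.proj(torch.cat([x, spk], dim=-1))

asr, tts, spk_enc = ToyASR(), ToyTTS(), SpeakerEncoder()
params = list(asr.parameters()) + list(tts.parameters()) + list(spk_enc.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
opt.zero_grad()

# Unlabeled speech: ASR produces a pseudo-transcription, and TTS reconstructs
# the speech conditioned on the speaker embedding of that same utterance.
speech = torch.randn(4, 50, FEAT_DIM)
pseudo_text = asr(speech).argmax(dim=-1)        # no gradient flows through argmax
recon = tts(pseudo_text, spk_enc(speech))
loss_tts = F.l1_loss(recon, speech)

# Unlabeled text: TTS synthesizes speech for a one-shot speaker sample, and
# ASR is trained to transcribe the generated speech back to the original text.
text = torch.randint(0, VOCAB, (4, 20))
one_shot = torch.randn(4, 50, FEAT_DIM)         # a single utterance of the target speaker
generated = tts(text, spk_enc(one_shot)).detach()   # stop gradient: only ASR learns here
loss_asr = F.cross_entropy(asr(generated).transpose(1, 2), text)

(loss_tts + loss_asr).backward()
opt.step()
```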


