Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

by Hongqiang Du, et al.

One-shot voice conversion has received significant attention because it requires only one utterance each from the source and target speakers. Moreover, neither the source nor the target speaker needs to be seen during training. However, existing one-shot voice conversion approaches are not stable for unseen speakers, as a speaker embedding extracted from a single utterance of an unseen speaker is unreliable. In this paper, we propose a deep discriminative speaker encoder that extracts a speaker embedding from one utterance more effectively. Specifically, the speaker encoder first combines a residual network with a squeeze-and-excitation network to extract discriminative speaker information at the frame level by modeling frame-wise and channel-wise interdependence in the features. An attention mechanism is then introduced to further emphasize speaker-related information by assigning different weights to the frame-level speaker information. Finally, a statistics pooling layer aggregates the weighted frame-level speaker information into an utterance-level speaker embedding. Experimental results demonstrate that the proposed speaker encoder improves the robustness of one-shot voice conversion for unseen speakers and outperforms baseline systems in terms of speech quality and speaker similarity.
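The abstract does not give layer sizes or the exact attention formulation, but the final two stages it describes (attention weighting over frame-level features, then statistics pooling into a fixed-size embedding) can be sketched as follows. This is a minimal NumPy illustration with assumed dimensions and a toy one-layer attention MLP, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_stats_pooling(frames, w, b, v):
    """Aggregate frame-level features (T, D) into an utterance-level
    embedding (2*D,) via an attention-weighted mean and standard deviation.
    w, b, v parameterize a small (hypothetical) attention MLP."""
    scores = np.tanh(frames @ w + b) @ v        # one scalar score per frame, shape (T,)
    alpha = softmax(scores)                     # attention weights, sum to 1
    mean = (alpha[:, None] * frames).sum(axis=0)
    var = (alpha[:, None] * (frames - mean) ** 2).sum(axis=0)
    std = np.sqrt(np.maximum(var, 1e-8))        # clamp for numerical stability
    return np.concatenate([mean, std])          # weighted mean + std, shape (2*D,)

# toy example: 50 frames of 16-dim frame-level speaker features
rng = np.random.default_rng(0)
T, D, H = 50, 16, 8
frames = rng.standard_normal((T, D))
w = rng.standard_normal((D, H)) * 0.1
b = np.zeros(H)
v = rng.standard_normal(H) * 0.1
emb = attentive_stats_pooling(frames, w, b, v)
print(emb.shape)  # (32,) -- a fixed-size utterance-level embedding
```

In the paper's pipeline, `frames` would come from the SE-ResNet front end rather than random noise; the point of the sketch is that attention reweights frames before pooling, so uninformative frames contribute less to the utterance-level statistics.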







GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus

Non-parallel many-to-many voice conversion is recently attracting huge ...

FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention

Any-to-any voice conversion aims to convert the voice from and to any sp...

Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE

Variational auto-encoder (VAE) is an effective neural network architectur...

AvaTr: One-Shot Speaker Extraction with Transformers

To extract the voice of a target speaker when mixed with a variety of ot...

Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding

On account of growing demands for personalization, the need for a so-cal...

Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning

One-shot voice cloning aims to transform speaker voice and speaking styl...

Weakly Supervised Training of Hierarchical Attention Networks for Speaker Identification

Identifying multiple speakers without knowing where a speaker's voice is...