ConVoice: Real-Time Zero-Shot Voice Style Transfer with Convolutional Network

05/15/2020
by Yurii Rebryk, et al.

We propose a neural network for zero-shot voice conversion (VC) that requires no parallel or transcribed data. Our approach uses pre-trained models for automatic speech recognition (ASR) and for speaker embedding, the latter obtained from a speaker verification task. The model is fully convolutional and non-autoregressive, except for a small pre-trained recurrent neural network used for speaker encoding. Thanks to its convolutional architecture, ConVoice can convert speech of any length without compromising quality. Our model matches the quality of similar state-of-the-art models while being extremely fast.
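The abstract describes the overall pipeline: a pre-trained ASR model supplies frame-level linguistic features for the source utterance, a small pre-trained recurrent speaker encoder (from speaker verification) supplies a target-speaker embedding, and a fully convolutional, non-autoregressive network maps the two to speech in the target voice. The PyTorch sketch below illustrates that arrangement only; every module name, layer size, and the simple concatenative conditioning scheme are assumptions made for illustration, not the authors' implementation, and the pre-trained ASR and speaker-verification models are stood in for by toy networks.

# Minimal architectural sketch of the idea described above (PyTorch).
# All names, dimensions, and the conditioning scheme are illustrative
# assumptions, not the published ConVoice architecture.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Small recurrent speaker encoder (stand-in for a pre-trained
    speaker-verification model); returns one embedding per utterance."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mels):                      # (B, T, n_mels)
        _, h = self.rnn(mels)                     # final hidden state
        return nn.functional.normalize(h[-1], dim=-1)   # (B, emb_dim)


class ConvBlock(nn.Module):
    """1-D convolutional block with a residual connection, used by the
    non-autoregressive converter."""
    def __init__(self, channels, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):                         # (B, C, T)
        return torch.relu(self.norm(self.conv(x))) + x


class ConvoiceLikeConverter(nn.Module):
    """Fully convolutional converter: maps frame-level ASR features of the
    source speech, conditioned on a target-speaker embedding, to a
    mel-spectrogram in the target voice."""
    def __init__(self, asr_dim=512, emb_dim=256, channels=256, n_mels=80):
        super().__init__()
        self.proj_in = nn.Conv1d(asr_dim + emb_dim, channels, 1)
        self.blocks = nn.Sequential(*[ConvBlock(channels) for _ in range(6)])
        self.proj_out = nn.Conv1d(channels, n_mels, 1)

    def forward(self, asr_feats, spk_emb):
        # asr_feats: (B, T, asr_dim) features from a pre-trained ASR encoder;
        # spk_emb:   (B, emb_dim) target-speaker vector, broadcast over time.
        spk = spk_emb.unsqueeze(1).expand(-1, asr_feats.size(1), -1)
        x = torch.cat([asr_feats, spk], dim=-1).transpose(1, 2)   # (B, C, T)
        return self.proj_out(self.blocks(self.proj_in(x)))        # (B, n_mels, T)


if __name__ == "__main__":
    # Toy shapes only: 2 utterances, 200 frames of source ASR features,
    # 150 frames of target-speaker reference mels.
    asr_feats = torch.randn(2, 200, 512)
    ref_mels = torch.randn(2, 150, 80)
    spk_emb = SpeakerEncoder()(ref_mels)
    mels = ConvoiceLikeConverter()(asr_feats, spk_emb)
    print(mels.shape)   # torch.Size([2, 80, 200]); a vocoder would render audio

Because every layer in the converter is convolutional and frame-parallel, there is no per-frame recurrence over the output, which is what lets inputs of arbitrary length be converted in a single pass, matching the abstract's claim about speech of any length.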

Related research

- 03/17/2021: Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning
- 04/18/2019: TTS Skins: Speaker Conversion via ASR
- 05/19/2022: End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions
- 05/31/2023: Zero-Shot Automatic Pronunciation Assessment
- 07/30/2023: HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer
- 12/22/2017: On Using Backpropagation for Speech Texture Generation and Voice Conversion
- 08/28/2023: Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech
