VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion

02/18/2022
by   Disong Wang, et al.

Though significant progress has been made in speaker-dependent Video-to-Speech (VTS) synthesis, little attention has been devoted to multi-speaker VTS, which can map silent video to speech while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC): vector quantization with contrastive predictive coding (VQCPC) is used in the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network that infers the index sequence of acoustic units from video. The Lip2Ind network can then substitute for the content encoder of VC, forming a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations that effectively control the speaker identity of the generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained-vocabulary and open-vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility and speaker similarity. Our demo page is released here: https://wendison.github.io/VCVTS-demo/
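To illustrate the core idea behind the knowledge transfer, the sketch below shows (in NumPy, with entirely hypothetical names and sizes — this is not the authors' code) how a VQ codebook maps continuous content features to discrete unit indices, and how a Lip2Ind predictor could slot in by emitting the same indices directly from silent video:

```python
import numpy as np

# Illustrative sketch (assumed names/sizes, not the authors' implementation):
# a VQ codebook quantizes content-encoder features into discrete
# phoneme-like acoustic-unit indices; a trained Lip2Ind network would
# predict those same indices from silent video instead of audio.

rng = np.random.default_rng(0)

CODEBOOK_SIZE, DIM = 8, 4
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))  # stands in for a VQCPC-learned codebook

def quantize(features):
    """Map each frame's feature vector to its nearest codebook index."""
    # distances: (T, K) squared Euclidean distance to every code
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)  # discrete acoustic-unit indices, shape (T,)

def lookup(indices):
    """Recover quantized content embeddings from unit indices."""
    return codebook[indices]

# VC path: audio -> content encoder -> features -> indices
audio_features = rng.normal(size=(10, DIM))  # stand-in content-encoder output
indices = quantize(audio_features)

# VTS path: Lip2Ind would predict `indices` from silent video; the
# decoder then consumes lookup(indices) together with a speaker
# embedding to synthesize speech in the chosen voice.
content_embeddings = lookup(indices)
```

Because both paths bottleneck through the same discrete index sequence, the decoder and speaker encoder trained for VC can be reused unchanged for VTS — which is the cross-modal transfer the abstract describes.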

