Speaker disentanglement in video-to-speech conversion

05/20/2021
by Dan Oneata, et al.

The task of video-to-speech synthesis is to translate a silent video of lip movements into the corresponding audio signal. Previous approaches to this task are generally limited to the case of a single speaker, but a method that accounts for multiple speakers is desirable, as it makes it possible to i) leverage datasets with multiple speakers or few samples per speaker, and ii) control the speaker identity at inference time. In this paper, we introduce a new video-to-speech architecture and explore ways of extending it to the multi-speaker scenario: we augment the network with an additional speaker-related input, through which we feed either a discrete speaker identity or a speaker embedding. Interestingly, we observe that the visual encoder of the network is capable of learning the speaker identity from the lip region of the face alone. To better disentangle the two inputs (linguistic content and speaker identity), we add adversarial losses that dispel the identity from the video embeddings. To the best of our knowledge, the proposed method is the first to offer, beyond the state of the art, i) control of the target voice and ii) speech synthesis for unseen identities, while still maintaining the intelligibility of the spoken output.
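The adversarial disentanglement described in the abstract can be illustrated with a short sketch. Below is a minimal PyTorch-style example (a hypothetical illustration, not the authors' implementation) of a gradient-reversal adversary that tries to predict the speaker identity from the video embeddings; reversing its gradients pushes the visual encoder to discard identity cues, while the decoder receives identity information only through a separate speaker input. The names (`GradReverse`, `SpeakerAdversary`) and the dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient for x, no gradient for lambd.
        return -ctx.lambd * grad_output, None


class SpeakerAdversary(nn.Module):
    """Predicts the speaker identity from video embeddings through a gradient
    reversal layer, so that minimizing its classification loss encourages the
    visual encoder to drop speaker information (hypothetical component;
    embedding size and number of speakers are assumed)."""

    def __init__(self, embed_dim=512, num_speakers=33, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_speakers),
        )

    def forward(self, video_embedding):
        reversed_emb = GradReverse.apply(video_embedding, self.lambd)
        return self.classifier(reversed_emb)


# Sketch of how such an adversarial term could enter the total loss
# (visual_encoder, speaker_encoder, decoder are assumed components):
#   video_emb   = visual_encoder(lip_frames)      # intended to carry linguistic content
#   speaker_emb = speaker_encoder(speaker_id)     # identity conditioning for the decoder
#   mel_pred    = decoder(video_emb, speaker_emb)
#   loss = reconstruction_loss(mel_pred, mel_target) \
#        + cross_entropy(adversary(video_emb), speaker_id)
```

Because the gradient reversal layer negates the adversary's gradient before it reaches the visual encoder, the same backward pass that trains the speaker classifier also removes identity information from the video embeddings.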
