Visual-Aware Text-to-Speech

06/21/2023
by Mohan Zhou, et al.

Dynamically synthesizing speech that actively responds to a listening head is critical in face-to-face interaction. For example, a speaker can use the listener's facial expression to adjust tone, stressed syllables, or pauses. In this work, we present a new visual-aware text-to-speech (VA-TTS) task: synthesizing speech conditioned on both textual input and the listener's sequential visual feedback (e.g., a nod or a smile) in face-to-face communication. Unlike traditional text-to-speech, VA-TTS highlights the impact of the visual modality. For this new task, we devise a baseline model that fuses phoneme linguistic information with listener visual signals for speech synthesis. Extensive experiments on the multimodal conversation dataset ViCo-X show that our approach generates more natural audio with scenario-appropriate rhythm and prosody.

research
08/09/2021

AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person

Automatically generating videos in which synthesized speech is synchroni...
research
05/22/2016

Textual Paralanguage and its Implications for Marketing Communications

Both face-to-face communication and communication in online environments...
research
03/26/2021

DBATES: DataBase of Audio features, Text, and visual Expressions in competitive debate Speeches

In this work, we present a database of multimodal communication features...
research
12/27/2021

Responsive Listening Head Generation: A Benchmark Dataset and Baseline

Responsive listening during face-to-face conversations is a critical ele...
research
05/31/2022

Text/Speech-Driven Full-Body Animation

Due to the increasing demand in films and games, synthesizing 3D avatar ...
research
11/24/2022

On the Linguistic and Computational Requirements for Creating Face-to-Face Multimodal Human-Machine Interaction

In this study, conversations between humans and avatars are linguistical...
research
07/22/2022

Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos

The recent state of the art on monocular 3D face reconstruction from ima...
