Speaker-Independent Speech-Driven Visual Speech Synthesis using Domain-Adapted Acoustic Models

05/15/2019
by Ahmed Hussen Abdelaziz et al.

Speech-driven visual speech synthesis involves mapping features extracted from acoustic speech to the corresponding lip animation controls for a face model. This mapping can take many forms, but a powerful approach is to use deep neural networks (DNNs). A key limitation, however, is the scarcity of synchronized audio, video, and depth data required to reliably train such DNNs, especially for speaker-independent models. In this paper, we investigate adapting an automatic speech recognition (ASR) acoustic model (AM) to the visual speech synthesis problem. We train the AM on ten thousand hours of audio-only data and then adapt it to the visual speech synthesis domain using ninety hours of synchronized audio-visual speech. In a subjective assessment test, we compared the performance of the AM-initialized DNN to that of a randomly initialized one. The results show that viewers significantly prefer animations generated by the AM-initialized DNN over those generated by the randomly initialized model. We conclude that visual speech synthesis can benefit significantly from the powerful representation of speech in ASR acoustic models.
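The core idea of the paper, reusing a pretrained acoustic model's hidden layers to initialize a synthesis network, can be sketched as follows. This is a minimal illustrative example, not the paper's actual architecture: the layer sizes, the number of animation controls, and the ReLU feed-forward structure are all assumptions made for the sketch. The hidden layers are copied from the (here, simulated) AM, while the senone classification head is discarded and replaced with a regression head that predicts lip animation controls.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(fan_in, fan_out):
    # He-style random initialization for a fully connected layer
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

# Stand-in for the ASR acoustic model's hidden layers (in the paper these
# would be trained on ~10,000 hours of audio-only data).
am_hidden = [make_layer(40, 256), make_layer(256, 256)]
am_senone_head = make_layer(256, 9000)  # classification head, discarded

# AM-initialized synthesis network: copy the hidden layers, attach a new
# regression head predicting (hypothetically) 50 animation controls/frame.
synth_hidden = [w.copy() for w in am_hidden]
synth_head = make_layer(256, 50)

def forward(features, hidden, head):
    h = features
    for w in hidden:
        h = np.maximum(h @ w, 0.0)  # ReLU hidden layers
    return h @ head                 # linear regression output

# One frame of 40-dimensional acoustic features -> 50 animation controls.
acoustic_features = rng.standard_normal((1, 40))
controls = forward(acoustic_features, synth_hidden, synth_head)
print(controls.shape)  # (1, 50)
```

From this AM-derived starting point, all layers would then be fine-tuned on the (much smaller) synchronized audio-visual corpus, which is the domain-adaptation step the paper compares against random initialization.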


Related research

- 06/13/2019: Speaker-Targeted Audio-Visual Models for Speech Recognition in Cocktail-Party Environments
- 01/10/2017: Multi-task Learning Of Deep Neural Networks For Audio Visual Automatic Speech Recognition
- 01/26/2016: Recurrent Neural Network Postfilters for Statistical Parametric Speech Synthesis
- 04/06/2020: Vocoder-Based Speech Synthesis from Silent Videos
- 02/23/2016: The IBM 2016 Speaker Recognition System
- 09/13/2023: Distinguishing Neural Speech Synthesis Models Through Fingerprints in Speech Waveforms
- 01/19/2012: Progress in animation of an EMA-controlled tongue model for acoustic-visual speech synthesis
