
Self-supervised Learning of Visual Speech Features with Audiovisual Speech Enhancement

by Zakaria Aldeneh et al.
University of Michigan
Apple Inc.

We present an introspection of an audiovisual speech enhancement model. In particular, we focus on interpreting how a neural audiovisual speech enhancement model uses visual cues to improve the quality of the target speech signal. We show that visual features provide not only high-level information about speech activity, i.e., speech vs. no speech, but also fine-grained visual information about the place of articulation. An interesting byproduct of this finding is that the learned visual embeddings can be used as features for other visual speech applications. We demonstrate the effectiveness of the learned visual representations for classifying visemes (the visual analog of phonemes). Our results provide insight into important aspects of audiovisual speech enhancement and demonstrate how such models can be used for self-supervision tasks for visual speech applications.
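The viseme-classification use case described above amounts to a linear probe: freeze the visual encoder of the enhancement model, extract its embeddings for lip-region frames, and train a lightweight classifier on top. The sketch below illustrates the probing step only; the embeddings, their dimensionality, and the number of viseme classes are simulated placeholders, not the paper's actual model outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for embeddings produced by the frozen visual
# encoder of an audiovisual enhancement model: here, class-separated
# Gaussian clusters. In practice these would be extracted per video frame.
n_visemes, dim, per_class = 5, 16, 40
centers = rng.normal(size=(n_visemes, dim)) * 3.0
X = np.concatenate([centers[c] + rng.normal(size=(per_class, dim))
                    for c in range(n_visemes)])
y = np.repeat(np.arange(n_visemes), per_class)

# Linear probe: softmax regression trained on the frozen embeddings.
W = np.zeros((dim, n_visemes))
b = np.zeros(n_visemes)
onehot = np.eye(n_visemes)[y]
lr = 0.5
for _ in range(200):
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    grad = (p - onehot) / len(X)          # gradient of cross-entropy loss
    W -= lr * X.T @ grad
    b -= lr * grad.sum(axis=0)

acc = ((X @ W + b).argmax(axis=1) == y).mean()
print(f"train viseme accuracy: {acc:.2f}")
```

If the probe separates visemes well, that is evidence the encoder captured articulatory information rather than only speech/no-speech activity, which is the diagnostic logic the abstract describes.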


