While there has been great progress in the field of automatic speech recognition (ASR) in recent years, some key challenges remain, particularly the understanding of speech in very noisy environments or in cases where multiple people speak simultaneously. In this direction, isolating voices in multi-speaker scenarios, increasing the signal-to-noise ratio in noisy audio, or combinations of both are all important tasks.
Audio-visual models, which condition the enhancement on the target speaker's lip movements, have demonstrated impressive results; however, given their dependence on the visual input, they may fail when the mouth area is occluded by the speaker's hands or a microphone (e.g. Fig. 1), or when the speaker turns their head away. Contemporaneously, it has been shown that an embedding of the speaker's voice can guide the separation of simultaneous speech.
In this paper we propose combining the two approaches, i.e. conditioning on both the video input containing the speaker’s lip movement and an embedding of their voice, in order to make the audio-visual models robust to occlusions. Our assumption is that the video provides invaluable discriminative information when present, while the speaker embedding can help the model when the video is absent due to occlusions. In the simplest case, the voice embedding can be obtained from pre-enrolled audio.
While it is possible to separate simultaneous speakers using only the audio [5, 6], the permutation problem (assigning each separated track to the correct speaker) remains unsolved. With our approach, even partially occluded video can provide information on the voice characteristics of the speaker and resolve the ambiguity of assigning the separated voice to the speaker.
We make the following contributions: (i) we show how speaker embedding and visual cues can be combined to separate a single speaker from a mixture of voices despite the visual stream (the lips) being occluded; (ii) we propose a neural network model that can operate with video only, enrollment data only, or both; and (iii) we introduce a recurrent model that can bootstrap the computation of the speaker embedding under temporary occlusions, without requiring a prior speaker embedding. We term this self-enrollment.
1.1 Related Work
Audio-only enhancement and separation. Various methods have been proposed to isolate multi-talker simultaneous speech, the majority of which only use monaural audio, e.g. [7, 8, 9, 10, 11]. A number of recent works have addressed the permutation problem to separate unseen speakers. Deep clustering uses embeddings trained to yield a low-rank approximation to an ideal pairwise affinity matrix, whilst Yu et al. employ a permutation-invariant loss.
Audio-visual speech enhancement.
Prior to the advent of deep learning, numerous works had been developed for audio-visual speech enhancement [12, 13, 14, 15, 16, 17]. Several recent methods have used a deep learning framework for the same task, most notably [18, 19, 20]. However, these methods are limited in that they are only demonstrated under constrained conditions (e.g. the utterances consist of a fixed set of phrases), or for a small number of known speakers. Our previous work proposed a deep audio-visual speech enhancement network that is able to separate a speaker's voice given lip regions in the corresponding video, by predicting both the magnitude and the phase of the target signal. Ephrat et al. designed a network that conditions on the video input of all the source speakers and outputs complex masks, thus also enhancing both magnitude and phase. Owens and Efros train a network on audio-visual synchronization and use the learned features for speaker separation. These last works demonstrate results in the general, in-the-wild case.
Enhancement by conditioning on voice only. Wang et al.  develop a method that separates voices conditioned on pre-learned speaker embeddings, showing that voice characteristics alone can be enough to determine the separation. This however relies on a pretrained model and does not use video.
2 Architecture
This section describes the architecture of the audio-visual speech enhancement network, shown in Figure 2. The network receives three inputs: (i) the noisy audio to be enhanced; (ii) the corresponding video frames; (iii) a reference audio containing speech from the target speaker. We summarize the principal modules below. Details of the architecture are provided in Table 1.
The temporal dimension of each layer's output is given in Table 1. The non-transposed convolution layers are all depth-wise separable; Batch Normalization, ReLU activation and a shortcut connection are added after every convolutional layer.
Video representation. Input to the network is pre-cropped image frames, such as the face crops found in the LRS datasets [21, 22]. Visual features are extracted from the sequence of image frames using a spatiotemporal residual network, which contains a 3D convolution layer followed by a standard 18-layer 2D ResNet. For every video frame it outputs a compact feature vector.
Audio representation. As acoustic features, we use the magnitude and phase spectrograms extracted from the audio waveforms using a Short-Time Fourier Transform (STFT) with a 25 ms window length and a 10 ms hop length at a sample rate of 16 kHz. This results in spectrograms with a time dimension four times the number of corresponding video frames: denoting the number of video frames by Tv, the time resolution of the spectrograms is T = 4Tv.
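As a concrete illustration of these parameters, the following NumPy sketch computes the magnitude and phase spectrograms; the 400-sample window, 160-sample hop, and the 25 fps video frame rate are assumptions consistent with the 16 kHz sample rate and the stated 4:1 ratio, not values taken verbatim from Table 1.

```python
import numpy as np

SR = 16000
WIN = int(0.025 * SR)   # 400-sample (25 ms) analysis window
HOP = int(0.010 * SR)   # 160-sample (10 ms) hop
FPS = 25                # assumed LRS video frame rate

def stft_features(wav):
    """Magnitude and phase spectrograms of a 1-D waveform, shape (T, F)."""
    n_frames = 1 + (len(wav) - WIN) // HOP
    window = np.hanning(WIN)
    frames = np.stack([wav[i * HOP : i * HOP + WIN] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)   # complex spectrogram
    return np.abs(spec), np.angle(spec)

# 2 s of audio -> 50 video frames at 25 fps -> roughly 4x as many STFT frames
wav = np.random.randn(2 * SR)
mag, phase = stft_features(wav)
```

With a 10 ms hop and 25 fps video, each video frame spans exactly four spectrogram time steps, which is what allows the later fusion of the two streams at spectrogram resolution.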
Speaker embedding network. For embedding a reference audio clip into a compact speaker representation, we use the method of Xie et al. To reduce the number of computations, we replace all 2D spatial convolutions with 1D temporal ones, which regard the frequency bins as channels, and pre-train the modified architecture on the VoxCeleb2 dataset.
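The reshaping this substitution implies can be sketched as follows; the feature dimensions and kernel size are illustrative assumptions, not values from the paper.

```python
import numpy as np

T, F, C = 300, 257, 64          # time steps, frequency bins, output channels
spec = np.random.randn(T, F)    # a magnitude spectrogram

# 2D view (original): a single-channel image of shape (1, T, F),
# convolved spatially over both time and frequency.
img = spec[None, :, :]

# 1D view (used here): the F frequency bins become the input channels of a
# temporal convolution, so each kernel spans (F, k) and slides over time only.
def conv1d_temporal(x, weight):
    """x: (F, T), weight: (C, F, k) -> output (C, T - k + 1)."""
    k = weight.shape[-1]
    return np.stack([
        np.tensordot(weight, x[:, t : t + k], axes=([1, 2], [0, 1]))
        for t in range(x.shape[1] - k + 1)
    ], axis=1)

w = np.random.randn(C, F, 5)    # kernel size 5 is an arbitrary choice
y = conv1d_temporal(spec.T, w)
```

Treating frequency as channels removes one spatial dimension from every convolution, which is where the computational saving comes from.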
Modality combination. As shown in Figure 2, the noisy magnitude spectrograms are encoded into audio feature vectors through a shallow temporal ResNet. The video features are upsampled through a network containing two transposed convolution layers to match the temporal dimension of the spectrograms. The speaker embedding extracted from the reference audio is tiled temporally and added to the resulting video embeddings to form the conditioning vector used for the enhancement. This vector is then fed along with the noisy audio embedding into a one-layer bidirectional LSTM, followed by two fully connected layers. The output has spectrogram dimensions and is passed through a sigmoid activation to produce the enhancement mask.
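A shape-level sketch of this fusion step, with assumed feature dimensions and nearest-neighbour repetition standing in for the transposed-convolution upsampling:

```python
import numpy as np

T, D = 200, 256                           # spectrogram steps, feature dim (assumed)
audio_feats = np.random.randn(T, D)       # encoded noisy magnitudes
video_feats = np.random.randn(T // 4, D)  # one feature vector per video frame

# Upsample the video stream 4x in time (stand-in for the two
# transposed convolution layers).
video_up = np.repeat(video_feats, 4, axis=0)            # (T, D)

# Tile the speaker embedding over time and add it to the video features.
spk_emb = np.random.randn(D)
cond = video_up + np.tile(spk_emb, (T, 1))              # (T, D)

# Concatenate with the audio embedding; this is what feeds the BLSTM.
blstm_in = np.concatenate([audio_feats, cond], axis=1)  # (T, 2D)
```

The key point is that all three inputs are brought to the same temporal resolution T before fusion, so the BLSTM sees one joint vector per spectrogram time step.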
Phase sub-network. In order to adjust the noisy phases to the enhanced magnitudes, we use the phase network of our previous work without any changes.
Self-enrollment. For self-enrollment, the magnitude network is run twice: on the first pass, no speaker embedding is added to the visual one. The magnitudes that are output then serve as input to the speaker embedding network, as indicated by the red feedback arrow, and the network is run a second time, now conditioned on the resulting speaker embedding.
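The two-pass control flow can be sketched as below. The sub-networks are toy stand-ins (the real ones are the networks described above, and the video input is omitted here for brevity); only the pass structure mirrors the method.

```python
import numpy as np

T, F = 200, 257   # spectrogram dimensions (illustrative)

def magnitude_net(noisy_mag, spk_emb=None):
    """Toy mask-and-apply stand-in for the magnitude sub-network."""
    m = 0.5 if spk_emb is None else 1.0 / (1.0 + np.exp(-spk_emb.mean()))
    return m * noisy_mag

def embed_speaker(mag):
    """Toy stand-in for the speaker embedding network."""
    return mag.mean(axis=0)

def self_enroll(noisy_mag):
    first_pass = magnitude_net(noisy_mag)          # pass 1: no embedding
    emb = embed_speaker(first_pass)                # enroll on own output
    return magnitude_net(noisy_mag, spk_emb=emb)   # pass 2: with embedding

noisy = np.abs(np.random.randn(T, F))
enhanced = self_enroll(noisy)
```

Note that both passes consume the same noisy input; only the conditioning changes, so no pre-enrolled audio is ever needed.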
We minimize the learning objective
$$\mathcal{L} \;=\; \frac{1}{TF}\sum_{t=1}^{T}\sum_{f=1}^{F}\Big(\big|\hat{M}_{tf}-M_{tf}\big| \;-\; M_{tf}\cos\big(\hat{\phi}_{tf}-\phi_{tf}\big)\Big),$$
where $\hat{M}$, $M$ and $\hat{\phi}$, $\phi$ are the predicted and ground truth magnitude and phase spectrograms respectively, and $T$ and $F$ their time and frequency resolutions.
3 Experimental Setup
Datasets. Our models are trained on the MV-LRS, LRS2 and LRS3 datasets, and tested on LRS3. MV-LRS and LRS2 contain material from British television broadcasts, while LRS3 was created from videos of TED talks. The speakers appearing in LRS3 are, to the best of our knowledge, not seen in either of the other two datasets. The datasets share the same format and pipeline, including the face detection step, therefore no pre-processing is required in order to utilise them together for training. We remove from the LRS3 training set the few speakers that also appear in the test set, so that there is no overlap of identities between the two. Hence, the test set contains only speakers unseen and unheard during training and is suitable for a speaker-agnostic evaluation of our methods. Moreover, since the test set of LRS3 contains relatively short sentences, for testing we extract some longer sub-sequences from the original material used to make the LRS3 test set. We only use samples from speakers that appear in at least 2 different videos (TED talks), to enable enrollment with audio recorded in a different setting than the target one. These extra videos, along with the added noise and occlusions, have been made publicly available on the project website.
Synthetic data. We generate synthetic examples similarly to other works [1, 2, 4] by first sampling one reference audio-visual utterance from the training dataset and then mixing its audio with interfering audio signals. We consider two scenarios: 2 speakers and 3 speakers, where one and two interfering voices are added to the target signal respectively.
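A minimal sketch of this mixing procedure (additive mixing at the waveform level, trimmed to the shortest signal; gain normalization, if any, is omitted):

```python
import numpy as np

def make_mixture(target, interferers):
    """Mix a target waveform with 1 (2-speaker) or 2 (3-speaker)
    interfering waveforms, all trimmed to the shortest length."""
    n = min(len(target), *(len(x) for x in interferers))
    mix = target[:n] + sum(x[:n] for x in interferers)
    return mix, target[:n]          # mixture, ground-truth reference

# 3-speaker scenario: one target plus two interferers of different lengths
a, b, c = (np.random.randn(n) for n in (16000, 20000, 12000))
mix, ref = make_mixture(a, [b, c])
```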
Enrollment. During training we do not know the identities of the speakers. Therefore, we obtain the enrollment signal from the same video but a different, non-overlapping time segment. This effectively reduces the amount of data we can use as we need to discard shorter videos (e.g. if we use 3 seconds, we can only use videos at least 6 seconds long). We use this method for training on datasets where the speaker identities are not known.
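The same-video enrollment sampling can be sketched as follows; the 3-second enrollment length matches the example above, and the simple front/back split is an illustrative choice, not necessarily the exact sampling used in training.

```python
import random

def sample_enrollment(wav, sr=16000, seg=3.0, seed=0):
    """Split a clip into a target part and a non-overlapping 3 s
    enrollment segment; the clip must be at least 2 * seg long."""
    n_seg = int(seg * sr)
    if len(wav) < 2 * n_seg:
        raise ValueError("clip too short for same-clip enrollment")
    rng = random.Random(seed)
    if rng.random() < 0.5:
        return wav[n_seg:], wav[:n_seg]     # target, enrollment (from start)
    return wav[:-n_seg], wav[-n_seg:]       # target, enrollment (from end)

target, enroll = sample_enrollment(list(range(7 * 16000)))
```

The length check makes the data-reduction trade-off explicit: with 3-second enrollment segments, only clips of at least 6 seconds can be used.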
During evaluation we experiment with two enrollment methods: (i) pre-enrollment – we sample an enrollment segment from a video of the same speaker that is different from the one used to create the target sample (we do have identity labels for the test set); (ii) self-enrollment – we obtain the enrollment audio with a pass through our network that does not use a speaker embedding, as explained in Section 2.
Occlusions. For training, we artificially add occlusions to the video frames in the form of random patches as shown in Figure 2(a). We randomly occlude sub-sequences of 15 to 25 contiguous frames, maintaining the clear-to-occluded frames ratio at 1:3. This is more realistic than simply zeroing out the incoming visual frames, as occluded video frames still produce valid feature vectors. For evaluation however, instead of random patches, we place jittering emojis on the videos as shown in Figure 2(b). This type of visual noise has not been seen during training. The emojis are used to occlude the video from the start and the end, while the middle of the utterance is kept clear.
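A sketch of the training-time occlusion schedule (the burst lengths of 15 to 25 frames and the 1:3 clear-to-occluded ratio are from the text; the greedy placement strategy is an assumption):

```python
import random

def occlusion_mask(n_frames, lo=15, hi=25, occluded_frac=0.75, seed=0):
    """Boolean mask marking occluded frames: contiguous bursts of
    lo..hi frames are added until roughly occluded_frac of the
    sequence is covered (clear : occluded = 1 : 3)."""
    rng = random.Random(seed)
    mask = [False] * n_frames
    for _ in range(10 * n_frames):              # safety bound on iterations
        if sum(mask) >= occluded_frac * n_frames:
            break
        length = rng.randint(lo, hi)
        start = rng.randrange(max(1, n_frames - length + 1))
        for i in range(start, min(n_frames, start + length)):
            mask[i] = True                      # this frame gets a patch
    return mask

mask = occlusion_mask(200)
```

Frames where the mask is True would then be overwritten with a random patch before feature extraction, so the visual front-end still produces (corrupted but valid) feature vectors for them.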
Training. The spatio-temporal visual front-end is pre-trained on a word-level lip reading task. We then freeze the front-end and pre-compute the visual features. The features are extracted on a version of the videos where we have added random occlusions.
Training is conducted in four phases. First, we pre-train the magnitude sub-network with only speaker embedding inputs, using mixtures of two and then three speakers. Second, the visual modality is added and the magnitude network is trained on the saved visual features for the three-simultaneous-speakers scenario. Third, the magnitude network is frozen and the phase network is trained. Finally, the whole network is trained end-to-end.
4.1 Evaluation protocol
To evaluate the performance of our model we use the Signal-to-Distortion Ratio (SDR), a common metric expressing the ratio between the energy of the target signal and of the errors contained in the enhanced output. Furthermore, to assess the intelligibility of the output, we use the Google Cloud ASR system: we compute the Word Error Rate (WER) between the prediction of the ASR system on the enhanced audio and the ground truth transcriptions of the utterances contained in the segments used for evaluation. We evaluate on fixed-length video segments of 8 seconds (200 frames). Some additional performance measures are reported in the Appendix. Qualitative examples can be found on the project website http://www.robots.ox.ac.uk/~vgg/research/concealed.
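For intuition, a simplified SDR computation is sketched below; it treats the full difference est - ref as distortion, whereas the BSS-Eval version used in the paper additionally projects out allowed scalings of the reference.

```python
import numpy as np

def sdr_db(est, ref):
    """Simplified Signal-to-Distortion Ratio in dB:
    10 log10(energy of reference / energy of (est - ref))."""
    err = est - ref
    return 10.0 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))

# A clean tone plus a little noise should score a high SDR.
t = np.linspace(0.0, 1.0, 16000)
ref = np.sin(2 * np.pi * 440 * t)
est = ref + 0.01 * np.random.randn(ref.size)
```

Higher is better: a perfect reconstruction drives the error energy toward zero and the SDR toward infinity.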
[Table 2. Columns: Tr. Occ. | T. Occ. | Enr. | SDR (dB) | WER (%); results are reported both with no occlusion and with 80% occlusion during evaluation.]
4.2 Baseline models
We compare our proposed approach to the following baselines and ablations, which we train and evaluate both with and without visual occlusions.
PIT. We implement a blind source separation model that uses only the noisy audio input stream of Fig. 2 and is trained with a permutation-invariant loss following Yu et al. This model is tailored to a predefined number of speakers.
V-Conv. This is the convolutional, visually conditioned baseline of Afouras et al. . The model uses a series of 1D convolutional blocks for fusing the audio and video modalities instead of a BLSTM. Moreover, the video features are not upsampled by the video stream as in our proposed model, but the audio-visual fusion is performed at the temporal resolution of the video frames. The 1D convolutional stack then upsamples the fused input to the dimensions of the spectrograms.
V-BLSTM. This model is similar to our proposed architecture but conditions only on video features.
VoiceFilter. This model conditions only on speaker embeddings and is equivalent to the sub-network used during the first stage of the training process. It is essentially a VoiceFilter implementation with a slightly modified architecture, trained on our dataset.
VS. Our proposed architecture, which receives both video and speaker embedding inputs. As discussed in Section 2, we investigate two variants, VS-pre and VS-self, that correspond to the different enrollment methods employed during evaluation.
We summarize the results of our experiments in Table 2. When no occlusions are used, the V-BLSTM model only slightly outperforms V-Conv. When 80% of the visual input frames are occluded, the models that have not been trained with occlusions fail. Even when we include occlusions during the training of V-Conv, it cannot deal with the missing visual information, since its receptive field is limited (about 1 second to either side). On the contrary, V-BLSTM uses its memory and learns to deal with local occlusions. Overall however, the proposed VS models that explicitly condition on the expected speaker embedding give the best performance.
The results furthermore verify that both the VoiceFilter and VS-pre models perform well when evaluated using enrollment signals from sources different from the target one, even though they have never been trained in this setting.
The effect of occluding different amounts of the visual input is studied in Fig. 4. The V-BLSTM model that has not been trained on occlusions does not perform well when even small parts of the video input are occluded. When trained with occlusions, V-BLSTM becomes much more resilient, however it still gives bad results for high occlusion percentages and completely fails when the entire video is occluded.
The VS-pre model outperforms V-BLSTM when half or more of the input is occluded and gives similar results for cleaner inputs.
For very high occlusion levels, the initial enhanced estimate of VS-self is poor and evidently unable to capture the target voice characteristics. However, when enough of the frames are clean, self-enrollment performs best. Therefore, apart from the highest occlusion levels, VS with self-enrollment provides an advantage over V-BLSTM.
In this paper, we proposed a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on the speaker's lip movements and/or a representation of their voice. The network is robust to partial occlusions, and the voice representation can be self-enrolled from the unoccluded part of the input when it is not possible to obtain segments for pre-enrollment. The methods are evaluated on the challenging LRS3 dataset, and demonstrate performance that exceeds that of the previous state of the art when the video input is partially occluded.
Acknowledgements. Funding for this research is provided by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems, the Oxford-Google DeepMind Graduate Scholarship, and the EPSRC Programme Grant Seebibyte EP/M013774/1.
-  T. Afouras, J. S. Chung, and A. Zisserman, “The conversation: Deep audio-visual speech enhancement,” in Proc. Interspeech, 2018.
-  A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, “Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation,” SIGGRAPH, 2018.
-  A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in Proc. ECCV, 2018, pp. 631–648.
-  Q. Wang, H. Muckenhirn, K. Wilson, P. Sridhar, Z. Wu, J. Hershey, R. A. Saurous, R. J. Weiss, Y. Jia, and I. L. Moreno, “Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking,” in Proc. Interspeech, 2018.
-  J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe, “Deep clustering: Discriminative embeddings for segmentation and separation,” in Proc. ICASSP. IEEE, 2016, pp. 31–35.
-  D. Yu, M. Kolbæk, Z. Tan, and J. Jensen, “Permutation invariant training of deep models for speaker-independent multi-talker speech separation,” in Proc. ICASSP, 2017.
-  A. M. Reddy and B. Raj, “Soft mask methods for single-channel speaker separation,” IEEE Transactions on Audio, Speech, and Language Processing, 2007.
-  Z. Jin and D. Wang, “A supervised learning approach to monaural segregation of reverberant speech,” IEEE Transactions on Audio, Speech, and Language Processing, 2009.
-  M. H. Radfar and R. M. Dansereau, “Single-channel speech separation using soft mask filtering,” IEEE Transactions on Audio, Speech, and Language Processing, 2007.
-  S. Makino, T.-W. Lee, and H. Sawada, Blind speech separation. Springer, 2007.
-  D. Wang and J. Chen, “Supervised speech separation based on deep learning: an overview,” IEEE Transactions on Audio, Speech and Language Processing, 2017.
-  F. Khan and B. Milner, “Speaker separation using visually-derived binary masks,” in AVSP, 2013.
-  W. Wang, D. Cosker, Y. Hicks, S. Saneit, and J. Chambers, “Video assisted speech source separation,” in Proc. ICASSP, 2005.
-  L. Girin, J.-L. Schwartz, and G. Feng, “Audio-visual enhancement of speech in noise,” The Journal of the Acoustical Society of America, 2001.
-  S. Deligne, G. Potamianos, and C. Neti, “Audio-visual speech enhancement with avcdcn (audio-visual codebook dependent cepstral normalization),” in Sensor Array and Multichannel Signal Processing Workshop Proceedings, 2002.
-  J. R. Hershey and M. Casey, “Audio-visual sound separation via hidden markov models,” in NIPS, 2002.
-  B. Rivet, W. Wang, S. M. Naqvi, and J. A. Chambers, “Audiovisual speech source separation: An overview of key methodologies,” IEEE Signal Processing Magazine, 2014.
-  A. Gabbay, A. Ephrat, T. Halperin, and S. Peleg, “Seeing through noise: Visually driven speaker separation and enhancement,” 2018.
-  A. Gabbay, A. Shamir, and S. Peleg, “Visual Speech Enhancement using Noise-Invariant Training,” arXiv preprint arXiv:1711.08789, 2017.
-  J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
-  T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep audio-visual speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
-  T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: a large-scale dataset for visual speech recognition,” in arXiv preprint arXiv:1809.00496, 2018.
-  T. Stafylakis and G. Tzimiropoulos, “Combining residual networks with LSTMs for lipreading,” in Proc. Interspeech, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016.
-  W. Xie, A. Nagrani, J. S. Chung, and A. Zisserman, “Utterance-level aggregation for speaker recognition in the wild,” in Proc. ICASSP, 2019.
-  J. S. Chung, A. Nagrani, and A. Zisserman, “VoxCeleb2: Deep speaker recognition,” in Proc. Interspeech, 2018.
-  J. S. Chung and A. Zisserman, “Lip reading in profile,” in Proc. BMVC., 2017.
-  C. Févotte, R. Gribonval, and E. Vincent, “BSS EVAL toolbox user guide,” IRISA Technical Report 1706, http://www.irisa.fr/metiss/bss_eval/, 2005.