Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation

04/10/2018
by Ariel Ephrat, et al.

We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not associate the separated speech signals with speakers in the video. In this paper, we present a deep network-based model that incorporates both visual and auditory signals to solve this task. The visual features are used to "focus" the audio on desired speakers in a scene and to improve the speech separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprising thousands of hours of video segments from the Web. We demonstrate the applicability of our method to classic speech separation tasks, as well as to real-world scenarios involving heated interviews, noisy bars, and screaming children, requiring only that the user specify the face of the person in the video whose speech they want to isolate. Our method shows a clear advantage over state-of-the-art audio-only speech separation in cases of mixed speech. In addition, our model, which is speaker-independent (trained once, applicable to any speaker), produces better results than recent audio-visual speech separation methods that are speaker-dependent (requiring a separate model to be trained for each speaker of interest).
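To make the fusion idea concrete, the sketch below is a minimal PyTorch skeleton of the two-stream pattern the abstract describes: an audio stream encodes the mixture spectrogram, a visual stream encodes per-frame face embeddings for the target speaker, the two are fused along time, and a mask over the spectrogram is predicted. This is not the authors' architecture (the paper uses dilated convolutions and predicts a complex mask per visible speaker); layer sizes, the face-embedding dimensionality, and the assumption that face features have been resampled to the audio frame rate are all illustrative.

```python
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    """Simplified audio-visual fusion sketch; sizes are assumptions."""

    def __init__(self, n_freq=257, face_dim=512, hidden=256):
        super().__init__()
        # Audio stream: 2-D convolutions over the (real, imag) spectrogram.
        self.audio = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Visual stream: 1-D convolutions over per-frame face embeddings,
        # assumed already resampled to one embedding per audio frame.
        self.visual = nn.Sequential(
            nn.Conv1d(face_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Fusion: concatenate per-time-step features, then a BLSTM.
        self.blstm = nn.LSTM(32 * n_freq + hidden, hidden,
                             batch_first=True, bidirectional=True)
        # Output: two values (real, imag) per time-frequency bin.
        self.mask = nn.Linear(2 * hidden, 2 * n_freq)

    def forward(self, spec, faces):
        # spec:  (B, 2, T, F) real/imag spectrogram of the noisy mixture
        # faces: (B, T, face_dim) face embeddings for the target speaker
        B, _, T, F = spec.shape
        a = self.audio(spec)                          # (B, 32, T, F)
        a = a.permute(0, 2, 1, 3).reshape(B, T, -1)   # (B, T, 32*F)
        v = self.visual(faces.transpose(1, 2))        # (B, hidden, T)
        v = v.transpose(1, 2)                         # (B, T, hidden)
        h, _ = self.blstm(torch.cat([a, v], dim=-1))  # (B, T, 2*hidden)
        m = self.mask(h).reshape(B, T, F, 2)          # mask per t-f bin
        return m.permute(0, 3, 1, 2)                  # (B, 2, T, F)

model = AudioVisualSeparator()
spec = torch.randn(1, 2, 298, 257)   # a few seconds of STFT frames
faces = torch.randn(1, 298, 512)     # one face embedding per frame
mask = model(spec, faces)            # (1, 2, 298, 257)
```

In the paper the predicted mask is applied to the mixture spectrogram (via complex multiplication) and the result is inverted with an ISTFT to recover the target speaker's waveform; conditioning the mask on the chosen face embedding is what lets a single trained model isolate any speaker the user selects.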


Related research

FaceFilter: Audio-visual speech separation using still images (05/14/2020)
The objective of this paper is to separate a target speaker's speech fro...

Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss (03/02/2021)
We present an audio-visual speech separation learning method that consid...

Audio-visual Speech Separation with Adversarially Disentangled Visual Representation (11/29/2020)
Speech separation aims to separate individual voice from an audio mixtur...

Putting a Face to the Voice: Fusing Audio and Visual Signals Across a Video to Determine Speakers (05/31/2017)
In this paper, we present a system that associates faces with voices in ...

Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion (03/31/2016)
Speaker diarization consists of assigning speech signals to people engag...

WHAM!: Extending Speech Separation to Noisy Environments (07/02/2019)
Recent progress in separating the speech signals from multiple overlappi...

Disentangling speech from surroundings in a neural audio codec (03/29/2022)
We present a method to separate speech signals from noisy environments i...
