Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss

03/02/2021
by Naoki Makishima, et al.

We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals in order to reflect speech characteristics during training. Audio-visual speech separation is a technique for estimating individual speech signals from a mixture using the visual signals of the speakers. Conventional studies on audio-visual speech separation mainly train the separation model with an audio-only loss that reflects the distance between the source signals and the separated signals. However, such losses do not capture the characteristics of the speech signals, including the speaker's characteristics and phonetic information, which leads to distortion or residual noise. To address this problem, we propose the cross-modal correspondence (CMC) loss, which is based on the cooccurrence of the speech signal and the visual signal. Since the visual signal is not affected by background noise and contains speaker and phonetic information, the CMC loss enables the audio-visual speech separation model to remove noise while preserving the speech characteristics. Experimental results demonstrate that the proposed method learns the cooccurrence on the basis of the CMC loss, which improves separation performance.
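The abstract does not give the exact formulation of the CMC loss, but a common way to score the cooccurrence of paired audio and visual streams is a contrastive objective: embeddings of matched audio/visual pairs are pulled together, mismatched pairs pushed apart. The sketch below is a hypothetical illustration of such a correspondence loss, not the paper's actual formulation; the embedding shapes, temperature, and pairing scheme are all assumptions.

```python
import numpy as np

def cmc_style_loss(audio_emb, visual_emb, temperature=0.1):
    """Illustrative contrastive correspondence loss (hypothetical sketch;
    the paper's exact CMC loss may differ).

    audio_emb, visual_emb: (N, D) arrays where row i of each matrix
    comes from the same speaker, so the i-th pair should cooccur.
    """
    # L2-normalise so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature                 # (N, N) similarity matrix

    # Softmax cross-entropy with the matched pairs on the diagonal
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 16))
# Perfectly corresponding pairs should score a much lower loss
# than randomly paired embeddings.
loss_matched = cmc_style_loss(emb, emb)
loss_random = cmc_style_loss(emb, rng.standard_normal((4, 16)))
print(f"matched: {loss_matched:.4f}, random: {loss_random:.4f}")
```

Used as an auxiliary term alongside the audio-only separation loss, a correspondence objective of this shape penalises separated signals whose audio embedding does not cooccur with the correct speaker's visual stream.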

