PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network

09/13/2023
by Qinghua Liu, et al.

In everyday spoken communication, we commonly watch a talker's face, even as their head turns, to follow their voice. Humans look at the talker to listen better, and machines can benefit in the same way. However, previous studies on audio-visual speaker extraction have not effectively handled variation in the talking face caused by head pose. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that incorporates an additional pose-invariant view to improve audio-visual speaker extraction. Specifically, we generate the pose-invariant view from each original pose orientation, which gives the model a consistent frontal view of the talker regardless of head pose, thereby forming a multi-view visual input for the speaker. Experiments on the multi-view MEAD and in-the-wild LRS3 datasets demonstrate that PIAVE outperforms the state of the art and is more robust to pose variations.

