VCSE: Time-Domain Visual-Contextual Speaker Extraction Network

10/09/2022
by Junjie Li, et al.

Speaker extraction seeks to extract the target speech in a multi-talker scenario given an auxiliary reference. Such a reference can be auditory (pre-recorded speech), visual (lip movements), or contextual (a phonetic sequence). References in different modalities provide distinct and complementary information that can be fused to form top-down attention on the target speaker. Previous studies have introduced visual and contextual modalities in a single model. In this paper, we propose a two-stage time-domain visual-contextual speaker extraction network named VCSE, which incorporates visual and self-enrolled contextual cues stage by stage to take full advantage of each modality. In the first stage, we pre-extract the target speech with visual cues and estimate the underlying phonetic sequence. In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues. Experimental results on the real-world Lip Reading Sentences 3 (LRS3) database demonstrate that our proposed VCSE network consistently outperforms state-of-the-art baselines.
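
The two-stage flow described in the abstract can be pictured with a small Python sketch. This is only an illustration of the data flow, not the authors' network: the function names (`stage_one`, `stage_two`, `two_stage_extract`), the clamped-mask rule, the 0.1 activity threshold, and the 0/1 "phonetic" labels are all hypothetical stand-ins for learned components that the abstract does not specify.

```python
from typing import List, Tuple

Signal = List[float]


def stage_one(mixture: Signal, visual_cue: Signal) -> Tuple[Signal, List[int]]:
    """Hypothetical stand-in for stage 1.

    Pre-extracts the target speech from the mixture using the visual cue and
    estimates a (toy) phonetic sequence from the pre-extracted signal.
    """
    # Toy "visual attention": clamp the cue to [0, 1] and use it as a mask.
    pre_extracted = [m * min(max(v, 0.0), 1.0) for m, v in zip(mixture, visual_cue)]
    # Toy "phonetic sequence": label 1 where the signal is active, else 0.
    phonetic_seq = [1 if abs(s) > 0.1 else 0 for s in pre_extracted]
    return pre_extracted, phonetic_seq


def stage_two(pre_extracted: Signal, phonetic_seq: List[int]) -> Signal:
    """Hypothetical stand-in for stage 2.

    Refines the pre-extracted speech with the self-enrolled contextual cue,
    i.e. the phonetic sequence estimated in stage 1 (no external enrollment).
    """
    # Toy refinement: keep samples labeled active, silence the rest.
    return [s if p == 1 else 0.0 for s, p in zip(pre_extracted, phonetic_seq)]


def two_stage_extract(mixture: Signal, visual_cue: Signal) -> Signal:
    """Chain the two stages: visual cue first, then the contextual cue."""
    pre_extracted, phonetic_seq = stage_one(mixture, visual_cue)
    return stage_two(pre_extracted, phonetic_seq)
```

The point of the sketch is the ordering: the contextual cue used in the second stage is derived from the first stage's own output, which is what "self-enrolled" refers to in the abstract.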


