End-to-End Active Speaker Detection

03/27/2022
by   Juan Leon Alcazar, et al.
5

Recent advances in the Active Speaker Detection (ASD) problem build upon a two-stage process: feature extraction and spatio-temporal context aggregation. In this paper, we propose an end-to-end ASD workflow where feature learning and contextual predictions are jointly learned. Our end-to-end trainable network simultaneously learns multi-modal embeddings and aggregates spatio-temporal context. This results in more suitable feature representations and improved performance in the ASD task. We also introduce interleaved graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem. Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance. Finally, we design a weakly-supervised strategy, which demonstrates that the ASD problem can also be approached by utilizing audiovisual data but relying exclusively on audio annotations. We achieve this by modelling the direct relationship between the audio signal and the possible sound sources (speakers), as well as introducing a contrastive loss.

READ FULL TEXT

page 2

page 3

page 4

research
03/29/2016

Cross-modal Supervision for Learning Active Speaker Detection in Video

In this paper, we show how to use audio to supervise the learning of act...
research
11/04/2019

pyannote.audio: neural building blocks for speaker diarization

We introduce pyannote.audio, an open-source toolkit written in Python fo...
research
01/11/2021

MAAS: Multi-modal Assignation for Active Speaker Detection

Active speaker detection requires a solid integration of multi-modal cue...
research
07/27/2022

End-To-End Audiovisual Feature Fusion for Active Speaker Detection

Active speaker detection plays a vital role in human-machine interaction...
research
09/13/2023

Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification

In this paper, we present a methodology for achieving robust multimodal ...
research
06/13/2023

Contrastive Learning-Based Audio to Lyrics Alignment for Multiple Languages

Lyrics alignment gained considerable attention in recent years. State-of...
research
06/22/2017

Deep Speaker Verification: Do We Need End to End?

End-to-end learning treats the entire system as a whole adaptable black ...

Please sign up or login with your details

Forgot password? Click here to reset