FaVoA: Face-Voice Association Favours Ambiguous Speaker Detection

09/01/2021
by Hugo Carneiro, et al.

The strong relation between face and voice can aid active speaker detection systems when faces are visible, even in difficult settings such as when a speaker's face is not clearly seen or when several people share the same scene. Estimating a person's frontal facial representation from their speech makes it easier to determine whether that person is a plausible candidate for the active speaker, even in challenging cases in which no mouth movement is detected from anyone in the scene. By incorporating a face-voice association neural network into an existing state-of-the-art active speaker detection model, we introduce FaVoA (Face-Voice Association Ambiguous Speaker Detector), a neural network model that can correctly classify particularly ambiguous scenarios. FaVoA not only finds positive associations but also helps to rule out cases in which a face does not match a voice. Its use of a gated-bimodal-unit architecture to fuse those models offers a way to quantify how much each modality contributes to the classification.
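
The gated fusion mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used gated-unit formulation, in which each modality is projected through a tanh layer and a sigmoid gate computed from both inputs mixes the two projections; the abstract does not give FaVoA's actual layers, so the function name, the dimensions, and the random weights below are all hypothetical. The point is only to show why the gate values can be read as a per-dimension measure of how much each modality contributes.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_bimodal_unit(x_face, x_voice, W_face, W_voice, W_gate):
    """Fuse two modality embeddings with a per-dimension gate.

    Returns the fused vector h and the gate z. A value z[i] close to 1
    means dimension i is dominated by the face branch; close to 0 means
    it is dominated by the voice branch.
    """
    h_face = [math.tanh(v) for v in matvec(W_face, x_face)]
    h_voice = [math.tanh(v) for v in matvec(W_voice, x_voice)]
    # The gate is computed from both modalities concatenated.
    z = [sigmoid(v) for v in matvec(W_gate, x_face + x_voice)]
    h = [zi * hf + (1.0 - zi) * hv for zi, hf, hv in zip(z, h_face, h_voice)]
    return h, z


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes and untrained random weights, for illustration only.
rng = random.Random(0)
FACE_DIM, VOICE_DIM, HIDDEN_DIM = 4, 3, 5

W_face = random_matrix(HIDDEN_DIM, FACE_DIM, rng)
W_voice = random_matrix(HIDDEN_DIM, VOICE_DIM, rng)
W_gate = random_matrix(HIDDEN_DIM, FACE_DIM + VOICE_DIM, rng)

x_face = [rng.uniform(-1, 1) for _ in range(FACE_DIM)]
x_voice = [rng.uniform(-1, 1) for _ in range(VOICE_DIM)]

h, z = gated_bimodal_unit(x_face, x_voice, W_face, W_voice, W_gate)

# Averaging the gate gives a rough share of each modality in the fused output.
face_share = sum(z) / len(z)
print(f"face share: {face_share:.2f}, voice share: {1 - face_share:.2f}")
```

Because the fused vector is a convex combination of the two branches, the two shares always sum to 1. In a trained model the gate would be inspected per input, for example to check whether ambiguous scenes lean more on the voice branch.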


research
03/29/2022

VoiceMe: Personalized voice generation in TTS

Novel text-to-speech systems can generate entirely new voices that were ...
research
05/25/2019

Reconstructing faces from voices

Voice profiling aims at inferring various human parameters from their sp...
research
05/17/2020

Multimodal Target Speech Separation with Voice and Face References

Target speech separation refers to isolating target speech from a multi-...
research
09/24/2022

Unsupervised active speaker detection in media content using cross-modal information

We present a cross-modal unsupervised framework for active speaker detec...
research
04/09/2019

Crossmodal Voice Conversion

Humans are able to imagine a person's voice from the person's appearance...
research
02/28/2020

Bio-Inspired Modality Fusion for Active Speaker Detection

Human beings have developed fantastic abilities to integrate information...
research
07/16/2021

Controlled AutoEncoders to Generate Faces from Voices

Multiple studies in the past have shown that there is a strong correlati...
