Seeing Voices and Hearing Faces: Cross-modal biometric matching

04/01/2018
by Arsha Nagrani et al.

We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images shows the speaker. We study this and a number of related cross-modal tasks, aiming to answer the question: how much can we infer from the voice about the face, and vice versa? We study these tasks "in the wild", using the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic testing of cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and that it performs well above chance even on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. faces of different gender) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age and nationality).
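
To make the "above chance" claim concrete, the following minimal sketch shows how accuracy on an N-way forced-choice trial might be scored and what the chance level is for the binary and 10-way settings. It is not the authors' architecture or evaluation code: the function names `chance_accuracy` and `forced_choice_accuracy` are hypothetical, and the matching scores are random numbers standing in for whatever a trained model would output.

```python
import numpy as np


def chance_accuracy(n_way: int) -> float:
    """Expected accuracy of uniform random guessing among n_way candidates."""
    return 1.0 / n_way


def forced_choice_accuracy(scores: np.ndarray, targets: np.ndarray) -> float:
    """Score N-way forced-choice trials.

    scores:  (num_trials, n_way) array; scores[i, j] is a hypothetical
             matching score between the voice in trial i and candidate face j.
    targets: (num_trials,) array with the index of the true speaker's face.
    """
    predictions = scores.argmax(axis=1)  # pick the best-scoring face per trial
    return float((predictions == targets).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_trials, n_way = 10_000, 10

    # Assumption: random scores simulate a model that has learned nothing,
    # so its accuracy should sit close to the chance level.
    targets = rng.integers(0, n_way, size=num_trials)
    random_scores = rng.random((num_trials, n_way))

    print("chance (binary):", chance_accuracy(2))
    print("chance (10-way):", chance_accuracy(n_way))
    print("random-score accuracy:", forced_choice_accuracy(random_scores, targets))
```

Under this framing, chance is 50% for the binary task and 10% for the 10-way task, so any reported accuracy should be read against the baseline for its own setting.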


