Learnable PINs: Cross-Modal Embeddings for Person Identity

05/02/2018
by Arsha Nagrani, et al.

We propose and investigate an identity-sensitive joint embedding of face and voice. Such an embedding enables cross-modal retrieval from voice to face and from face to voice. We make the following four contributions: first, we show that the embedding can be learnt from videos of talking faces, without requiring any identity labels, using a form of cross-modal self-supervision; second, we develop a curriculum learning schedule for hard negative mining, tailored to this task, which is essential for learning to proceed successfully; third, we demonstrate and evaluate cross-modal retrieval for identities unseen and unheard during training over a number of scenarios, and establish a benchmark for this novel task; finally, we show an application of the joint embedding to automatically retrieving and labelling characters in TV dramas.
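The idea of a curriculum for hard negative mining can be illustrated with a small sketch. The abstract does not specify the loss, the schedule, or the network, so everything below is an assumption made for illustration: a margin-based contrastive loss, a linear schedule, and the hypothetical names `hardness_at`, `select_negatives`, and `contrastive_loss`. Face and voice embeddings from the same video clip are treated as positive pairs, and negatives are drawn from the other clips in the batch, starting with easy (distant) ones and moving towards hard (close) ones as training progresses.

```python
import numpy as np


def hardness_at(epoch, num_epochs):
    """Hypothetical linear curriculum: 0.0 (easiest) at the start, 1.0 (hardest) at the end."""
    return min(1.0, epoch / max(1, num_epochs - 1))


def select_negatives(face_emb, voice_emb, hardness):
    """For each face i, pick the index of a negative voice j != i from the batch.

    Row i of face_emb and row i of voice_emb are assumed to come from the same
    clip (the positive pair). hardness=0 picks the farthest voice in the batch,
    hardness=1 the closest one.
    """
    n = face_emb.shape[0]
    # Pairwise Euclidean distances between every face and every voice.
    dist = np.linalg.norm(face_emb[:, None, :] - voice_emb[None, :, :], axis=2)
    # Push the positive pair to the end of the ordering so it is never selected.
    np.fill_diagonal(dist, -np.inf)
    # Negatives ordered from farthest (easy) to closest (hard); positive is last.
    order = np.argsort(-dist, axis=1)
    rank = int(round(hardness * (n - 2)))
    return order[:, rank]


def contrastive_loss(face_emb, voice_emb, neg_idx, margin=1.0):
    """Assumed margin-based contrastive loss: pull positives together, push negatives apart."""
    d_pos = np.linalg.norm(face_emb - voice_emb, axis=1)
    d_neg = np.linalg.norm(face_emb - voice_emb[neg_idx], axis=1)
    return float(np.mean(d_pos ** 2 + np.maximum(0.0, margin - d_neg) ** 2))


# Toy batch: four clips whose face and voice embeddings already coincide.
faces = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [7.0, 0.0]])
voices = faces.copy()

easy = select_negatives(faces, voices, hardness_at(0, 10))
hard = select_negatives(faces, voices, hardness_at(9, 10))
loss_easy = contrastive_loss(faces, voices, easy, margin=2.0)
loss_hard = contrastive_loss(faces, voices, hard, margin=2.0)
```

In this toy batch the easy negatives all lie beyond the margin and contribute nothing to the loss, while the hard negatives do. This suggests why a schedule that starts easy and gradually increases difficulty might help optimisation get off the ground.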


research
11/21/2019

Voice-Face Cross-modal Matching and Retrieval: A Benchmark

Cross-modal associations between voice and face from a person can be lea...
research
11/21/2022

TimbreCLIP: Connecting Timbre to Text and Images

We present work in progress on TimbreCLIP, an audio-text cross modal emb...
research
04/01/2018

Seeing Voices and Hearing Faces: Cross-modal biometric matching

We introduce a seemingly impossible task: given only an audio clip of so...
research
08/22/2022

Learning Branched Fusion and Orthogonal Projection for Face-Voice Association

Recent years have seen an increased interest in establishing association...
research
04/21/2021

Voice2Mesh: Cross-Modal 3D Face Model Generation from Voices

This work focuses on the analysis that whether 3D face models can be lea...
research
12/20/2021

Fusion and Orthogonal Projection for Improved Face-Voice Association

We study the problem of learning association between face and voice, whi...
research
04/28/2022

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

We present an approach to learn voice-face representations from the talk...
