Seeking the Shape of Sound: An Adaptive Framework for Learning Voice-Face Association

03/12/2021
by Peisong Wen, et al.

Recent work has made early progress on automatically learning the association between voices and faces, prompting a new wave of studies in the computer vision community. However, most prior methods (a) rely only on local information to perform modality alignment and (b) ignore the fact that learning difficulty varies across different subjects. In this paper, we propose a novel framework that addresses both issues jointly. To address (a), we propose a two-level modality alignment loss that considers both global and local information. Unlike existing methods, it introduces a global term into the modality alignment process, and this global term is driven by identity classification. Theoretically, we show that minimizing the loss can maximize the distance between embeddings of different identities while minimizing the distance between embeddings of the same identity, in a global sense (rather than within a mini-batch). To address (b), we propose a dynamic reweighting scheme that better exploits hard but valuable identities while filtering out unlearnable ones. Experiments show that the proposed method outperforms previous methods in multiple settings, including voice-face matching, verification, and retrieval.
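The abstract gives no formulas, so the following is only a rough sketch of how a two-level alignment loss with per-identity weights might be put together; it is not the authors' formulation. Everything in it is an assumption made for illustration: the global term is taken to be cross-entropy under an identity classifier shared by both modalities, the local term is taken to be an in-batch contrastive loss, and the names `two_level_loss`, `reweight`, `lam`, `temperature`, and `drop_fraction` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def global_loss(emb, labels, class_weights):
    # Assumed global term: identity-classification cross-entropy.
    # class_weights has one row per identity in the whole training set,
    # so the comparison is not limited to the current mini-batch.
    logp = log_softmax(emb @ class_weights.T)
    return -logp[np.arange(len(labels)), labels]

def local_loss(voice, face, temperature=0.1):
    # Assumed local term: each voice should match its own face among
    # the faces in the batch (one sample per identity per batch).
    sims = l2_normalize(voice) @ l2_normalize(face).T / temperature
    return -np.diag(log_softmax(sims))

def two_level_loss(voice, face, labels, class_weights, id_weights, lam=1.0):
    # Per-sample global + local loss, averaged with per-identity weights.
    g = 0.5 * (global_loss(voice, labels, class_weights)
               + global_loss(face, labels, class_weights))
    per_sample = g + lam * local_loss(voice, face)
    w = id_weights[labels]
    return float(np.sum(w * per_sample) / max(np.sum(w), 1e-12))

def reweight(per_id_loss, drop_fraction=0.1):
    # Hypothetical reweighting rule: harder identities get larger weights,
    # but the hardest drop_fraction are treated as unlearnable (weight 0).
    w = per_id_loss / per_id_loss.mean()
    cutoff = np.quantile(per_id_loss, 1.0 - drop_fraction)
    w[per_id_loss > cutoff] = 0.0
    return w
```

In this sketch, `reweight` would be called periodically on running per-identity losses and its output passed as `id_weights`; a weight of zero removes an identity from the objective, which is one simple way to model "filtering out" identities.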


