Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?

03/18/2022
by Cho-Ying Wu, et al.

This work digs into a fundamental question in human perception: can face geometry be gleaned from one's voice? Previous works that study this question only adopt developments in image synthesis and convert voices into face images to show correlations, but working in the image domain unavoidably involves predicting attributes that voices cannot hint at, such as facial textures, hairstyles, and backgrounds. We instead investigate the ability to reconstruct 3D faces in order to concentrate only on geometry, which is much more physiologically grounded. We propose our analysis framework, Cross-Modal Perceptionist, under both supervised and unsupervised learning. First, we construct a dataset, Voxceleb-3D, which extends Voxceleb with paired voices and face meshes, making supervised learning possible. Second, we use a knowledge distillation mechanism to study whether face geometry can still be gleaned from voices without paired voice and 3D face data, given the limited availability of 3D face scans. We break down the core question into four parts and perform visual and numerical analyses as responses to the core question. Our findings echo those in physiology and neuroscience about the correlation between voices and facial structures. The work provides explainable foundations for future human-centric cross-modal learning. See our project page: https://choyingw.github.io/works/Voice2Mesh/index.html
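To make the knowledge-distillation idea above concrete, the following is a minimal sketch under assumed module names (VoiceEncoder, MeshDecoder, and a frozen image-based teacher are hypothetical and not the authors' actual implementation): a pretrained image-to-3D face reconstructor produces pseudo ground-truth meshes from face images, and a voice-to-mesh student is trained to match them, so no paired voice/3D-scan data is required.

```python
# Hypothetical sketch of cross-modal knowledge distillation for voice-to-mesh
# learning (PyTorch). Names and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Maps a log-mel spectrogram to a fixed-dimensional voice embedding."""
    def __init__(self, n_mels=80, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(128, emb_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # temporal pooling -> (B, emb_dim, 1)
        )
    def forward(self, mel):                   # mel: (B, n_mels, T)
        return self.net(mel).squeeze(-1)      # (B, emb_dim)

class MeshDecoder(nn.Module):
    """Regresses 3D vertex coordinates of a fixed-topology face mesh."""
    def __init__(self, emb_dim=256, n_vertices=5023):
        super().__init__()
        self.n_vertices = n_vertices
        self.fc = nn.Sequential(
            nn.Linear(emb_dim, 512), nn.ReLU(),
            nn.Linear(512, n_vertices * 3),
        )
    def forward(self, z):
        return self.fc(z).view(-1, self.n_vertices, 3)

def distillation_step(mel, face_image, student_enc, student_dec, teacher, optimizer):
    """One training step: the frozen image-based teacher supplies the target mesh."""
    with torch.no_grad():
        target_mesh = teacher(face_image)            # (B, V, 3) pseudo ground truth
    pred_mesh = student_dec(student_enc(mel))        # mesh predicted from voice alone
    loss = nn.functional.l1_loss(pred_mesh, target_mesh)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the supervised setting described in the abstract, the same student could instead be fitted directly to the meshes in Voxceleb-3D, replacing the online teacher call with the stored target mesh; the distillation path is what removes the need for paired voice and 3D face data.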


