Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast

04/28/2022
by   Boqing Zhu, et al.

We present an approach to learning voice-face representations from talking-face videos without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation between voice and face. These methods neglect the semantic content of different videos, introducing false-negative pairs as training noise. Furthermore, positive pairs are constructed from the natural co-occurrence of audio clips and visual frames; this correlation can be weak or inaccurate in much real-world data, which introduces deviated positives into the contrastive paradigm. To address these issues, we propose cross-modal prototype contrastive learning (CMPC), which takes advantage of contrastive methods while resisting the adverse effects of false negatives and deviated positives. On the one hand, CMPC learns intra-class invariance by constructing semantic-wise positives via unsupervised clustering in each modality. On the other hand, by comparing the similarities of cross-modal instances with those of cross-modal prototypes, we dynamically recalibrate the contribution of unlearnable instances to the overall loss. Experiments show that the proposed approach outperforms state-of-the-art unsupervised methods on various voice-face association evaluation protocols. Additionally, in the low-shot supervision setting, our method also yields a significant improvement over previous instance-wise contrastive learning.

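The following is a minimal sketch of how the two ingredients described above (prototype contrast from unsupervised clustering, plus recalibration of instance-wise loss) could be combined. The clustering method (k-means), the temperature, the sigmoid-based weighting, and all tensor names are illustrative assumptions, not the paper's exact loss.

```python
# Minimal CMPC-style sketch: prototype contrast + recalibrated instance contrast.
# Assumptions: k-means prototypes, a shared temperature, and a sigmoid weighting
# derived from the gap between instance- and prototype-level similarities.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def cmpc_loss(voice_emb, face_emb, num_prototypes=8, tau=0.1):
    """Contrast voice instances against face prototypes and recalibrate the
    instance-wise contrastive term. Inputs are paired (N, D) embeddings."""
    voice_emb = F.normalize(voice_emb, dim=1)
    face_emb = F.normalize(face_emb, dim=1)

    # Unsupervised clustering in the face modality yields semantic-wise prototypes.
    km = KMeans(n_clusters=num_prototypes, n_init=10).fit(
        face_emb.detach().cpu().numpy()
    )
    protos = F.normalize(
        torch.tensor(km.cluster_centers_, dtype=voice_emb.dtype), dim=1
    )
    assign = torch.tensor(km.labels_, dtype=torch.long)  # pseudo-label per pair

    # Cross-modal prototype contrast: pull each voice toward its paired face's prototype.
    proto_logits = voice_emb @ protos.t() / tau          # (N, K)
    proto_loss = F.cross_entropy(proto_logits, assign)

    # Recalibration (assumed form): down-weight instances whose instance-level
    # similarity lags behind their prototype-level similarity.
    inst_sim = (voice_emb * face_emb).sum(dim=1)         # paired instance similarity
    proto_sim = (voice_emb * protos[assign]).sum(dim=1)  # similarity to assigned prototype
    weights = torch.sigmoid(inst_sim - proto_sim).detach()

    # Weighted instance-wise InfoNCE over the batch.
    inst_logits = voice_emb @ face_emb.t() / tau
    targets = torch.arange(voice_emb.size(0))
    inst_loss = (
        weights * F.cross_entropy(inst_logits, targets, reduction="none")
    ).mean()

    return proto_loss + inst_loss


if __name__ == "__main__":
    voice = torch.randn(64, 128)  # stand-in for learned voice embeddings
    face = torch.randn(64, 128)   # stand-in for learned face embeddings
    print(cmpc_loss(voice, face).item())
```

In practice one would run the clustering periodically over the whole dataset rather than per batch, and apply the symmetric face-to-voice terms as well; the sketch keeps only the per-batch, one-direction case for brevity.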
