Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision

04/29/2020
by   Soo-Whan Chung, et al.

The goal of this work is to train discriminative cross-modal embeddings without access to manually annotated data. Recent advances in self-supervised learning have shown that effective representations can be learnt from natural cross-modal synchrony. We build on earlier work to train embeddings that are more discriminative for uni-modal downstream tasks. To this end, we propose a novel training strategy that not only optimises metrics across modalities, but also enforces intra-class feature separation within each of the modalities. The effectiveness of the method is demonstrated on two downstream tasks: lip reading using the features trained on audio-visual synchronisation, and speaker recognition using the features trained for cross-modal biometric matching. The proposed method outperforms state-of-the-art self-supervised baselines by a significant margin.
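The abstract does not give the loss itself, so the following is only an illustrative sketch of how a cross-modal matching objective might be combined with a within-modality separation term. It is not the paper's formulation. The function names, the cosine-similarity softmax, the hinge penalty, and the `temperature`, `margin` and `weight` parameters are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v + 1e-12)


def cross_modal_loss(audio, video, temperature=0.1):
    """Softmax cross-entropy across modalities (assumed form).

    Each video embedding should be most similar to the audio
    embedding at the same index; the other audio embeddings in the
    batch act as negatives.
    """
    total = 0.0
    for i, v in enumerate(video):
        logits = [cosine(v, a) / temperature for a in audio]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(video)


def intra_modal_penalty(embeddings, margin=0.5):
    """Hinge penalty within one modality (assumed form).

    Penalises pairs of distinct items whose cosine similarity
    exceeds `margin`, which pushes features in the same modality apart.
    """
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 0.0
    excess = sum(
        max(0.0, cosine(embeddings[i], embeddings[j]) - margin)
        for i, j in pairs
    )
    return excess / len(pairs)


def combined_loss(audio, video, weight=1.0):
    """Cross-modal term plus a weighted separation term per modality."""
    separation = intra_modal_penalty(audio) + intra_modal_penalty(video)
    return cross_modal_loss(audio, video) + weight * separation
```

In this sketch, a batch whose audio and video embeddings are correctly paired and well separated yields a lower `combined_loss` than a batch with mismatched pairs or with near-identical embeddings inside one modality.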
