Disentangled Speech Embeddings using Cross-modal Self-supervision

02/20/2020
by Arsha Nagrani, et al.

The objective of this paper is to learn representations of speaker identity without access to manually annotated data. To do so, we develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video. The key idea behind our approach is to tease apart, without annotation, the representations of linguistic content and speaker identity. We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors, offering the potential for greater generalisation to novel combinations of content and identity and ultimately producing speaker identity representations that are more robust. We train our method on a large-scale audio-visual dataset of talking heads 'in the wild', and demonstrate its efficacy by evaluating the learned speaker representations for standard speaker recognition performance.
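The two-stream idea in the abstract (a shared low-level trunk feeding two separate heads, one for linguistic content and one for speaker identity) can be sketched as a minimal forward pass. This is an illustrative sketch only, not the paper's actual network: all layer sizes, names, and the choice of plain linear layers with a ReLU are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
D_IN, D_SHARED, D_CONTENT, D_IDENTITY = 40, 64, 32, 32

# Shared trunk: low-level features common to both factors.
W_shared = rng.standard_normal((D_IN, D_SHARED)) * 0.1
# Two heads that split the shared features into separate streams.
W_content = rng.standard_normal((D_SHARED, D_CONTENT)) * 0.1
W_identity = rng.standard_normal((D_SHARED, D_IDENTITY)) * 0.1

def relu(x):
    return np.maximum(x, 0.0)

def two_stream(x):
    """Map input features to (content, identity) embeddings."""
    h = relu(x @ W_shared)       # shared low-level representation
    content = h @ W_content      # linguistic-content stream
    identity = h @ W_identity    # speaker-identity stream
    return content, identity

# A batch of 8 frames of 40-dim audio features (e.g. spectrogram frames).
x = rng.standard_normal((8, D_IN))
content_emb, identity_emb = two_stream(x)
print(content_emb.shape, identity_emb.shape)  # (8, 32) (8, 32)
```

In the paper's framing, the disentangling pressure comes from the training objectives applied to each head (synchrony for content, cross-modal identity matching for identity), not from the split architecture alone; the sketch shows only the shared-then-split structure.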

Related research

Seeing voices and hearing voices: learning discriminative embeddings using cross-modal self-supervision (04/29/2020)
The goal of this work is to train discriminative cross-modal embeddings ...

Unsupervised active speaker detection in media content using cross-modal information (09/24/2022)
We present a cross-modal unsupervised framework for active speaker detec...

Perfect match: Improved cross-modal embeddings for audio-visual synchronisation (09/21/2018)
This paper proposes a new strategy for learning powerful cross-modal emb...

Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection (01/18/2022)
One of the most pressing challenges for the detection of face-manipulate...

A Multi-View Approach To Audio-Visual Speaker Verification (02/11/2021)
Although speaker verification has conventionally been an audio-only task...

VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion (02/18/2022)
Though significant progress has been made for speaker-dependent Video-to...

Cross-modal Speaker Verification and Recognition: A Multilingual Perspective (04/28/2020)
Recent years have seen a surge in finding association between faces and ...
