A Multi-View Approach To Audio-Visual Speaker Verification

02/11/2021
by   Leda Sarı, et al.

Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then propose a novel approach to handle cross-modal verification at test time. Specifically, we investigate unimodal and concatenation-based AV fusion and report the lowest AV equal error rate (EER) of 0.7%. Since these fusion methods lack the ability to do cross-modal verification, we introduce a multi-view model which uses a shared classifier to map audio and video into the same space. This new approach achieves 28% EER in the challenging testing condition of cross-modal verification.
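To make the two modeling ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) concatenation-based AV fusion and (b) a multi-view model that routes both modalities through a shared classifier so audio and video land in the same embedding space. All layer sizes, module names, and the cosine-similarity scoring are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; dimensions and architecture are assumptions,
# not the system described in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFusionVerifier(nn.Module):
    """Concatenation-based AV fusion: audio and visual embeddings are
    concatenated and projected into a single joint speaker embedding."""
    def __init__(self, audio_dim=512, video_dim=512, embed_dim=256, num_speakers=1000):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim + video_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # Training-time speaker classifier; discarded at verification time.
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, audio_emb, video_emb):
        joint = self.proj(torch.cat([audio_emb, video_emb], dim=-1))
        return joint, self.classifier(joint)

class MultiViewVerifier(nn.Module):
    """Multi-view model: one encoder per modality, but a *shared* classifier
    forces both encoders to map into the same speaker space, which makes
    cross-modal (audio-vs-video) comparison possible at test time."""
    def __init__(self, audio_dim=512, video_dim=512, embed_dim=256, num_speakers=1000):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, embed_dim)
        self.video_enc = nn.Linear(video_dim, embed_dim)
        self.shared_classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, audio_emb=None, video_emb=None):
        logits = {}
        if audio_emb is not None:
            logits["audio"] = self.shared_classifier(self.audio_enc(audio_emb))
        if video_emb is not None:
            logits["video"] = self.shared_classifier(self.video_enc(video_emb))
        return logits

def verify(enroll_emb, test_emb, threshold=0.5):
    """Score a trial by cosine similarity of embeddings; with the multi-view
    model the enrollment and test embeddings may come from different modalities."""
    score = F.cosine_similarity(enroll_emb, test_emb, dim=-1)
    return score, score > threshold
```

In this sketch, equal error rate would be obtained by sweeping the decision threshold over a set of trial scores until the false-accept and false-reject rates coincide.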

Related research

09/18/2019 · Deep Latent Space Learning for Cross-modal Mapping of Audio and Visual Signals
We propose a novel deep training algorithm for joint representation of a...

08/23/2023 · AdVerb: Visually Guided Audio Dereverberation
We present AdVerb, a novel audio-visual dereverberation framework that u...

09/13/2023 · PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
It is common in everyday spoken communication that we look at the turnin...

09/13/2023 · Weakly-Supervised Multi-Task Learning for Audio-Visual Speaker Verification
In this paper, we present a methodology for achieving robust multimodal ...

06/28/2019 · Lipper: Synthesizing Thy Speech using Multi-View Lipreading
Lipreading has a lot of potential applications such as in the domain of ...

02/20/2020 · Disentangled Speech Embeddings using Cross-modal Self-supervision
The objective of this paper is to learn representations of speaker ident...

02/11/2021 · A Fractal Approach to Characterize Emotions in Audio and Visual Domain: A Study on Cross-Modal Interaction
It is already known that both auditory and visual stimulus is able to co...
