Self-supervised learning for audio-visual speaker diarization

02/13/2020
by Yifan Ding, et al.

Speaker diarization, the task of finding the speech segments of specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We test them on a real-world human-computer interaction system, and the results show that our best model yields a remarkable gain of +8 [...]. We also introduce a new large-scale audio-video corpus designed to fill the gap in Chinese audio-video datasets.


