Self-supervised learning for audio-visual speaker diarization

by Yifan Ding, et al.

Speaker diarization, the task of finding the speech segments belonging to specific speakers, is widely used in human-centered applications such as video conferencing and human-computer interaction systems. In this paper, we propose a self-supervised audio-video synchronization learning method that addresses speaker diarization without a massive labeling effort. We improve on previous approaches by introducing two new loss functions: the dynamic triplet loss and the multinomial loss. We evaluate them on a real-world human-computer interaction system, where our best model yields a remarkable gain of +8%. We also introduce a new large-scale audio-video corpus designed to fill the vacancy of audio-video datasets in Chinese.
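The paper's dynamic triplet loss is not specified in this abstract; as a point of reference, a minimal sketch of the standard triplet loss it builds on, applied to audio-visual synchronization, looks like the following. The function name, embedding values, and margin are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: pull the synchronized (audio, video)
    pair together and push the unsynchronized pair apart until the
    distance gap exceeds the margin."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to matching video frame
    d_neg = np.linalg.norm(anchor - negative)  # distance to mismatched frame
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: an audio anchor, a matching video frame, a mismatched one.
audio = np.array([1.0, 0.0])
video_sync = np.array([0.9, 0.1])
video_off = np.array([-1.0, 0.2])
print(triplet_loss(audio, video_sync, video_off))  # prints 0.0 (gap already exceeds margin)
```

The "dynamic" variant described in the paper presumably adapts how triplets are selected or weighted during training, but those details are beyond what this abstract states.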


Self-Supervised Learning of Audio-Visual Objects from Video

Our objective is to transform a video into a set of discrete audio-visua...

Chronological Self-Training for Real-Time Speaker Diarization

Diarization partitions an audio stream into segments based on the voices...

Co-Training of Audio and Video Representations from Self-Supervised Temporal Synchronization

There is a natural correlation between the visual and auditive elements ...

LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading

The aim of this work is to investigate the impact of crossmodal self-sup...

I'm Sorry for Your Loss: Spectrally-Based Audio Distances Are Bad at Pitch

Growing research demonstrates that synthetic failure modes imply poor ge...

AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

Active speaker detection is an important component in video analysis alg...