Encoder-decoder multimodal speaker change detection

06/01/2023
by Jee-weon Jung, et al.

Speaker change detection (SCD), the task of detecting the points at which the speaker changes in an input, is essential for several applications. Several studies have addressed SCD using audio inputs only, with limited performance. Recently, multimodal SCD (MMSCD) models, which utilise the text modality in addition to audio, have shown improved performance. In this study, the proposed model is built upon two main proposals: a novel mechanism for modality fusion and the adoption of an encoder-decoder architecture. Unlike previous MMSCD works, which extract speaker embeddings from extremely short audio segments aligned to a single word, we use a speaker embedding extracted from 1.5 s of audio. A transformer decoder layer further improves the performance of an encoder-only MMSCD model. The proposed model achieves state-of-the-art results among studies that report SCD performance and is also on par with recent work that combines SCD with automatic speech recognition via human transcription.
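To make the contrast between word-aligned segments and a fixed 1.5 s span concrete, the following is a minimal sketch of how a fixed-length window could be chosen for each word. Only the 1.5 s length comes from the abstract. Centring the window on the word, clamping it to the recording, and the names `fixed_window` and `spans_for_words` are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

WINDOW_SEC = 1.5  # window length stated in the abstract


def fixed_window(word_start: float, word_end: float, audio_len: float,
                 window: float = WINDOW_SEC) -> Tuple[float, float]:
    """Return a (start, end) span of `window` seconds around one word.

    Assumption (not from the paper): the span is centred on the word's
    midpoint and shifted, rather than shrunk, to stay inside the audio.
    """
    centre = (word_start + word_end) / 2.0
    start = centre - window / 2.0
    # Shift the span so that it does not run past the end of the audio.
    start = min(start, audio_len - window)
    # Never start before the beginning (also covers audio shorter than window).
    start = max(0.0, start)
    end = min(audio_len, start + window)
    return start, end


def spans_for_words(words: List[Tuple[float, float]],
                    audio_len: float) -> List[Tuple[float, float]]:
    """Compute one fixed-length span per (word_start, word_end) pair."""
    return [fixed_window(s, e, audio_len) for s, e in words]


if __name__ == "__main__":
    # Hypothetical word timestamps in seconds for a 10 s recording.
    words = [(0.1, 0.3), (5.0, 5.2), (9.8, 10.0)]
    for (ws, we), (s, e) in zip(words, spans_for_words(words, 10.0)):
        print(f"word {ws:.2f}-{we:.2f}s -> embedding span {s:.2f}-{e:.2f}s")
```

Each span would then be passed to a speaker embedding extractor in place of the much shorter word-aligned segment. The extractor itself is outside the scope of this sketch.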

research
04/05/2021

End-to-End Speaker-Attributed ASR with Transformer

This paper presents our recent effort on end-to-end speaker-attributed a...
research
05/13/2021

Audio Captioning with Composition of Acoustic and Semantic Information

Generating audio captions is a new research area that combines audio and...
research
01/25/2022

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition

Audio-visual automatic speech recognition (AV-ASR) extends the speech re...
research
06/27/2022

Sequence-level Speaker Change Detection with Difference-based Continuous Integrate-and-fire

Speaker change detection is an important task in multi-party interaction...
research
10/27/2022

On Out-of-Distribution Detection for Audio with Deep Nearest Neighbors

Out-of-distribution (OOD) detection is concerned with identifying data p...
research
05/10/2022

Best of Both Worlds: Multi-task Audio-Visual Automatic Speech Recognition and Active Speaker Detection

Under noisy conditions, automatic speech recognition (ASR) can greatly b...
research
08/08/2020

Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification

In this paper, we propose a novel way of addressing text-dependent autom...
