End-to-end speaker diarization with transformer

12/14/2021
by   Yongquan Lai, et al.
0

Speaker diarization is connected to semantic segmentation in computer vision. Inspired from MaskFormer <cit.> which treats semantic segmentation as a set-prediction problem, we propose an end-to-end approach to predict a set of targets consisting of binary masks, vocal activities and speaker vectors. Our model, which we coin DiFormer, is mainly based on a speaker encoder and a feature pyramid network (FPN) module to extract multi-scale speaker features which are then fed into a transformer encoder-decoder to predict a set of diarization targets from learned query embedding. To account for temporal characteristics of speech signal, bidirectional LSTMs are inserted into the mask prediction module to improve temporal consistency. Our model handles unknown number of speakers, speech overlaps, as well as vocal activity detection in a unified way. Experiments on multimedia and meeting datasets demonstrate the effectiveness of our approach.

READ FULL TEXT
research
05/20/2020

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

End-to-end speaker diarization for an unknown number of speakers is addr...
research
06/08/2020

MultiSpeech: Multi-Speaker Text to Speech with Transformer

Transformer-based text to speech (TTS) model (e.g., Transformer TTS <cit...
research
01/02/2020

Speaker-aware speech-transformer

Recently, end-to-end (E2E) models become a competitive alternative to th...
research
08/27/2022

Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization

This paper describes a speaker diarization model based on target speaker...
research
01/11/2022

Pyramid Fusion Transformer for Semantic Segmentation

The recently proposed MaskFormer <cit.> gives a refreshed perspective on...
research
06/24/2023

Improving End-to-End Neural Diarization Using Conversational Summary Representations

Speaker diarization is a task concerned with partitioning an audio recor...
research
06/20/2021

Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization

This paper investigates an end-to-end neural diarization (EEND) method f...

Please sign up or login with your details

Forgot password? Click here to reset