EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

03/31/2022
by Yushi Ueda, et al.

In this paper, we present a novel framework that jointly performs speaker diarization, speech separation, and speaker counting. Our proposed method combines two end-to-end methods, namely End-to-End Neural Speaker Diarization with Encoder-Decoder-based Attractor calculation (EEND-EDA) for speaker diarization and the Convolutional Time-domain Audio Separation Network (ConvTasNet) for speech separation, into a single multi-task joint model. We also propose an architecture with multiple 1x1 convolutional layers for estimating the separation masks corresponding to the number of speakers, and a post-processing technique that refines the separated speech signals using speech activity. Experiments on the LibriMix dataset show that our proposed method outperforms the baselines in diarization and separation performance for both fixed and flexible numbers of speakers, and in speaker counting performance for flexible numbers of speakers. All materials will be open-sourced and reproducible in the ESPnet toolkit.
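
The abstract does not spell out how speech activity is used to refine the separated signals, so the following is only a minimal sketch of one plausible reading: each separated waveform is gated by the frame-level speech activity estimated for the same speaker. The function name `refine_with_activity`, the `frame_len` and `threshold` parameters, and the hard zeroing of inactive frames are all assumptions made for illustration, not the authors' implementation or the ESPnet API.

```python
from typing import List


def refine_with_activity(
    separated: List[List[float]],
    activity: List[List[float]],
    frame_len: int,
    threshold: float = 0.5,
) -> List[List[float]]:
    """Zero out samples of each separated signal where its speaker is judged inactive.

    separated[s][n]: time-domain sample n of the signal separated for speaker s.
    activity[s][t]:  frame-level speech activity probability for speaker s.
    frame_len:       samples per diarization frame (hypothetical parameter).
    threshold:       activity probability above which a frame counts as speech (assumed).
    """
    refined = []
    for signal, probs in zip(separated, activity):
        out = []
        for n, sample in enumerate(signal):
            # Map the sample index to its diarization frame, clamping to the last frame.
            t = min(n // frame_len, len(probs) - 1)
            out.append(sample if probs[t] >= threshold else 0.0)
        refined.append(out)
    return refined


# Toy example: two speakers, four samples each, two samples per frame.
separated = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
activity = [[0.9, 0.1], [0.2, 0.8]]
refined = refine_with_activity(separated, activity, frame_len=2)
```

In this toy input, speaker 0 is active only in the first frame and speaker 1 only in the second, so the residual signal left in the other frame of each output is suppressed. A soft variant that multiplies by the activity probability instead of thresholding would be an equally reasonable assumption.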


research
05/20/2020

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors

End-to-end speaker diarization for an unknown number of speakers is addr...
research
07/01/2020

Exploring the time-domain deep attractor network with two-stream architectures in a reverberant environment

With the success of deep learning in speech signal processing, speaker-i...
research
12/18/2020

End-to-End Speaker Diarization as Post-Processing

This paper investigates the utilization of an end-to-end diarization mod...
research
03/17/2020

High-Resolution Speaker Counting In Reverberant Rooms Using CRNN With Ambisonics Features

Speaker counting is the task of estimating the number of people that are...
research
03/17/2020

Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method

In this paper, we propose an end-to-end post-filter method with deep att...
research
01/06/2021

Multichannel CRNN for Speaker Counting: an Analysis of Performance

Speaker counting is the task of estimating the number of people that are...
research
11/29/2020

A comparison of handcrafted, parameterized, and learnable features for speech separation

The design of acoustic features is important for speech separation. It c...
