Simultaneous Speech Extraction for Multiple Target Speakers under the Meeting Scenarios (V1)

06/17/2022
by Bang Zeng, et al.

Target speech separation and extraction in meeting scenarios have recently become an active research topic. We propose a speaker diarization aware multiple target speech separation system (SD-MTSS) that extracts the voice of each speaker from the mixed speech simultaneously, rather than through a succession of independent processes as in previous solutions. SD-MTSS consists of a speaker diarization (SD) module and a multiple target speech separation (MTSS) module. The SD module infers the target speaker voice activity detection (TSVAD) states of the mixture and collects each speaker's single-talker audio segments to serve as reference speech. The MTSS module takes both the mixed audio and the reference speech as inputs and generates an estimated mask. By exploiting the TSVAD decision and the estimated mask, SD-MTSS can extract the speech of every speaker in a conversation recording concurrently, without requiring enrollment audio in advance. Experimental results show that our MTSS model outperforms our baselines by a large margin, achieving improvements of 1.38 dB SDR, 1.34 dB SI-SNR, and 0.13 PESQ over the state-of-the-art SpEx+ baseline on the WSJ0-2mix-extr dataset. The SD-MTSS system also improves significantly over the baseline on the Alimeeting dataset.
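
The abstract does not spell out how the TSVAD decision and the estimated mask are combined, so the following is only a minimal sketch of one plausible reading: each speaker's mask is gated frame by frame with that speaker's TSVAD decision and then applied to the mixture spectrogram. The function name `extract_speakers`, the array shapes, and the use of a plain element-wise product are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def extract_speakers(mixture_spec, masks, tsvad):
    """Gate per-speaker masks with TSVAD decisions and apply them to the mixture.

    Hypothetical helper, not the authors' implementation.
    mixture_spec: (F, T) spectrogram of the mixed speech
    masks:        (S, F, T) estimated mask per speaker, values in [0, 1]
    tsvad:        (S, T) frame-level activity per speaker (1 = active, 0 = silent)
    returns:      (S, F, T) estimated spectrogram per speaker
    """
    mixture_spec = np.asarray(mixture_spec)
    masks = np.asarray(masks, dtype=float)
    tsvad = np.asarray(tsvad, dtype=float)

    if masks.shape[1:] != mixture_spec.shape:
        raise ValueError("each mask must have the same (F, T) shape as the mixture")
    if tsvad.shape != (masks.shape[0], masks.shape[2]):
        raise ValueError("tsvad must have shape (S, T)")

    # Broadcast the (S, T) activity over the frequency axis so that frames in
    # which a speaker is judged inactive are zeroed for that speaker.
    gated_masks = masks * tsvad[:, None, :]

    # Apply every gated mask to the same mixture in a single broadcasted product,
    # so all speakers are extracted in one pass.
    return gated_masks * mixture_spec[None, :, :]


# Toy example: 2 speakers, 4 frequency bins, 6 frames (random placeholder data).
rng = np.random.default_rng(0)
mix = rng.standard_normal((4, 6))
masks = rng.uniform(size=(2, 4, 6))
tsvad = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 1, 1, 1, 1]])
est = extract_speakers(mix, masks, tsvad)
print(est.shape)  # (2, 4, 6)
```

In this toy setup the two speakers overlap only in the third frame; everywhere a speaker's TSVAD decision is 0, the corresponding output frames are exactly zero regardless of the mask value.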

research
09/15/2023

Audio-Visual Active Speaker Extraction for Sparsely Overlapped Multi-talker Speech

Target speaker extraction aims to extract the speech of a specific speak...
research
09/19/2023

USED: Universal Speaker Extraction and Diarization

Speaker extraction and diarization are two crucial enabling techniques f...
research
05/17/2020

Multimodal Target Speech Separation with Voice and Face References

Target speech separation refers to isolating target speech from a multi-...
research
03/24/2019

Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss

The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It a...
research
10/12/2022

Individualized Conditioning and Negative Distances for Speaker Separation

Speaker separation aims to extract multiple voices from a mixed signal. ...
research
05/18/2020

A Thousand Words are Worth More Than One Recording: NLP Based Speaker Change Point Detection

Speaker Diarization (SD) consists of splitting or segmenting an input au...
research
02/29/2020

Voice Separation with an Unknown Number of Multiple Speakers

We present a new method for separating a mixed audio sequence, in which ...
