TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

03/07/2023
by Christoph Boeddeker, et al.

Because diarization and source separation of meeting data are closely related tasks, we propose an approach that performs the two jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at time-frequency resolution. These estimates act as masks for source extraction, either via masking or via beamforming. The technique can be applied to both single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER.
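
To make the idea of time-frequency activity estimates acting as extraction masks concrete, here is a minimal NumPy sketch of the masking variant only. It is not the authors' implementation: the function names, the array layout (speakers, frames, frequency bins), and the mean-over-frequency threshold used to recover frame-level activity are all assumptions made for illustration. The beamforming variant is not shown.

```python
import numpy as np


def extract_by_masking(mixture_stft, masks):
    """Apply per-speaker time-frequency masks to a single-channel mixture.

    mixture_stft: complex array of shape (T, F), the mixture STFT.
    masks: real array of shape (K, T, F) with values in [0, 1],
        one mask per speaker (assumed layout, hypothetical).
    Returns a complex array of shape (K, T, F), one estimate per speaker.
    """
    return masks * mixture_stft[None, :, :]


def frame_activity(masks, threshold=0.5):
    """Collapse masks to frame-level speaker activity (diarization output).

    Assumption: a speaker counts as active in a frame when the mask,
    averaged over frequency, exceeds `threshold`.
    Returns a boolean array of shape (K, T).
    """
    return masks.mean(axis=-1) > threshold


# Toy input: 2 speakers, 4 frames, 3 frequency bins.
mixture = np.ones((4, 3), dtype=complex)
masks = np.zeros((2, 4, 3))
masks[0, :2] = 1.0  # speaker 0 speaks in frames 0 and 1
masks[1, 2:] = 1.0  # speaker 1 speaks in frames 2 and 3

estimates = extract_by_masking(mixture, masks)
activity = frame_activity(masks)
```

In this sketch the same mask tensor yields both outputs: multiplying it with the mixture gives the separated signals, and reducing it over frequency gives a per-frame activity decision, which is the sense in which the two tasks are handled jointly.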

research
10/11/2018

VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking

In this paper, we present a novel system that separates the voice of a t...
research
10/28/2022

Target-Speaker Voice Activity Detection via Sequence-to-Sequence Prediction

Target-speaker voice activity detection is currently a promising approac...
research
05/14/2020

Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario

Speaker diarization for real-life scenarios is an extremely challenging ...
research
05/28/2021

DIVE: End-to-end Speech Diarization via Iterative Speaker Embedding

We introduce DIVE, an end-to-end speaker diarization algorithm. Our neur...
research
06/05/2021

Lightweight Dual-channel Target Speaker Separation for Mobile Voice Communication

Nowadays, there is a strong need to deploy the target speaker separation...
research
10/30/2019

SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition

We present a multi-channel database of overlapping speech for training, ...
research
04/04/2022

Target Confusion in End-to-end Speaker Extraction: Analysis and Approaches

Recently, end-to-end speaker extraction has attracted increasing attenti...