UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures

05/31/2023
by Zhong-Qiu Wang et al.

In reverberant conditions with multiple concurrent speakers, each microphone acquires a mixture signal of the speakers at a different spatial location. In over-determined conditions, where the microphones outnumber the speakers, we can narrow down the solutions to the speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to that microphone's mixture). Equipped with this insight, we propose UNSSOR, an algorithm for unsupervised neural speech separation by leveraging over-determined training mixtures. At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered estimates of all the speakers add up to the mixture, thereby satisfying the above constraint. We show that this loss can promote unsupervised separation of speakers. The linear filters are computed in each sub-band, based on the mixture and the DNN estimates, through the forward convolutive prediction (FCP) algorithm. To address the frequency permutation problem incurred by sub-band FCP, we propose a loss term that minimizes intra-source magnitude scattering. Although UNSSOR requires over-determined training mixtures, the trained DNNs can perform under-determined separation (e.g., unsupervised monaural speech separation). Evaluation results on two-speaker separation in reverberant conditions show the effectiveness and potential of UNSSOR.
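The core idea — filter each DNN estimate toward a microphone's mixture with a per-sub-band linear (convolutive) filter solved by least squares, then penalize the gap between the mixture and the sum of the filtered estimates — can be sketched as follows. This is a simplified illustration, not the paper's implementation: UNSSOR uses weighted least squares for FCP and additional loss terms (e.g., the intra-source magnitude-scattering term), and the function names here are hypothetical. The sketch operates on one frequency bin of complex STFT sequences.

```python
import numpy as np

def fcp_filter(est, mix, taps=4):
    """Sub-band forward convolutive prediction (simplified sketch).

    est, mix: complex STFT sequences of shape (T,) for one frequency bin.
    Solves, via unweighted least squares, for a causal `taps`-tap filter g
    such that (g * est) best predicts mix, and returns the filtered estimate.
    """
    T = len(est)
    # Matrix of delayed copies of est, so that A @ g implements convolution.
    A = np.zeros((T, taps), dtype=complex)
    for k in range(taps):
        A[k:, k] = est[:T - k]
    g, *_ = np.linalg.lstsq(A, mix, rcond=None)
    return A @ g

def mixture_constraint_loss(estimates, mixtures, taps=4):
    """Mixture-constraint loss over all microphones (simplified sketch).

    estimates: list (one per speaker) of (T,) complex sub-band DNN estimates.
    mixtures:  list (one per microphone) of (T,) complex sub-band mixtures.
    At each microphone, the filtered speaker estimates should re-sum to
    the observed mixture; the loss is the total squared residual.
    """
    loss = 0.0
    for y in mixtures:
        resummed = sum(fcp_filter(z, y, taps) for z in estimates)
        loss += np.sum(np.abs(y - resummed) ** 2)
    return loss
```

Because the filters absorb relative transfer functions, a DNN estimate only needs to match a speaker image up to a linear filter for the constraint to be satisfiable, which is what lets this loss drive separation without clean references.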

