Single microphone speaker extraction using unified time-frequency Siamese-Unet

03/06/2022
by Aviad Eisenberg, et al.

In this paper, we present a unified time-frequency method for speaker extraction in clean and noisy conditions. Given a mixed signal and a reference signal, common approaches extract the desired speaker either in the time domain or in the frequency domain. We propose a Siamese-Unet architecture that uses both representations. The Siamese encoders are applied in the frequency domain to infer the embeddings of the noisy and reference spectra, respectively. The concatenated representations are then fed into the decoder to estimate the real and imaginary components of the desired speaker, which are then inverse-transformed to the time domain. The model is trained with the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) loss to exploit the time-domain information. The time-domain loss is also regularized with a frequency-domain loss to preserve the speech patterns. Experimental results demonstrate that the unified approach is not only very easy to train, but also provides superior results compared with state-of-the-art (SOTA) Blind Source Separation (BSS) methods, as well as a commonly used speaker extraction approach.
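To make the time-domain training objective concrete, the following is a minimal sketch of the commonly used SI-SDR definition and its negation as a loss. It is not taken from the paper: the function names `si_sdr` and `si_sdr_loss` and the stabilizer `eps` are illustrative assumptions, and the abstract does not say which exact SI-SDR variant (for example, with or without mean removal) or which frequency-domain regularizer the authors use, so the regularization term is left out.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB (common textbook definition, assumed here).

    `estimate` and `target` are equal-length sequences of time-domain samples.
    `eps` is a hypothetical stabilizer to avoid division by zero and log(0).
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")

    # Project the estimate onto the target to find the optimal scaling.
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    scale = dot / (target_energy + eps)

    # Scaled target component and the residual (distortion) component.
    projection = [scale * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]

    signal_power = sum(p * p for p in projection)
    distortion_power = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_power + eps) / (distortion_power + eps))


def si_sdr_loss(estimate, target):
    """Negative SI-SDR, so that minimizing the loss maximizes SI-SDR."""
    return -si_sdr(estimate, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is the property the "scale-invariant" name refers to.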
