End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction

04/26/2018
by Zhong-Qiu Wang, et al.

This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores the reconstruction error introduced by phase inconsistency. In our approach, the loss function is defined directly on the reconstructed signals, which are optimized for best separation. In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers. While mask values are typically limited to lie between zero and one for approaches using the mixture phase for reconstruction, this limitation is less relevant if the estimated magnitudes are to be used together with phase reconstruction. We thus propose several novel activation functions for the output layer of the T-F masking network, to allow mask values beyond one. On the publicly available wsj0-2mix dataset, our approach achieves state-of-the-art 12.6 dB scale-invariant signal-to-distortion ratio (SI-SDR) and 13.1 dB SDR, revealing new possibilities for deep learning based phase reconstruction and representing fundamental progress toward solving the notoriously hard cocktail party problem.
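The unfolded iterations described above follow the Griffin-Lim pattern: keep the estimated magnitude fixed, alternate between inverse STFT and STFT, and retain only the re-analyzed phase at each step. The sketch below illustrates that loop with NumPy/SciPy; the function name, window parameters, and iteration count are illustrative assumptions, and in the paper each iteration is a differentiable layer trained end-to-end rather than a standalone post-processing step.

```python
import numpy as np
from scipy.signal import stft, istft

def unfolded_phase_reconstruction(est_mag, mix_phase, n_iter=3, nperseg=256):
    """Griffin-Lim-style phase refinement (illustrative, not the authors' code).

    est_mag   : estimated magnitude spectrogram (freq x frames), held fixed
    mix_phase : initial phase, e.g. the phase of the mixture STFT
    n_iter    : number of unfolded iterations (a layer each in the paper)
    """
    phase = mix_phase  # initialize with the mixture phase
    for _ in range(n_iter):
        spec = est_mag * np.exp(1j * phase)
        _, x = istft(spec, nperseg=nperseg)       # iSTFT layer: back to time domain
        _, _, respec = stft(x, nperseg=nperseg)   # STFT layer: re-analyze the signal
        phase = np.angle(respec)                  # keep the refined phase only
    return est_mag * np.exp(1j * phase)           # final consistent-ish spectrogram
```

Each pass projects the spectrogram closer to the set of "consistent" STFTs (those realizable by some time-domain signal), which is exactly the error that magnitude-only surrogate losses ignore. If the input magnitude and phase already come from a real signal's STFT, the loop leaves them essentially unchanged.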


Related research

- Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective (11/22/2018)
  This study investigates phase reconstruction for deep learning based mon...

- Phasebook and Friends: Leveraging Discrete Representations for Source Separation (10/02/2018)
  Deep learning based speech enhancement and source separation systems hav...

- Single-Channel Multi-Speaker Separation using Deep Clustering (07/07/2016)
  Deep clustering is a recently introduced deep learning architecture that...

- Speaker-independent Speech Separation with Deep Attractor Network (07/12/2017)
  Despite the recent success of deep learning for many speech processing t...

- Optimization of Speaker Extraction Neural Network with Magnitude and Temporal Spectrum Approximation Loss (03/24/2019)
  The SpeakerBeam-FE (SBF) method is proposed for speaker extraction. It a...

- Monaural Singing Voice Separation with Skip-Filtering Connections and Recurrent Inference of Time-Frequency Mask (11/04/2017)
  Singing voice separation based on deep learning relies on the usage of t...

- Declipping of Speech Signals Using Frequency Selective Extrapolation (04/07/2022)
  The reconstruction of clipped speech signals is an important task in aud...
