STFT-Domain Neural Speech Enhancement with Very Low Algorithmic Latency

04/21/2022
by Zhong-Qiu Wang, et al.

Deep learning based speech enhancement in the short-time Fourier transform (STFT) domain typically uses a large window length such as 32 ms. A larger window contains more samples and offers higher frequency resolution, which can improve enhancement. This, however, incurs an algorithmic latency of 32 ms in an online setup, because the overlap-add algorithm used in the inverse STFT (iSTFT) operates on the same 32 ms window. To reduce this inherent latency, we adapt a conventional dual window size approach, where a regular input window size is used for the STFT but a shorter output window is used for the overlap-add in the iSTFT, to STFT-domain deep learning based frame-online speech enhancement. Based on this STFT and iSTFT configuration, we employ single- or multi-microphone complex spectral mapping for frame-online enhancement, where a deep neural network (DNN) is trained to predict the real and imaginary (RI) components of target speech from the mixture RI components. In addition, we use the RI components predicted by the DNN to conduct frame-online beamforming, the results of which are then used as extra features for a second DNN that performs frame-online post-filtering. The frequency-domain beamforming between the two DNNs can be easily integrated with complex spectral mapping and is designed not to incur any algorithmic latency. We also propose a future-frame prediction technique to further reduce the algorithmic latency. Evaluation results on a noisy-reverberant speech enhancement task demonstrate the effectiveness of the proposed algorithms. Compared with Conv-TasNet, our STFT-domain system achieves better enhancement performance for a comparable amount of computation, or comparable performance with less computation, while maintaining strong performance at an algorithmic latency as low as 2 ms.
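The latency argument in the abstract can be made concrete with a little arithmetic. The Python sketch below is illustrative only and is not taken from the paper's code: the function name `algorithmic_latency_ms` is hypothetical, and it assumes that the algorithmic latency equals the length of the window used for overlap-add in the iSTFT, and that each predicted future frame lowers the latency by one hop. The 4 ms output window and 2 ms hop are example values chosen to be consistent with the 2 ms figure quoted above, not settings confirmed by the abstract.

```python
def algorithmic_latency_ms(output_window_ms: float,
                           hop_ms: float,
                           future_frames: int = 0) -> float:
    """Estimate algorithmic latency of a frame-online STFT system.

    Assumptions (not confirmed by the abstract):
      * latency equals the overlap-add (output) window length;
      * predicting one frame ahead saves one hop of latency.
    """
    if output_window_ms <= 0 or hop_ms <= 0:
        raise ValueError("window and hop lengths must be positive")
    if future_frames < 0:
        raise ValueError("future_frames must be non-negative")
    latency = output_window_ms - future_frames * hop_ms
    if latency < 0:
        raise ValueError("predicting this many frames ahead exceeds the window")
    return latency


# Conventional setup: the same 32 ms window is used for STFT and overlap-add.
conventional = algorithmic_latency_ms(output_window_ms=32.0, hop_ms=8.0)

# Dual window size: a regular input window for analysis, but a shorter
# (here 4 ms, an example value) output window for overlap-add.
dual_window = algorithmic_latency_ms(output_window_ms=4.0, hop_ms=2.0)

# Dual window size plus predicting one future frame.
with_prediction = algorithmic_latency_ms(output_window_ms=4.0, hop_ms=2.0,
                                         future_frames=1)

print(conventional, dual_window, with_prediction)
```

Note that the input (analysis) window length does not appear in the calculation: under the stated assumption only the output window and the number of predicted frames determine the latency, which is why the input window can stay long enough to preserve frequency resolution.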

research
04/15/2022

Improving Frame-Online Neural Speech Enhancement with Overlapped-Frame Prediction

Frame-online speech enhancement systems in the short-time Fourier transf...
research
02/24/2022

Towards Low-distortion Multi-channel Speech Enhancement: The ESPNet-SE Submission to The L3DAS22 Challenge

This paper describes our submission to the L3DAS22 Challenge Task 1, whi...
research
10/01/2021

Leveraging Low-Distortion Target Estimates for Improved Speech Enhancement

A promising approach for multi-microphone speech separation involves two...
research
06/29/2016

Optimising The Input Window Alignment in CD-DNN Based Phoneme Recognition for Low Latency Processing

We present a systematic analysis on the performance of a phonetic recogn...
research
04/18/2023

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

We propose FSB-LSTM, a novel long short-term memory (LSTM) based archite...
research
03/04/2020

Multi-Microphone Complex Spectral Mapping for Speech Dereverberation

This study proposes a multi-microphone complex spectral mapping approach...
research
05/09/2019

Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates

This paper addresses the problem of block-online processing for multi-ch...
