Enhancement and Recognition of Reverberant and Noisy Speech by Extending Its Coherence

by   Scott Wisdom, et al.

Most speech enhancement algorithms make use of the short-time Fourier transform (STFT), a simple and flexible time-frequency decomposition that estimates the short-time spectrum of a signal. However, the duration of STFT analysis frames is inherently limited by the nonstationarity of speech signals. The main contribution of this paper is a demonstration of speech enhancement and automatic speech recognition in the presence of reverberation and noise by extending the length of analysis windows. We accomplish this extension by performing enhancement in the short-time fan-chirp transform (STFChT) domain, an overcomplete time-frequency representation that remains coherent with speech signals over longer analysis window durations than the STFT. This extended coherence is gained by using a linear model of the fundamental frequency variation of voiced speech signals. Our approach centers on a single-channel minimum mean-square error log-spectral amplitude (MMSE-LSA) estimator proposed by Habets, which scales coefficients in a time-frequency domain to suppress noise and reverberation. In the case of multiple microphones, we preprocess the data with either a minimum variance distortionless response (MVDR) beamformer or a delay-and-sum beamformer (DSB). We evaluate our algorithm on both speech enhancement and recognition tasks for the REVERB challenge dataset. Compared to the same processing done in the STFT domain, our approach achieves significant improvement in terms of objective enhancement metrics (including PESQ, the ITU-T standard measure of speech quality). In terms of automatic speech recognition (ASR) performance as measured by word error rate (WER), our experiments indicate that the STFT with a long window is more effective for ASR.

