Assessing the Generalization Gap of Learning-Based Speech Enhancement Systems in Noisy and Reverberant Environments

09/12/2023
by Philippe Gonzalez, et al.

The acoustic variability of noisy and reverberant speech mixtures is influenced by multiple factors, such as the spectro-temporal characteristics of the target speaker and the interfering noise, the signal-to-noise ratio (SNR), and the room characteristics. This large variability poses a major challenge for learning-based speech enhancement systems, since a mismatch between the training and testing conditions can substantially reduce the performance of the system. Generalization to unseen conditions is typically assessed by testing the system with a new speech, noise, or binaural room impulse response (BRIR) database different from the one used during training. However, the difficulty of the speech enhancement task can change across databases, which can substantially influence the results. The present study introduces a generalization assessment framework that uses a reference model trained on the test condition, such that it can serve as a proxy for the difficulty of the test condition. This makes it possible to disentangle the effect of the change in task difficulty from the effect of dealing with new data, and thus to define a new measure of generalization performance termed the generalization gap. The procedure is repeated in a cross-validation fashion by cycling through multiple speech, noise, and BRIR databases to accurately estimate the generalization gap. The proposed framework is applied to evaluate the generalization potential of a feedforward neural network (FFNN), Conv-TasNet, DCCRN, and MANNER. We find that for all models, the performance degrades the most under speech mismatches, while good noise and room generalization can be achieved by training on multiple databases. Moreover, while recent models show higher performance in matched conditions, their performance substantially decreases in mismatched conditions and can become inferior to that of the FFNN-based system.
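The abstract does not give the exact formula for the generalization gap, but the idea of normalizing a model's score on unseen data by a matched reference model, then averaging over cross-validation folds, can be sketched as follows. The relative-difference definition, the function names, and the example scores are assumptions for illustration, not the paper's actual implementation.

```python
def generalization_gap(score_mismatched: float, score_reference: float) -> float:
    """Relative performance drop of a model evaluated on an unseen
    test condition, measured against a reference model trained
    directly on that test condition (a proxy for task difficulty).

    Assumes higher scores are better (e.g. a PESQ or SNR improvement).
    """
    return (score_reference - score_mismatched) / score_reference


def mean_gap_over_folds(score_pairs):
    """Cycle through (mismatched, reference) score pairs, one per
    held-out speech/noise/BRIR database, and average the per-fold
    gaps, mimicking the cross-validation procedure described above."""
    gaps = [generalization_gap(m, r) for m, r in score_pairs]
    return sum(gaps) / len(gaps)


# Hypothetical scores for two held-out databases:
# fold 1: mismatched model scores 6.0, matched reference scores 8.0
# fold 2: mismatched model scores 5.0, matched reference scores 10.0
avg_gap = mean_gap_over_folds([(6.0, 8.0), (5.0, 10.0)])
print(avg_gap)  # 0.375, i.e. a 37.5% average relative drop
```

Normalizing by the reference score, rather than reporting the raw score difference, is what separates "the test set is harder" from "the model fails to generalize": a model that matches its reference on a hard database gets a gap of zero.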

