Cleanformer: A microphone array configuration-invariant, streaming, multichannel neural enhancement frontend for ASR

04/25/2022
by Joseph Caroselli, et al.

This work introduces the Cleanformer, a streaming multichannel neural enhancement frontend for automatic speech recognition (ASR). The model has a Conformer-based architecture that takes as inputs a single channel each of the raw and enhanced signals, and uses self-attention to derive a time-frequency mask. The enhanced input is generated by a multichannel adaptive noise cancellation algorithm known as Speech Cleaner, which makes use of noise context to derive its filter taps. The time-frequency mask is applied to the noisy input to produce enhanced output features for ASR. Detailed evaluations are presented with simulated and re-recorded datasets in speech-based and non-speech-based noise; they show a significant reduction in word error rate (WER) when using a large-scale state-of-the-art ASR model. The model is also shown to significantly outperform enhancement using a beamformer with ideal steering. The enhancement model is agnostic to the number of microphones and the array configuration and can therefore be used with different microphone arrays without retraining. It is demonstrated that performance improves with more microphones, up to 4, with each additional microphone providing a smaller marginal benefit. Specifically, at an SNR of -6 dB, relative WER improvements of about 80% are shown in both noise conditions.
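As a rough illustration of the masking step and the relative WER metric mentioned in the abstract, the sketch below applies a time-frequency mask element-wise to noisy magnitude features. It is not the Cleanformer implementation: the function names `apply_tf_mask` and `relative_wer_improvement`, the toy spectrogram, the use of an ideal ratio mask as a stand-in for the network's predicted mask, and the WER values are all assumptions made for illustration.

```python
import numpy as np


def apply_tf_mask(noisy_mag, mask):
    """Apply a time-frequency mask element-wise to noisy magnitude features.

    Both arrays are assumed to have shape (frames, frequency_bins).
    The mask is clipped to [0, 1] before it is applied.
    """
    if noisy_mag.shape != mask.shape:
        raise ValueError("mask and features must share the (frames, bins) shape")
    return np.clip(mask, 0.0, 1.0) * noisy_mag


def relative_wer_improvement(baseline_wer, enhanced_wer):
    """Relative WER reduction in percent: 100 * (baseline - enhanced) / baseline."""
    return 100.0 * (baseline_wer - enhanced_wer) / baseline_wer


# Toy data (hypothetical): 4 frames, 8 frequency bins, speech energy in bin 2.
rng = np.random.default_rng(0)
frames, bins = 4, 8
speech = np.zeros((frames, bins))
speech[:, 2] = 1.0
noise = 0.5 * rng.random((frames, bins))
noisy = speech + noise

# Assumption: an ideal ratio mask stands in for the mask a trained model would predict.
ideal_mask = speech / (speech + noise + 1e-8)
enhanced = apply_tf_mask(noisy, ideal_mask)

# Hypothetical WERs chosen only to show the arithmetic behind a relative improvement.
improvement = relative_wer_improvement(baseline_wer=50.0, enhanced_wer=10.0)
```

With the ideal mask, `enhanced` recovers the toy speech magnitudes almost exactly. A predicted mask would only approximate this. The hypothetical WER pair corresponds to a relative improvement of 80%, the order of magnitude the abstract reports at -6 dB.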


