SNRi Target Training for Joint Speech Enhancement and Recognition

11/01/2021
by Yuma Koizumi et al.

This study aims to improve the performance of automatic speech recognition (ASR) under noisy conditions. The use of a speech enhancement (SE) frontend has been widely studied for noise-robust ASR. However, most single-channel SE models introduce processing artifacts into the enhanced speech, which degrades ASR performance. To overcome this problem, we propose Signal-to-Noise Ratio improvement (SNRi) target training, in which the SE frontend automatically controls its noise reduction level so that artifacts do not degrade ASR performance. The SE frontend takes an auxiliary scalar input that represents the target SNRi of the output signal. The target SNRi value is estimated by an SNRi prediction network, which is trained to minimize the ASR loss. Experiments using 55,027 hours of noisy speech training data show that SNRi target training enables control of the SNRi of the output signal, and that the joint training reduces word error rate by 12% [...] ASR model.
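For readers unfamiliar with the quantity being controlled, the following is a minimal sketch of how SNRi is commonly computed: the SNR of the enhanced signal minus the SNR of the noisy input, both in dB and both measured against the clean reference. The abstract does not give the paper's exact formulation, so this assumes the standard definition; the function names and the toy signals are hypothetical and chosen only for illustration.

```python
import math


def power(signal):
    """Mean power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(target, estimate):
    """SNR in dB of `estimate` measured against the clean `target`."""
    # Everything in the estimate that is not the target counts as noise.
    residual = [e - t for e, t in zip(estimate, target)]
    return 10.0 * math.log10(power(target) / power(residual))


def snr_improvement_db(clean, noisy, enhanced):
    """SNRi: SNR of the enhanced signal minus SNR of the noisy input (assumed definition)."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)


# Toy signals (hypothetical): a sinusoidal "speech" signal plus sinusoidal "noise".
clean = [math.sin(0.1 * n) for n in range(1000)]
noise = [0.5 * math.sin(1.7 * n + 0.3) for n in range(1000)]
noisy = [c + v for c, v in zip(clean, noise)]

# Stand-in for an SE frontend output: the noise amplitude is scaled by 0.1,
# which corresponds to a 20 dB reduction in noise power.
enhanced = [c + 0.1 * v for c, v in zip(clean, noise)]

snri = snr_improvement_db(clean, noisy, enhanced)
print(f"SNRi = {snri:.2f} dB")
```

In the setup the abstract describes, a scalar of this kind would be supplied to the SE frontend as the target rather than measured after the fact; the sketch only shows what the number means.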

research
08/24/2023

Naaloss: Rethinking the objective of speech enhancement

Reducing noise interference is crucial for automatic speech recognition ...
research
03/09/2020

Improving noise robust automatic speech recognition with single-channel time-domain enhancement network

With the advent of deep learning, research on noise-robust automatic spe...
research
01/18/2022

How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR

It is challenging to improve automatic speech recognition (ASR) performa...
research
02/13/2019

Enhanced Robot Speech Recognition Using Biomimetic Binaural Sound Source Localization

Inspired by the behavior of humans talking in noisy environments, we pro...
research
06/02/2021

Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

Although recent advances in deep learning technology improved automatic ...
research
11/15/2020

Speech enhancement guided by contextual articulatory information

Previous studies have confirmed the effectiveness of leveraging articula...
research
01/11/2022

Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition

The combination of a deep neural network (DNN) -based speech enhancement...
