Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition

06/02/2021
by Hiroshi Sato, et al.

Although recent advances in deep learning have improved automatic speech recognition (ASR), it remains difficult to recognize speech when it overlaps with other people's voices. Speech separation or extraction is often used as a front-end to ASR to handle such overlapping speech. However, deep neural network-based speech enhancement can generate "processing artifacts" as a side effect of the enhancement, which degrade ASR performance. For example, it is well known that single-channel noise reduction for non-speech noise (i.e., non-overlapping speech) often does not improve ASR. Likewise, although separation/extraction is usually believed to improve ASR, processing artifacts may also be detrimental to ASR under some conditions when processing overlapping speech with a separation/extraction method. To answer the question "Do we always have to separate/extract speech from mixtures?", we analyze ASR performance on observed and enhanced speech under various noise and interference conditions, and show that speech enhancement degrades ASR under some conditions even for overlapping speech. Based on these findings, we propose a simple switching algorithm between observed and enhanced speech based on the estimated signal-to-interference ratio and signal-to-noise ratio. We demonstrate experimentally that such a simple switching mechanism can improve recognition performance when processing artifacts are detrimental to ASR.
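
To make the idea concrete, the switching rule can be pictured as simple thresholding on the estimated signal-to-interference ratio (SIR) and signal-to-noise ratio (SNR). The sketch below is a minimal Python illustration of such a rule, not the paper's exact method; the function name, threshold values, and SIR/SNR estimates are hypothetical assumptions.

```python
# Minimal sketch of SIR/SNR-based switching between the observed mixture
# and the enhanced (separated/extracted) signal before ASR decoding.
# Thresholds and names are illustrative assumptions, not the paper's rule.
import numpy as np


def choose_asr_input(observed: np.ndarray,
                     enhanced: np.ndarray,
                     est_sir_db: float,
                     est_snr_db: float,
                     sir_threshold_db: float = 10.0,
                     snr_threshold_db: float = 5.0) -> np.ndarray:
    """Pick the waveform to feed to the ASR back-end.

    Heuristic: if the estimated SIR is already high (little speaker
    overlap) or the estimated SNR is low (enhancement artifacts tend to
    dominate), keep the observed signal; otherwise use the enhanced one.
    """
    if est_sir_db >= sir_threshold_db or est_snr_db <= snr_threshold_db:
        return observed   # enhancement is likely to hurt recognition
    return enhanced       # enhancement is likely to help recognition


# Toy usage with dummy waveforms and example SIR/SNR estimates.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    observed = rng.standard_normal(16000)   # 1 s of audio at 16 kHz
    enhanced = 0.5 * observed               # stand-in for an enhanced signal
    selected = choose_asr_input(observed, enhanced,
                                est_sir_db=15.0, est_snr_db=20.0)
    print("kept observed signal:", selected is observed)
```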

Related research

- Learning to Enhance or Not: Neural Network-Based Switching of Enhanced and Observed Signals for Overlapping Speech Recognition (01/11/2022). The combination of a deep neural network (DNN)-based speech enhancement...
- SNRi Target Training for Joint Speech Enhancement and Recognition (11/01/2021). This study aims to improve the performance of automatic speech recogniti...
- Anchored Speech Recognition with Neural Transducers (10/20/2022). Neural transducers have gained popularity in production ASR systems, ach...
- How Bad Are Artifacts?: Analyzing the Impact of Speech Enhancement Errors on ASR (01/18/2022). It is challenging to improve automatic speech recognition (ASR) performa...
- Cleanformer: A microphone array configuration-invariant, streaming, multichannel neural enhancement frontend for ASR (04/25/2022). This work introduces the Cleanformer, a streaming multichannel neural ba...
- Enhanced Robot Speech Recognition Using Biomimetic Binaural Sound Source Localization (02/13/2019). Inspired by the behavior of humans talking in noisy environments, we pro...
- SDR - half-baked or well done? (11/06/2018). In speech enhancement and source separation, signal-to-noise ratio is a ...
