Deep Xi as a Front-End for Robust Automatic Speech Recognition
Front-end techniques for robust automatic speech recognition (ASR) have been dominated by masking- and mapping-based deep learning approaches to speech enhancement. Previously, minimum mean-square error (MMSE) approaches to speech enhancement using Deep Xi (a deep learning approach to a priori SNR estimation) achieved higher quality and intelligibility scores than recent masking- and mapping-based deep learning approaches. Given this strong speech enhancement performance, we investigate the use of Deep Xi as a front-end for robust ASR. Deep Xi is evaluated using real-world non-stationary and coloured noise sources at multiple SNR levels. Deep Xi achieved a relative word error rate reduction of 23.2% when compared to a recent deep learning-based front-end. The results presented in this work show that Deep Xi is a viable front-end that can significantly increase the robustness of an ASR system. Availability: Deep Xi is available at https://github.com/anicolson/DeepXi
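To make the pipeline above concrete, the sketch below shows how an a priori SNR estimate can be turned into an MMSE-style gain and applied to a noisy magnitude spectrum, plus how a relative word error rate (WER) reduction is computed. It is a minimal illustration under stated assumptions, not the Deep Xi implementation: the network that produces the a priori SNR estimate is not shown, the per-bin estimates `xi_hat_db` are hypothetical placeholder values in dB, and the Wiener gain G = ξ/(1 + ξ) is used as one common choice of gain function. The helper names (`wiener_gain`, `db_to_linear`, `enhance_frame`, `relative_wer_reduction`) are hypothetical, and the WER figures in the usage example are illustrative, not results from this work.

```python
def db_to_linear(xi_db):
    """Convert an a priori SNR from dB to a linear power ratio."""
    return 10.0 ** (xi_db / 10.0)


def wiener_gain(xi):
    """Wiener filter gain G = xi / (1 + xi) for a linear a priori SNR xi."""
    return xi / (1.0 + xi)


def enhance_frame(noisy_mag, xi_hat_db):
    """Apply per-frequency-bin gains to one frame of a noisy magnitude spectrum.

    Assumption: xi_hat_db holds one a priori SNR estimate (in dB) per bin,
    e.g. as produced by an estimator such as Deep Xi (not shown here).
    """
    if len(noisy_mag) != len(xi_hat_db):
        raise ValueError("need one a priori SNR estimate per frequency bin")
    return [wiener_gain(db_to_linear(x)) * m for m, x in zip(noisy_mag, xi_hat_db)]


def relative_wer_reduction(wer_baseline, wer_new):
    """Relative WER reduction of a new system with respect to a baseline."""
    return (wer_baseline - wer_new) / wer_baseline


if __name__ == "__main__":
    # Hypothetical frame: three bins with a priori SNRs of 0, 10 and -10 dB.
    noisy = [1.0, 2.0, 0.5]
    xi_db = [0.0, 10.0, -10.0]
    print(enhance_frame(noisy, xi_db))  # low-SNR bins are attenuated most

    # Illustrative numbers only: 20% WER reduced to 15% is a 25% relative reduction.
    print(relative_wer_reduction(0.20, 0.15))
```

In a real front-end the gains would be applied to every frame of the short-time Fourier transform, and the enhanced signal (or features computed from it) would then be passed to the ASR system; the choice of gain function is a design parameter, and MMSE short-time spectral amplitude or log-spectral amplitude gains are common alternatives to the Wiener gain shown here.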