Multi-Geometry Spatial Acoustic Modeling for Distant Speech Recognition

by Kenichi Kumatani, et al.

The use of spatial information with multiple microphones can improve far-field automatic speech recognition (ASR) accuracy. However, conventional microphone array techniques degrade speech enhancement performance when there is an array geometry mismatch between design and test conditions. Moreover, such speech enhancement techniques do not always yield ASR accuracy improvement due to the difference between speech enhancement and ASR optimization objectives. In this work, we propose to unify an acoustic model framework by optimizing spatial filtering and long short-term memory (LSTM) layers from multi-channel (MC) input. Our acoustic model subsumes beamformers with multiple types of array geometry. In contrast to deep clustering methods that treat a neural network as a black box tool, the network encoding the spatial filters can process streaming audio data in real time without the accumulation of target signal statistics. We demonstrate the effectiveness of such MC neural networks through ASR experiments on real-world far-field data. We show that our two-channel acoustic model reduces word error rates (WERs) on average by 13.4% and 12.7% relative to a single-channel ASR system trained with the log-mel filter bank energy (LFBE) feature, under the matched and mismatched microphone placement conditions, respectively. Our results also show that our two-channel network achieves a relative WER reduction of over 7.0% compared to conventional beamforming with seven microphones overall.
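The core idea, a trainable spatial filtering layer whose outputs feed LSTM layers, can be illustrated with a minimal forward-pass sketch. The code below is an assumption-laden toy, not the authors' implementation: it applies a bank of complex beamforming weights (multiple "looks") to multi-channel STFT frames and emits log-power features per look, frame by frame, which is what allows streaming operation without accumulating target signal statistics. Shapes, the number of looks, and the weight initialization are all hypothetical.

```python
import numpy as np

def spatial_filter_layer(stft_frames, weights):
    """Apply a bank of complex spatial filters ("looks") to MC STFT frames.

    stft_frames: (T, C, F) complex -- T frames, C microphones, F freq bins
    weights:     (B, C, F) complex -- B spatial filters (one per look)
    returns:     (T, B, F) real    -- log power of each filtered output,
                                      suitable as input to LSTM layers
    """
    # y[t, b, f] = sum_c conj(w[b, c, f]) * x[t, c, f]  (per-bin beamforming)
    y = np.einsum('bcf,tcf->tbf', np.conj(weights), stft_frames)
    # log power with a small floor; computed per frame, so it streams
    return np.log(np.abs(y) ** 2 + 1e-10)

# Toy example: 2 mics, 4 looks, 257 frequency bins, 10 frames.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 2, 257)) + 1j * rng.standard_normal((10, 2, 257))
w = rng.standard_normal((4, 2, 257)) + 1j * rng.standard_normal((4, 2, 257))
feat = spatial_filter_layer(x, w)
print(feat.shape)  # (10, 4, 257)
```

In the paper's setting the filter weights would be learned jointly with the downstream LSTM acoustic model under the ASR objective, rather than fixed as here; training across multiple array geometries is what lets one model subsume beamformers for different microphone placements.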

