Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training

07/19/2017
by   Yanmin Qian, et al.

Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures to address this problem under the assumption that only a single channel of the mixed signal is available. Our technique extends permutation invariant training (PIT) by introducing a front-end feature-separation module trained with the minimum mean square error (MSE) criterion and a back-end recognition module trained with the minimum cross-entropy (CE) criterion. More specifically, during training we compute the average MSE or CE over the whole utterance for each possible utterance-level output-target assignment, pick the one with the minimum MSE or CE, and optimize for that assignment. This strategy elegantly solves the label permutation problem observed in deep-learning-based multi-talker mixed speech separation and recognition systems. The proposed architectures are evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that our proposed architectures can cut the word error rate (WER) by 45.0% and 25.0% relative against the state-of-the-art single-talker speech recognition system across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively. To our knowledge, this is the first work on multi-talker mixed speech recognition for a challenging speaker-independent spontaneous large-vocabulary continuous speech task.
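The utterance-level assignment step described above can be sketched in a few lines: for each permutation of the targets, compute the average MSE over the whole utterance, then keep and optimize the assignment with the lowest loss. The following is a minimal NumPy sketch of that idea, not the paper's implementation; the function name `pit_mse_loss` and the array shapes are illustrative assumptions.

```python
from itertools import permutations

import numpy as np


def pit_mse_loss(outputs, targets):
    """Utterance-level permutation invariant MSE (illustrative sketch).

    outputs, targets: arrays of shape (num_speakers, num_frames, feat_dim).
    Returns the minimum average MSE over all possible output-target
    assignments, together with the best permutation of target indices.
    """
    num_speakers = outputs.shape[0]
    best_loss, best_perm = None, None
    for perm in permutations(range(num_speakers)):
        # Average MSE over the whole utterance for this assignment.
        loss = np.mean((outputs - targets[list(perm)]) ** 2)
        if best_loss is None or loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm
```

In an actual system the gradient would be taken only through the winning assignment's loss, and the CE variant replaces the squared error with frame-level cross-entropy against senone labels; the enumeration over permutations is identical.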

