Progressive Joint Modeling in Unsupervised Single-channel Overlapped Speech Recognition

07/21/2017
by Zhehuai Chen, et al.

Unsupervised single-channel overlapped speech recognition is one of the hardest problems in automatic speech recognition (ASR). Permutation invariant training (PIT) is a state-of-the-art model-based approach that applies a single neural network to this single-input, multiple-output modeling problem. We propose to advance the current state of the art by imposing a modular structure on the neural network, applying a progressive pretraining regimen, and improving the objective function with transfer learning and a discriminative training criterion. The modular structure splits the problem into three sub-tasks: frame-wise interpreting, utterance-level speaker tracing, and speech recognition. The pretraining regimen uses these modules to solve progressively harder tasks. Transfer learning leverages parallel clean speech to improve the training targets for the network. Our discriminative training formulation modifies standard formulations so that it also penalizes competing outputs of the system. Experiments are conducted on the artificially overlapped Switchboard and hub5e-swb datasets. The proposed framework achieves over 30% relative improvement in word error rate (WER) over both a strong jointly trained system, PIT for ASR, and a separately optimized system, PIT for speech separation with a clean speech ASR model. The improvement comes from better model generalization, improved training efficiency, and the integration of sequence-level linguistic knowledge.
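
The permutation invariant criterion mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an utterance-level assignment and uses mean squared error as a stand-in for whatever per-stream loss the system actually optimizes, and the names `stream_error` and `pit_loss` are hypothetical. The idea shown is only that the loss is evaluated under every assignment of output streams to reference streams, and the smallest value is kept.

```python
from itertools import permutations


def stream_error(output, target):
    """Mean squared error between two streams (lists of equal-length frames).

    Assumption: MSE is only a placeholder for the real per-stream loss.
    """
    total, count = 0.0, 0
    for out_frame, tgt_frame in zip(output, target):
        for o, t in zip(out_frame, tgt_frame):
            total += (o - t) ** 2
            count += 1
    return total / count


def pit_loss(outputs, targets):
    """Return (lowest average loss, best permutation) over all assignments.

    perm[i] is the index of the reference stream assigned to output stream i.
    The assignment is fixed for the whole utterance.
    """
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(len(targets))):
        loss = sum(
            stream_error(outputs[i], targets[j]) for i, j in enumerate(perm)
        ) / len(perm)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two reference streams, two frames each; the outputs arrive in swapped order.
targets = [[[0.0, 1.0], [2.0, 3.0]], [[4.0, 5.0], [6.0, 7.0]]]
outputs = [targets[1], targets[0]]
loss, perm = pit_loss(outputs, targets)
```

Enumerating permutations costs factorial time in the number of streams, which is acceptable for the two-speaker case but would need a different assignment strategy for many speakers.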


