
Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation

by Jiatong Shi, et al.

Target-speaker speech recognition aims to recognize the target speaker's speech in noisy environments containing background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction with a Recurrent Neural Network Transducer (RNN-T). To stabilize joint training, we propose a multi-stage training strategy that pre-trains and fine-tunes each module in the system before joint training. In addition, speaker identity and speech enhancement uncertainty measures are proposed to compensate for residual noise and artifacts from the target speech extraction module. Compared to a recognizer fine-tuned with a target speech extraction model, our experiments show that adding the neural uncertainty module significantly reduces 17% [...] background noise. The multi-condition experiments indicate that our method can achieve 9% [...] the performance in the clean condition.
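The multi-stage strategy mentioned in the abstract can be pictured as a schedule that decides which module is updated at each stage. The following is only a minimal sketch: the abstract states that each module is pre-trained and fine-tuned before joint training, but the stage names, their order, and the module labels `extractor` and `recognizer` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical module labels; the paper's own component names may differ.
MODULES = frozenset({"extractor", "recognizer"})


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # modules whose parameters are updated in this stage


def multi_stage_schedule():
    """Return an assumed stage order: separate training first, joint training last."""
    return [
        Stage("pretrain_extractor", frozenset({"extractor"})),
        Stage("pretrain_recognizer", frozenset({"recognizer"})),
        # Assumption: the recognizer is adapted to extracted speech while
        # the extractor stays frozen.
        Stage("finetune_recognizer_on_extracted_speech", frozenset({"recognizer"})),
        Stage("joint_training", MODULES),
    ]


def frozen_modules(stage):
    """Modules that are kept fixed during the given stage."""
    return MODULES - stage.trainable


if __name__ == "__main__":
    for stage in multi_stage_schedule():
        print(stage.name, sorted(stage.trainable), sorted(frozen_modules(stage)))
```

In a real system each stage would wrap an actual training loop; the sketch only captures the idea that both modules are updated together in the final stage alone.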


Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification

The performance of speaker verification degrades significantly when the ...

PL-EESR: Perceptual Loss Based END-TO-END Robust Speaker Representation Extraction

Speech enhancement aims to improve the perceptual quality of the speech ...

A Fully Convolutional Neural Network Approach to End-to-End Speech Enhancement

This paper will describe a novel approach to the cocktail party problem ...

Handling Background Noise in Neural Speech Generation

Recent advances in neural-network based generative modeling of speech ha...

Joint AEC AND Beamforming with Double-Talk Detection using RNN-Transformer

Acoustic echo cancellation (AEC) is a technique used in full-duplex comm...

A Hybrid Continuity Loss to Reduce Over-Suppression for Time-domain Target Speaker Extraction

Speaker extraction algorithm extracts the target speech from a mixture s...

Streaming Noise Context Aware Enhancement For Automatic Speech Recognition in Multi-Talker Environments

One of the most challenging scenarios for smart speakers is multi-talker...