Attention-based multi-task learning for speech-enhancement and speaker-identification in multi-speaker dialogue scenario

by Chiang-Jen Peng et al.

Multi-task learning (MTL) and the attention mechanism have been proven to effectively extract robust acoustic features for various speech-related tasks in noisy environments. In this study, we propose an attention-based MTL (ATM) approach that integrates MTL and an attention-weighting mechanism into a single multi-model learning structure that simultaneously performs speech enhancement (SE) and speaker identification (SI). The proposed ATM system consists of three parts: SE, SI, and an attention network (AttNet). The SE part is built on a long short-term memory (LSTM) model, while deep neural network (DNN) models are used for the SI and AttNet parts. The overall ATM system first extracts representative features, then enhances the speech signal with LSTM-SE and identifies the speaker with DNN-SI. The AttNet computes attention weights based on DNN-SI to provide more representative features to LSTM-SE. We tested the proposed ATM system on Taiwan Mandarin Hearing in Noise Test (TMHINT) sentences. The evaluation results confirm that the proposed system effectively improves the quality and intelligibility of noisy speech input. Moreover, the accuracy of SI is also notably improved by the proposed ATM system.
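The architecture described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation: layer sizes, the sigmoid gating in AttNet, and the frame-averaged SI input are all assumptions made for clarity. The key idea it shows is that an attention network produces per-dimension weights that re-weight the shared acoustic features before the LSTM-based enhancement branch, while a DNN branch predicts speaker identity from the same features.

```python
import torch
import torch.nn as nn

class ATM(nn.Module):
    """Hypothetical sketch of the attention-based multi-task (ATM) structure:
    LSTM-SE for enhancement, DNN-SI for speaker identification, and an
    AttNet that weights the input features fed to the SE branch."""

    def __init__(self, feat_dim=257, hidden=128, n_speakers=10):
        super().__init__()
        # DNN-SI: speaker-identification branch (utterance-level here, an assumption)
        self.si = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_speakers),
        )
        # AttNet: computes attention weights in (0, 1) from the acoustic features
        self.attnet = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Sigmoid())
        # LSTM-SE: enhancement branch operating on attention-weighted features
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.se_out = nn.Linear(hidden, feat_dim)

    def forward(self, x):                    # x: (batch, time, feat_dim)
        w = self.attnet(x)                   # per-dimension attention weights
        h, _ = self.lstm(w * x)              # weighted features -> LSTM-SE
        enhanced = self.se_out(h)            # enhanced spectral features
        spk_logits = self.si(x.mean(dim=1))  # frame-averaged input -> DNN-SI
        return enhanced, spk_logits
```

In a multi-task setup like this, the SE branch would typically be trained with a spectral reconstruction loss and the SI branch with cross-entropy, with the shared AttNet updated by both.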




Related papers:

- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention
- Atss-Net: Target Speaker Separation via Attention-based Neural Network
- End-to-End Speaker Height and Age Estimation using Attention Mechanism with LSTM-RNN
- Distributed Microphone Speech Enhancement based on Deep Learning
- WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement
- A Unified Deep Learning Framework for Short-Duration Speaker Verification in Adverse Environments
- Towards Relevance and Sequence Modeling in Language Recognition