DeepAI AI Chat
Log In Sign Up

Unsupervised Speaker Adaptation using Attention-based Speaker Memory for End-to-End ASR

by   Leda Sarı, et al.

We propose an unsupervised speaker adaptation method inspired by the neural Turing machine for end-to-end (E2E) automatic speech recognition (ASR). The proposed model contains a memory block that holds speaker i-vectors extracted from the training data and reads relevant i-vectors from the memory through an attention mechanism. The resulting memory vector (M-vector) is concatenated to the acoustic features or to the hidden layer activations of an E2E neural network model. The E2E ASR system is based on the joint connectionist temporal classification and attention-based encoder-decoder architecture. M-vector and i-vector results are compared for inserting them at different layers of the encoder neural network using the WSJ and TED-LIUM2 ASR benchmarks. We show that M-vectors, which do not require an auxiliary speaker embedding extraction system at test time, achieve similar word error rates (WERs) compared to i-vectors for single speaker utterances and significantly lower WERs for utterances in which there are speaker changes.


page 1

page 2

page 3

page 4


Back-Translation-Style Data Augmentation for End-to-End ASR

In this paper we propose a novel data augmentation method for attention-...

Convolutional Attention-based Seq2Seq Neural Network for End-to-End ASR

This thesis introduces the sequence to sequence model with Luong's atten...

An End-to-End Text-independent Speaker Verification Framework with a Keyword Adversarial Network

This paper presents an end-to-end text-independent speaker verification ...

Speaker Adaptation for Attention-Based End-to-End Speech Recognition

We propose three regularization-based speaker adaptation approaches to a...

Advancing Connectionist Temporal Classification With Attention Modeling

In this study, we propose advancing all-neural speech recognition by dir...

An Attention-Based Speaker Naming Method for Online Adaptation in Non-Fixed Scenarios

A speaker naming task, which finds and identifies the active speaker in ...

Improving And Analyzing Neural Speaker Embeddings for ASR

Neural speaker embeddings encode the speaker's speech characteristics th...