ASR-Aware End-to-end Neural Diarization

02/02/2022
by   Aparna Khare, et al.
0

We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly from ASR output (phones, position-in-word and word boundaries) and features derived from a lexical speaker change detection model, trained by fine-tuning a pretrained BERT model on the ASR output. Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features. First, ASR features are concatenated with acoustic features. Second, we propose a new attention mechanism called contextualized self-attention that utilizes ASR features to build robust speaker representations. Finally, multi-task learning is used to train the model to minimize classification loss for the ASR features along with diarization loss. Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features, reducing the diarization error rate (DER) by 20 baseline.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/31/2020

Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model

End-to-end (E2E) systems have played a more and more important role in a...
research
10/27/2022

Automatic Severity Assessment of Dysarthric speech by using Self-supervised Model with Multi-task Learning

Automatic assessment of dysarthric speech is essential for sustained tre...
research
09/15/2023

Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network

While standard speaker diarization attempts to answer the question "who ...
research
08/08/2020

Word Error Rate Estimation Without ASR Output: e-WER2

Measuring the performance of automatic speech recognition (ASR) systems ...
research
05/23/2023

BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

The recently proposed serialized output training (SOT) simplifies multi-...
research
06/29/2021

Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs

We present two multimodal fusion-based deep learning models that consume...
research
05/16/2020

Speech Recognition and Multi-Speaker Diarization of Long Conversations

Speech recognition (ASR) and speaker diarization (SD) models have tradit...

Please sign up or login with your details

Forgot password? Click here to reset