Analyzing Learned Representations of a Deep ASR Performance Prediction Model

08/26/2018
by Zied Elloumi, et al.

This paper addresses a relatively new task: predicting ASR performance on unseen broadcast programs. In a previous paper, we presented an ASR performance prediction system that uses CNNs to encode both text (the ASR transcript) and speech in order to predict word error rate (WER). This work analyzes the speech signal embeddings and text embeddings learnt by the CNN while training our prediction model. We try to better understand which information the deep model captures and how it relates to different conditioning factors. We show that the hidden layers convey a clear signal about speech style, accent, and broadcast type. We then try to leverage these three types of information at training time through multi-task learning. Our experiments show that this makes it possible to train slightly more efficient ASR performance prediction systems that, in addition, simultaneously tag the analyzed utterances according to their speech style, accent, and broadcast program origin.
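
As a rough illustration of the multi-task idea described above, the sketch below combines a WER regression error with classification losses for speech style, accent, and broadcast type. It is not the authors' implementation: the choice of absolute error and cross-entropy, the `aux_weight` value, the function names, and the class counts in the usage example are all assumptions made for illustration.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of one example given raw (unnormalized) class scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def multitask_loss(wer_pred, wer_true, class_logits, class_targets, aux_weight=0.3):
    """Absolute WER error plus a weighted sum of per-task cross-entropies.

    aux_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    regression = abs(wer_pred - wer_true)
    auxiliary = sum(
        softmax_cross_entropy(class_logits[task], class_targets[task])
        for task in class_targets
    )
    return regression + aux_weight * auxiliary


# Hypothetical single utterance: predicted WER 0.25 against a true WER of 0.20,
# with made-up class scores for the three auxiliary tagging tasks.
loss = multitask_loss(
    wer_pred=0.25,
    wer_true=0.20,
    class_logits={
        "style": [2.0, 0.0],
        "accent": [0.0, 0.0],
        "broadcast": [0.0, 1.0, 0.0],
    },
    class_targets={"style": 0, "accent": 1, "broadcast": 1},
)
print(loss)
```

Setting `aux_weight=0` reduces this to the plain WER regression objective, which makes the single-task and multi-task settings easy to compare in such a sketch.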
