Analyzing Hidden Representations in End-to-End Automatic Speech Recognition Systems

09/13/2017
by Yonatan Belinkov, et al.

Neural models have become ubiquitous in automatic speech recognition systems. While neural networks are typically used as acoustic models in more complex systems, recent studies have explored end-to-end speech recognition systems based on neural networks, which can be trained to directly predict text from input acoustic features. Although such systems are conceptually elegant and simpler than traditional systems, it is less obvious how to interpret the trained models. In this work, we analyze the speech representations learned by a deep end-to-end model that is based on convolutional and recurrent layers and trained with a connectionist temporal classification (CTC) loss. We use a pre-trained model to generate frame-level features, which are then given to a classifier trained to label each frame with its phone. We evaluate representations from different layers of the deep model and compare their quality for predicting phone labels. Our experiments shed light on important aspects of the end-to-end model such as layer depth, model complexity, and other design choices.
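The probing setup described in the abstract (frozen per-layer features fed to a separately trained frame classifier) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the features are synthetic stand-ins rather than activations from a real ASR model, the probe is a plain linear softmax classifier chosen for brevity rather than the classifier used in the paper, and names such as `make_frames`, `train_probe`, `layer_a`, and `layer_b` are hypothetical.

```python
import math
import random

def make_frames(n, num_phones, dim, signal, rng):
    """Synthetic stand-in for frame-level features taken from one layer.

    Each frame is (feature_vector, phone_label). `signal` controls how
    strongly the phone identity is encoded in the features (assumption).
    """
    frames = []
    for _ in range(n):
        phone = rng.randrange(num_phones)
        feat = [rng.gauss(0.0, 1.0) + (signal if d == phone else 0.0)
                for d in range(dim)]
        frames.append((feat, phone))
    return frames

def scores_for(W, b, feat):
    """Linear scores, one per phone class."""
    return [sum(w * x for w, x in zip(W[k], feat)) + b[k] for k in range(len(W))]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train_probe(frames, num_phones, dim, epochs=20, lr=0.1):
    """Train a softmax-regression probe with per-frame SGD on cross-entropy."""
    W = [[0.0] * dim for _ in range(num_phones)]
    b = [0.0] * num_phones
    for _ in range(epochs):
        for feat, phone in frames:
            probs = softmax(scores_for(W, b, feat))
            for k in range(num_phones):
                grad = probs[k] - (1.0 if k == phone else 0.0)
                b[k] -= lr * grad
                for d in range(dim):
                    W[k][d] -= lr * grad * feat[d]
    return W, b

def accuracy(W, b, frames):
    """Fraction of frames whose highest-scoring class is the true phone."""
    correct = 0
    for feat, phone in frames:
        s = scores_for(W, b, feat)
        if s.index(max(s)) == phone:
            correct += 1
    return correct / len(frames)

def probe_layer(signal, seed=0, num_phones=3, dim=4):
    """Train a probe on one layer's frames; report held-out frame accuracy."""
    rng = random.Random(seed)
    train = make_frames(600, num_phones, dim, signal, rng)
    test = make_frames(300, num_phones, dim, signal, rng)
    W, b = train_probe(train, num_phones, dim)
    return accuracy(W, b, test)

# Two hypothetical layers: one encodes phone identity strongly, one not at all.
results = {name: probe_layer(signal)
           for name, signal in [("layer_a", 4.0), ("layer_b", 0.0)]}
print(results)
```

In this toy setting, higher held-out probe accuracy for one layer is read as that layer's features carrying more linearly accessible phone information. With a real model, `make_frames` would be replaced by activations extracted from each layer and aligned with frame-level phone labels.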

research
07/09/2019

Analyzing Phonetic and Graphemic Representations in End-to-End Automatic Speech Recognition

End-to-end neural network systems for automatic speech recognition (ASR)...
research
08/17/2022

Analyzing Robustness of End-to-End Neural Models for Automatic Speech Recognition

We investigate robustness properties of pre-trained neural models for au...
research
03/03/2020

Untangling in Invariant Speech Recognition

Encouraged by the success of deep neural networks on a variety of visual...
research
07/22/2021

CarneliNet: Neural Mixture Model for Automatic Speech Recognition

End-to-end automatic speech recognition systems have achieved great accu...
research
08/01/2017

End-to-End Neural Segmental Models for Speech Recognition

Segmental models are an alternative to frame-based models for sequence p...
research
05/25/2020

InfantNet: A Deep Neural Network for Analyzing Infant Vocalizations

Acoustic analyses of infant vocalizations are valuable for research on s...
research
05/30/2017

Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Eliminating the negative effect of non-stationary environmental noise is...
