What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis

07/01/2021
by Shammur Absar Chowdhury, et al.

End-to-end DNN architectures have pushed the state of the art in speech technologies, as well as in other spheres of AI, leading researchers to train ever deeper and more complex models. These improvements have come at the cost of transparency: DNNs are inherently opaque and difficult to interpret, so we no longer understand which features are learned, where they are preserved, and how they interoperate. Such analysis is important for better model understanding, for debugging, and for ensuring fairness in ethical decision making. In this work, we analyze the representations learned within deep speech models trained for speaker recognition, dialect identification, and reconstruction of masked signals. We carry out a layer- and neuron-level analysis of the utterance-level representations captured within pretrained speech models, focusing on speaker, language, and channel properties. We ask: is this information captured in the learned representations? Where is it preserved? How is it distributed? And can we identify a minimal subset of the network that possesses this information? Using diagnostic classifiers, we answer these questions. Our results reveal that: (i) channel and gender information is omnipresent and redundantly distributed; (ii) complex properties such as dialectal information are encoded only in the task-oriented pretrained network and are localised in the upper layers; (iii) a minimal subset of neurons can be extracted to encode a predefined property; (iv) salient neurons are sometimes shared between properties, which can highlight the presence of biases in the network. Our cross-architectural comparison indicates that (v) the pretrained models capture speaker-invariant information and (vi) pretrained CNN models are competitive with Transformers for encoding information about the studied properties. To the best of our knowledge, this is the first study to perform neuron-level analysis of speech models.

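To make the probing setup concrete, the sketch below shows a minimal diagnostic classifier: a linear probe trained on frozen utterance-level representations from one layer of a pretrained speech model, followed by a simple weight-based neuron ranking to approximate the "minimal subset of neurons" idea. This is an illustrative assumption of how such an analysis can be run, not the paper's exact implementation; the names (layer_reps, labels, top_k) and the stand-in random data are placeholders for representations and property labels extracted from a real model.

```python
# Minimal sketch of a diagnostic (probing) classifier on frozen
# utterance-level representations, plus a weight-based neuron ranking
# to approximate a "minimal subset of neurons" for a property.
# All names (layer_reps, labels, top_k) are illustrative placeholders.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


def probe_layer(layer_reps: np.ndarray, labels: np.ndarray, top_k: int = 100):
    """Train a linear probe on one layer's utterance vectors and report
    accuracy with all neurons vs. only the top-k most salient neurons."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        layer_reps, labels, test_size=0.2, random_state=0, stratify=labels
    )

    # Diagnostic classifier: a logistic-regression probe with L2
    # regularization, trained on the frozen representations.
    probe = LogisticRegression(max_iter=1000, C=1.0)
    probe.fit(X_tr, y_tr)
    full_acc = accuracy_score(y_te, probe.predict(X_te))

    # Rank neurons by the magnitude of their probe weights (summed over
    # classes) as a rough saliency measure for the probed property.
    saliency = np.abs(probe.coef_).sum(axis=0)
    top_neurons = np.argsort(saliency)[::-1][:top_k]

    # Retrain the probe on the selected subset to check how much of the
    # property is recoverable from a minimal set of neurons.
    sub_probe = LogisticRegression(max_iter=1000, C=1.0)
    sub_probe.fit(X_tr[:, top_neurons], y_tr)
    sub_acc = accuracy_score(y_te, sub_probe.predict(X_te[:, top_neurons]))

    return full_acc, sub_acc, top_neurons


if __name__ == "__main__":
    # Stand-in data: 500 utterances, 768-dim layer representations,
    # binary property (e.g., gender or channel). Replace with vectors
    # extracted from a pretrained speech model's hidden layers.
    rng = np.random.default_rng(0)
    reps = rng.normal(size=(500, 768)).astype(np.float32)
    labels = rng.integers(0, 2, size=500)
    full_acc, sub_acc, neurons = probe_layer(reps, labels, top_k=50)
    print(f"all neurons: {full_acc:.3f}  top-50 neurons: {sub_acc:.3f}")
```

Repeating this probe per layer and per property (speaker, dialect, gender, channel) gives the layer-wise picture described above, and comparing which top-ranked neurons overlap across properties approximates the analysis of shared salient neurons.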