Due to their complex non-linear nested structure, deep neural networks are often considered to be black boxes when it comes to analyzing the relationship between input data and network output. This is not only dissatisfying for scientists and engineers working with these models but also entirely unacceptable in domains where understanding and verification of predictions is crucial. Consequently, in health care applications where human verification is indispensable, these complex models are not in use [1, 2]. As a response, a recently emerging branch of machine learning research specifically targets the understanding of different aspects of complex models, including for example methods introspecting learned features [3, 4] and methods explaining model decisions [5, 6, 7, 8, 9]. The latter were originally applied successfully to image classifiers and have more recently been transferred to other domains such as natural language processing [10, 11], EEG analysis, and physics.
This paper explores and extends deep neural network interpretation to audio classification. As in the visual domain, deep neural networks have fostered progress in audio processing [14, 15, 16, 17], particularly in automatic speech recognition (ASR) [18, 19]. However, whereas large corpora of annotated speech data are available [20, 21, 22], there is a distinct lack of a simple raw waveform dataset for audio classification that can serve as a first sandbox setting for testing novel model architectures and interpretation algorithms. In the style of the MNIST dataset of handwritten digits, which has taken this role in computer vision, we created a dataset of spoken digits in English, which we hope will fill this gap (similar datasets are also available for Arabic and Japanese). Due to its conceptual similarity, the dataset will be referred to as AudioMNIST. The dataset allows for several different classification tasks, of which we explore spoken digit recognition and recognition of a speaker’s gender here. Specifically, for both these tasks, two deep neural network models are trained on the AudioMNIST dataset, one directly on the raw audio waveforms, the other on time-frequency spectrograms of the data. We use layer-wise relevance propagation (LRP) to investigate the relationship between input data and network output, and demonstrate that the spectrogram-based gender classification is mainly based on differences in lower frequency ranges and, furthermore, that models trained on raw waveforms focus on a rather small fraction of the input data.
The remainder of the paper is organized as follows. In Section 2 we present the AudioMNIST dataset, describe the deep models used for gender and digit classification, and introduce LRP as a general technique for explaining classifier decisions. Section 3 presents the results on the spoken digit dataset and discusses the interpretations obtained with LRP. Section 4 concludes the paper with a brief summary and a discussion of future work.
2 Interpreting & Evaluating Deep Audio Classifiers
This section presents a new benchmark dataset for audio classification and model interpretation, introduces a spectrogram-based and a waveform-based neural network model, and describes a general technique for explaining deep classifiers.
2.1 AudioMNIST dataset
The AudioMNIST dataset (https://github.com/soerenab/AudioMNIST) consists of 30000 audio recordings (9.5 hours) of spoken digits (0-9) in English, with 50 recordings per digit from each of the 60 different speakers. The audio recordings were collected in quiet offices with a RØDE NT-USB microphone as a mono channel signal with a sampling frequency of 48 kHz and were saved in 16-bit integer format. In addition to the audio recordings, meta information including age (range: 22-61 years), gender (12 female / 48 male), origin and accent of all speakers was collected as well. The digits to be spoken were presented in random order on a screen, and any digit that was misread by a speaker was repeated at the end. All speakers were informed about the intent of the data collection and gave written consent to participate prior to their recording session.
2.2 Audio classification
The AudioMNIST dataset offers several machine learning tasks in the audio domain, of which classification of digits and classification of the gender of the speaker are reported on here. Audio classification is often based on spectrogram representations of the data, but successful classification based on raw waveform data has been reported as well. Using a spectrogram representation makes it possible to employ neural network architectures such as AlexNet or VGG that were originally designed for image classification. We implemented two networks for classifying spoken digits: one model uses a spectrogram representation as input data, the other the raw waveform.
2.2.1 Classification based on spectrograms
Audio recordings were resampled to 8 kHz, zero-padded to a fixed signal dimensionality of 8000, and transformed to a spectrogram representation via the short-time Fourier transform (STFT). During zero-padding, each audio recording was placed at a random position within the zero-padded signal, which can be regarded as a form of data augmentation. The parameters of the short-time Fourier transform were fixed, and the resulting spectrograms were cropped by discarding the highest frequency bin and the last two time bins. The amplitude of the cropped spectrograms was converted to decibels and used as input to the network. The network architecture was a slight modification of the implementation of AlexNet as provided in the Caffe toolbox, where the number of input channels was changed to 1 and the dimensions of the fully-connected layers were changed to 1024, 1024 and 10.
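The preprocessing chain described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names and the STFT parameters `n_fft` and `hop` are assumptions chosen for the example, so the resulting spectrogram size differs from the one used in the paper.

```python
import numpy as np

def pad_random(signal, target_len=8000, rng=None):
    """Embed `signal` at a random offset inside a zero vector of length `target_len`."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size > target_len:
        raise ValueError("signal is longer than the target length")
    offset = int(rng.integers(0, target_len - signal.size + 1))
    padded = np.zeros(target_len)
    padded[offset:offset + signal.size] = signal
    return padded, offset

def stft_db(signal, n_fft=256, hop=64, floor=1e-10):
    """Magnitude STFT in decibels, shape (frequency bins, time frames).

    n_fft and hop are illustrative values, not those of the paper.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (signal.size - n_fft) // hop
    frames = np.stack([signal[t * hop:t * hop + n_fft] * window
                       for t in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return 20.0 * np.log10(np.maximum(magnitude, floor))

def crop(spec):
    """Discard the highest frequency bin and the last two time bins."""
    return spec[:-1, :-2]
```

Because the offset is drawn anew for every call, the same recording yields a differently shifted spectrogram each time it is preprocessed, which is the augmentation effect mentioned in the text.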
The dataset was split into five disjoint subsets, each containing 6000 spectrograms, such that the samples of any speaker appeared in only one of the five subsets. In a five-fold cross-validation, three of the subsets were merged into a training set while the other two subsets served as validation and test sets. The final, fold-dependent preprocessing step consisted of subtracting the element-wise mean of the respective training set from all spectrograms. The model was trained with stochastic gradient descent with a batch size of 100 spectrograms for 10000 epochs. The initial learning rate of 0.001 was reduced by a factor of 0.5 every 2500 epochs, momentum was kept constant at 0.9 throughout training, and gradients were clipped at a magnitude of 5.
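A speaker-disjoint split of this kind can be sketched as below. The helper name and the way the validation and test roles rotate across folds are assumptions made for illustration; the text only states that three subsets form the training set and the remaining two serve as validation and test sets.

```python
def speaker_disjoint_folds(speakers, n_subsets=5):
    """Rotate disjoint speaker subsets through train / validation / test roles."""
    if len(speakers) % n_subsets != 0:
        raise ValueError("speakers must divide evenly into the subsets")
    size = len(speakers) // n_subsets
    subsets = [speakers[i * size:(i + 1) * size] for i in range(n_subsets)]
    folds = []
    for k in range(n_subsets):
        val_idx = (k + 1) % n_subsets  # assumed rotation, not stated in the paper
        train = [s for j, subset in enumerate(subsets)
                 if j not in (k, val_idx) for s in subset]
        folds.append({"train": train, "val": subsets[val_idx], "test": subsets[k]})
    return folds

# 60 speakers, 500 recordings each (10 digits x 50 repetitions)
folds = speaker_disjoint_folds(list(range(1, 61)))
recordings_per_subset = len(folds[0]["test"]) * 10 * 50
```

With 60 speakers, each subset holds 12 speakers and therefore 6000 recordings, which matches the subset size given above.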
For gender classification, the only difference in the network architecture was the adaptation of the output dimensionality of the final layer to 2 to match the binary labels of this task. Furthermore, dataset preparation differed in that the dataset was initially reduced to the 12 female speakers and 12 randomly selected male speakers. These 24 speakers were split into four disjoint subsets, each containing a total of 3000 spectrograms from three female and three male speakers, where again the samples of any speaker appeared in only one of the four subsets. In a four-fold cross-validation, two of the subsets were merged into a training set while the other two subsets served as validation and test sets. All other preprocessing steps and network training parameters were identical to those of the digit classification task.
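The balanced speaker selection for the gender task might look like the following sketch. The function name, the speaker identifiers and the fixed seed are hypothetical; only the counts (12 female, 12 of 48 male, four subsets of three plus three speakers) come from the text.

```python
import random

def balanced_gender_subsets(female, male, n_subsets=4, seed=0):
    """Pair all female speakers with equally many randomly drawn male speakers
    and distribute them over disjoint, gender-balanced subsets."""
    rng = random.Random(seed)
    selected_male = rng.sample(male, len(female))
    shuffled_female = list(female)
    rng.shuffle(shuffled_female)
    per_subset = len(shuffled_female) // n_subsets
    return [
        {
            "female": shuffled_female[i * per_subset:(i + 1) * per_subset],
            "male": selected_male[i * per_subset:(i + 1) * per_subset],
        }
        for i in range(n_subsets)
    ]

female_ids = [f"f{i:02d}" for i in range(12)]   # hypothetical identifiers
male_ids = [f"m{i:02d}" for i in range(48)]     # hypothetical identifiers
subsets = balanced_gender_subsets(female_ids, male_ids)
```

Each subset then contains six speakers with 500 recordings each, i.e. the 3000 spectrograms per subset stated above.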
2.2.2 Classification based on raw waveforms
For classification based on raw waveforms, audio samples were resampled and zero-padded as described in Section 2.2.1, yielding the same signal dimensionality of 8000, which we represent as an (8000 × 1 × 1) tensor by adding two dummy axes (“width” and “depth”) for the convolution operator in the input layer. Afterwards, the signal is normalized by the waveform’s 95th amplitude percentile; we did not normalize by a waveform’s maximal amplitude because of some clear outliers caused by environmental noise during the recordings. The resulting waveforms were used directly as input to a CNN inspired by prior work, whose architecture is depicted in Fig. 1.
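The percentile-based normalization and the reshaping with two dummy axes can be sketched as follows. This is an illustrative reading of the description: it assumes that the percentile is taken over the absolute amplitudes of the whole zero-padded signal, which the text does not specify, and the function names are made up for the example.

```python
import numpy as np

def normalize_waveform(waveform, percentile=95.0):
    """Scale a waveform by the given percentile of its absolute amplitude.

    Assumption: the percentile is computed over the full (zero-padded) signal.
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    scale = np.percentile(np.abs(waveform), percentile)
    if scale == 0.0:
        return waveform.copy()  # avoid division by zero for (near-)silent input
    return waveform / scale

def to_input_tensor(waveform):
    """Add two singleton axes ("width" and "depth") to a 1-D signal."""
    return np.asarray(waveform).reshape(-1, 1, 1)
```

Unlike division by the maximum, a single noise spike barely affects the scale here: the spike simply ends up with a normalized amplitude above 1 while the rest of the signal keeps a comparable range across recordings.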
For clarity, this model will be referred to as AudioNet. In the case of digit classification, the network was trained with stochastic gradient descent with a batch size of 100 and constant momentum of 0.9 for 50000 epochs, with an initial learning rate of 0.0001 which was lowered every 10000 steps by a factor of 0.5. In the case of gender classification, training consisted of only 10000 epochs, with the learning rate being reduced after 5000.
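The step-wise learning rate schedule described for digit classification can be written as a one-line function. The function name is hypothetical and the sketch only restates the numbers given above.

```python
def step_decay(step, initial_lr=1e-4, factor=0.5, interval=10000):
    """Learning rate after `step` training steps with step-wise decay."""
    return initial_lr * factor ** (step // interval)

# Learning rates at the start of each 10000-step block of a 50000-step run
schedule = [step_decay(s) for s in range(0, 50000, 10000)]
```

Over the full run the learning rate is halved four times, so training ends at one sixteenth of the initial rate.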
2.3 Layer-wise relevance propagation
In some fields and domains where interpretability is a key property, linear models are still widely used as the de facto method for learning and inference due to the inherent explainability of their predictions, even though this may mean sacrificing potential prediction performance on more complex problems. Layer-wise Relevance Propagation (LRP) is a technique that decomposes the output $f(x)$ of a learned non-linear predictor into relevance values $R_i$ assigned to the components $x_i$ of the input $x$, closing the gap between highly performing but non-linear learning machines and interpretable ones. An implementation of the algorithm is available in the LRP toolbox.
LRP operates in a top-down manner from the model output to its inputs by iterating over the layers of the network, propagating relevance scores $R_i$ from neurons of hidden layers step by step towards the input. Each $R_i$ describes the contribution an input or hidden variable $x_i$ has made to the final prediction. The core of the method is the redistribution of a relevance value $R_j$ of an upper-layer neuron $j$, provided as an input for one computational step of the algorithm, towards the layer inputs $x_i$, in proportion to the contribution of each input to the activation of the output neuron $j$ in the forward pass:
$$R_{i \leftarrow j} = \frac{z_{ij}}{z_j} R_j, \qquad z_j = \sum_i z_{ij}.$$
The variable $z_{ij}$ describes the forward contribution (or activation energy) sent from input $i$ to output $j$, and $z_j$ is the aggregation of all forward messages $z_{ij}$ over $i$ at $j$. The relevance score $R_i$ at neuron $i$ is then obtained by pooling all incoming relevance quantities $R_{i \leftarrow j}$ from neurons $j$ to which $i$ contributes:
$$R_i = \sum_j R_{i \leftarrow j}.$$
Exact definitions of attributions depend on a layer’s type and position in the pipeline.
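For a single bias-free linear layer, the redistribution step described above can be sketched in a few lines of NumPy. This is a toy illustration rather than the LRP toolbox implementation: the function name is hypothetical, the forward contribution is taken to be the product of input and weight, and the small stabilizer `eps` is an added assumption that guards against division by zero.

```python
import numpy as np

def lrp_dense(x, weights, relevance_out, eps=1e-9):
    """Redistribute relevance from the outputs of a linear layer to its inputs.

    x:             activations of the layer inputs, shape (n_in,)
    weights:       weight matrix, shape (n_in, n_out)
    relevance_out: relevance of the output neurons, shape (n_out,)
    """
    z = x[:, None] * weights                          # forward contributions, input i -> output j
    z_sum = z.sum(axis=0)                             # aggregated contributions per output j
    z_sum = z_sum + eps * np.where(z_sum >= 0, 1.0, -1.0)  # stabilizer (assumption)
    messages = z / z_sum * relevance_out              # relevance sent from output j to input i
    return messages.sum(axis=1)                       # pool over all outputs j

x = np.array([1.0, 2.0, 3.0])
weights = np.array([[0.5, -1.0],
                    [1.0, 0.5],
                    [-0.5, 1.0]])
relevance_out = np.array([2.0, 1.0])
relevance_in = lrp_dense(x, weights, relevance_out)
```

Up to the stabilizer, the input relevances sum to the total output relevance, so the quantity being explained is conserved from layer to layer. Inputs that counteract an output neuron receive negative relevance, as the third input does in this example.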
We visualize the results using a color map centered at zero, since a relevance of zero indicates a neutral or no contribution to the global prediction. Positive relevance scores are shown in hot colors, while negative scores are displayed using cold hues. More information about explanation methods for deep neural networks can be found in the literature.
3 Results
3.1 Classifier performance
Model performances are summarized in Table 1 in terms of means and standard deviations across test splits. AlexNet consistently outperforms AudioNet, yet for both tasks the networks show test set performances well above the respective chance level, i.e. for both tasks the networks discovered discriminative features within the data. The comparatively high standard deviation for gender classification with AudioNet results mainly from a rather consistent misclassification of recordings of a single speaker in one of the test sets.
3.2 Relating network output to input data
3.2.1 Relevance maps for AlexNet
As described in Section 2, LRP computes relevance scores that link input data to a network’s output, i.e. its classification decision. Exemplary input data for AlexNet is displayed in Fig. 6, where spectrograms are overlaid with relevance scores for each input position in the (frequency × time) STFT spectrograms.
Two of the spectrograms in Fig. 6 correspond to the spoken digits zero and one from the same female speaker. AlexNet correctly classifies both spoken digits, and the LRP scores reveal that different areas of the input data appear to be relevant for its decision, although it is difficult to link the features to higher-level concepts such as phonemes.
For gender classification, one input spectrogram in Fig. 6 is identical to that of the zero spoken by the female speaker in the digit classification example, and the other corresponds to a zero spoken by a male speaker. AlexNet correctly classified the gender of both speakers, with most of the relevance distributed in the lower frequency range. Based on the relevance scores, it may be hypothesized that gender classification is based on the fundamental frequency and its immediate harmonics, which are in fact a known discriminative feature for gender.
3.2.2 Relevance maps for AudioNet
In the case of AudioNet, relevance scores are obtained in the form of an 8000-dimensional vector. An exemplary waveform input of a spoken zero from a male speaker, for which the network correctly classifies the gender, is presented in Fig. 10, together with the relevance scores associated with the classification and a closer view of a short time frame of these scores. As is intuitively plausible, zero relevance falls onto the zero-embedding at the left and right side of the data. Furthermore, the closer view in Fig. 10 suggests that mainly samples of large magnitude are relevant for the network’s classification decision.
3.3 Manipulations of relevant input features
3.3.1 Manipulations for AlexNet
The relevance maps of the AlexNet-like gender classifier suggest the hypothesis that the network bases its decision on differences in the fundamental frequency and subsequent harmonics. To test this hypothesis, the test set was manipulated by up-scaling the y-axis (frequency axis) of the spectrograms of male speakers by a factor of 1.5 and down-scaling that of female speakers by a factor of 0.66, such that both the fundamental frequency and the spacing between harmonics approximately matched the original spectrograms of the respective opposite gender. On data manipulated in this fashion, the accuracy of the trained network across test splits falls well below chance level for this task, which supports the hypothesis. In other words, the gender features identified via LRP can be targeted by specific transformations of the inputs such that the classifier predominantly predicts the opposite gender.
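One way to rescale the frequency axis of a spectrogram while keeping its shape is linear interpolation along each time frame, as sketched below. The function name, the interpolation scheme and the choice to fill uncovered high-frequency bins with the spectrogram minimum are assumptions; the text does not describe how the scaling was implemented.

```python
import numpy as np

def scale_frequency_axis(spec, factor):
    """Stretch (factor > 1) or compress (factor < 1) the frequency axis of a
    (frequency, time) spectrogram while keeping its shape."""
    n_freq = spec.shape[0]
    target_bins = np.arange(n_freq)
    source_bins = target_bins / factor   # position each output bin reads from
    fill = spec.min()                    # assumed value for bins without source content
    out = np.empty(spec.shape, dtype=np.float64)
    for t in range(spec.shape[1]):
        out[:, t] = np.interp(source_bins, target_bins, spec[:, t], right=fill)
    return out

# Toy spectrogram with a single "harmonic" at frequency bin 20
toy = np.zeros((64, 3))
toy[20, :] = 1.0
raised = scale_frequency_axis(toy, 1.5)   # as applied to male speakers
lowered = scale_frequency_axis(toy, 0.5)  # 0.5 used here only for exact arithmetic
```

Scaling by 1.5 moves the peak from bin 20 to bin 30, so both the fundamental frequency and the spacing between harmonics grow by the same factor, which is the property the manipulation relies on.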
Unfortunately, an exact time-domain signal for a modified spectrogram is not guaranteed to exist. However, an approximation of the waveform corresponding to the manipulated spectrogram may be obtained via the inverse short-time Fourier transform. Manipulations in the audio signals acquired this way are easily detectable for humans, as voices in the manipulated signals sound rather robotic.
3.3.2 Manipulations for AudioNet
Manipulations of a network’s original input data allow us to assess its reliance on the features that LRP identifies as relevant. This is achieved by an analysis similar to the pixel-flipping (or input perturbation) method introduced in [6, 35].
This analysis verifies that manipulations of relevant features according to LRP cause larger performance deterioration than manipulations of randomly selected features. We restricted this analysis to AudioNet and manipulated the waveform signals in three different ways. The number of changed features is the same for all manipulations and is determined as a fraction of the non-zero features.
For the first two manipulations, only non-zero features are taken into consideration, so that only the actual signal is perturbed. In the first manipulation, a fraction of randomly selected features is set to zero. The second manipulation sets the features with the highest absolute amplitudes to zero. We do this to test whether relevance falls mainly onto samples of high absolute amplitude, as suggested by the closer view in Fig. 10. For the third manipulation, we set to zero those features with the highest relevance as attributed via LRP. Note that LRP-based selection is not constrained to avoid samples within the zero-embedding. Network performance on the manipulated test sets in relation to the fraction of manipulated samples is displayed in Fig. 14 for both digit and gender classification.
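The three selection strategies can be sketched as follows. The function names and the tie-breaking by stable sorting are assumptions; the rule that the number of changed samples is a fraction of the non-zero samples, and that only LRP-based selection may pick samples from the zero-embedding, follows the description above.

```python
import numpy as np

def select_indices(signal, fraction, strategy, relevance=None, rng=None):
    """Choose the sample indices to manipulate for one waveform."""
    nonzero = np.flatnonzero(signal)
    n_change = int(round(fraction * nonzero.size))
    if strategy == "random":
        rng = np.random.default_rng() if rng is None else rng
        return rng.choice(nonzero, size=n_change, replace=False)
    if strategy == "amplitude":
        order = np.argsort(-np.abs(signal[nonzero]), kind="stable")
        return nonzero[order[:n_change]]
    if strategy == "lrp":
        if relevance is None:
            raise ValueError("LRP-based selection needs relevance scores")
        # Not restricted to non-zero samples, as noted in the text
        return np.argsort(-relevance, kind="stable")[:n_change]
    raise ValueError(f"unknown strategy: {strategy}")

def set_to_zero(signal, indices):
    """Return a copy of the signal with the selected samples set to zero."""
    out = np.array(signal, dtype=np.float64)
    out[indices] = 0.0
    return out

# Toy signal with a zero-embedding on both sides and made-up relevance scores
signal = np.array([0.0, 0.0, 0.1, -0.9, 0.5, -0.2, 0.0, 0.0])
relevance = np.array([0.0, 0.0, 0.3, 0.8, -0.1, 0.05, 0.0, 0.0])
by_amplitude = select_indices(signal, 0.5, "amplitude")
by_lrp = select_indices(signal, 0.5, "lrp", relevance=relevance)
fully_manipulated = set_to_zero(signal, select_indices(signal, 1.0, "lrp", relevance=relevance))
```

In the toy example, the sample with negative relevance survives even when the manipulated fraction is 100%, because zero-relevance samples from the zero-embedding are selected before it.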
For both gender and digit classification, network performance deteriorates substantially earlier for LRP-based manipulations than for random manipulations, and slightly earlier than for amplitude-based manipulations. This becomes most apparent for digit classification, where a manipulation of 1% of the data reduces model accuracy to 92% for random, 85% for amplitude-based, and 77% for LRP-based manipulations.
In the case of gender classification, the network furthermore shows a remarkable robustness towards random manipulations, with classification accuracy only starting to decrease when 60% of the signal has been set to zero, as shown in Fig. 14. The accuracy for random and amplitude-based manipulations drops to chance level when 100% of the signal is set to zero. Notably, and counter-intuitively, the accuracy for LRP-based manipulations converges to a value slightly offset from chance level. This is due to the difference in sample selection, as LRP-based selection is not constrained to non-zero values. Fig. 10 shows that samples in the zero-embedding receive a relevance of zero and are hence selected before samples within the signal that receive negative relevance. As a consequence, there are still non-zero samples in the 100% LRP-manipulated signals, which leads to the deviation from chance-level performance.
4 Conclusion
For an increasing number of machine learning tasks, being able to interpret the decision of a model becomes indispensable. So far, most research has focused on explaining image classifiers. To foster research on interpreting audio classification models, we provide a dataset of spoken digits in the English language as raw waveforms. We demonstrated that layer-wise relevance propagation is a suitable interpretability method for explaining deep neural networks for audio classification. In the case of gender classification based on spectrograms, LRP allowed us to form a hypothesis about the features employed by the network. In the case of digit classification, LRP reveals distinctive patterns for different classes. However, the derivation of higher-order concepts such as phonemes or certain frequency ranges proved to be more difficult than for gender classification. Classification on raw waveforms showed that the network bases its decision on a relatively small fraction of highly relevant samples. A possible explanation for this effect, and a subject for future work, is that the network focuses mainly on the “global” shape of the input. Randomly selected samples are uniformly distributed over the time course of the signal, so that, as long as the fraction of manipulated samples is not too large, samples with the original amplitude remain in each local neighborhood and the original shape of the signal is retained. Amplitude- and LRP-based selection, on the other hand, may corrupt the signal in such a way that the global shape can no longer be recognized.
In future work we will apply LRP to more complex audio datasets to gain a deeper insight into classification decisions of deep neural networks in this domain. Furthermore, we will relate the strategies learned by the neural networks to the traditional, hand-designed features extracted from audio signals such as the spectral, temporal and Mel-frequency cepstral coefficients (MFCC) features, and psychoacoustic features (e.g. roughness, loudness, sharpness), which have proven to be very effective for audio classification and analysis.
References
-  R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad, “Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission,” in 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, pp. 1721–1730.
-  F. Doshi-Velez and B. Kim, “Towards a rigorous science of interpretable machine learning,” arXiv:1702.08608, 2017.
-  G. Hinton, S. Osindero, M. Welling, and Y.-W. Teh, “Unsupervised discovery of nonlinear structure using contrastive backpropagation,” Cognitive Science, vol. 30, no. 4, pp. 725–731, 2006.
-  D. Erhan, Y. Bengio, A. Courville, and P. Vincent, “Visualizing higher-layer features of a deep network,” University of Montreal, vol. 1341, no. 3, p. 1, 2009.
-  D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Müller, “How to explain individual classification decisions,” Journal of Machine Learning Research, vol. 11, no. Jun, pp. 1803–1831, 2010.
-  S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLOS ONE, vol. 10, no. 7, p. e0130140, 2015.
-  A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje, “Not just a black box: Learning important features through propagating activation differences,” arXiv:1605.01713, 2016.
-  R. C. Fong and A. Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3449–3457.
-  G. Montavon, S. Bach, A. Binder, W. Samek, and K.-R. Müller, “Explaining nonlinear classification decisions with deep taylor decomposition,” Pattern Recognition, vol. 65, pp. 211–222, 2017.
-  J. Li, X. Chen, E. H. Hovy, and D. Jurafsky, “Visualizing and understanding neural models in NLP,” in Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2016, pp. 681–691.
-  I. Sturm, S. Lapuschkin, W. Samek, and K.-R. Müller, “Interpretable deep neural networks for single-trial eeg classification,” Journal of Neuroscience Methods, vol. 274, pp. 141–145, 2016.
-  K. T. Schütt, F. Arbabzadah, S. Chmiela, K.-R. Müller, and A. Tkatchenko, “Quantum-chemical insights from deep tensor neural networks,” Nature Communications, vol. 8, p. 13890, 2017.
-  H. Lee, P. Pham, Y. Largman, and A. Y. Ng, “Unsupervised feature learning for audio classification using convolutional deep belief networks,” in Advances in Neural Information Processing Systems (NIPS), 2009, pp. 1096–1104.
-  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 8599–8603.
-  W. Dai, C. Dai, S. Qu, J. Li, and S. Das, “Very deep convolutional neural networks for raw waveforms,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 421–425.
-  L. R. Rabiner and B.-H. Juang, Fundamentals of speech recognition. PTR Prentice Hall Englewood Cliffs, 1993, vol. 14.
-  M. Anusuya and S. K. Katti, “Speech recognition by machine; a review,” International Journal of Computer Science and Information Security, vol. 6, no. 3, pp. 181–205, 2009.
-  J. J. Godfrey, E. C. Holliman, and J. McDaniel, “Switchboard: Telephone speech corpus for research and development,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, 1992, pp. 517–520.
-  J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, “Darpa timit acoustic-phonetic continous speech corpus cd-rom. nist speech disc 1-1.1,” NASA STI/Recon Technical Report N, vol. 93, 1993.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
-  Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
-  N. Hammami and M. Sellam, “Tree distribution classifier for automatic spoken arabic digit recognition,” in International Conference for Internet Technology and Secured Transactions (ICITST), 2009, pp. 1–4.
-  K. Nagata, Y. Kato, and S. Chiba, “Spoken digit recognizer for the japanese language,” Journal of the Audio Engineering Society, vol. 12, no. 4, pp. 336–342, 1964.
-  S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore et al., “CNN architectures for large-scale audio classification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131–135.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia (MM), 2014, pp. 675–678.
-  S. Lapuschkin, A. Binder, G. Montavon, K.-R. Müller, and W. Samek, “The layer-wise relevance propagation toolbox for artificial neural networks,” Journal of Machine Learning Research, vol. 17, no. 114, pp. 1–5, 2016.
-  S. Lapuschkin, A. Binder, G. Montavon, K.-R. Muller, and W. Samek, “Analyzing classifiers: Fisher vectors and deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2912–2920.
-  G. Montavon, W. Samek, and K.-R. Müller, “Methods for interpreting and understanding deep neural networks,” Digital Signal Processing, vol. 73, pp. 1–15, 2018.
-  H. Traunmüller and A. Eriksson, “The frequency range of the voice fundamental in the speech of male and female adults,” Unpublished manuscript, 1995.
-  D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
-  W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller, “Evaluating the visualization of what a deep neural network has learned,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 11, pp. 2660–2673, 2017.
-  R. Gonzalez, “Better than mfcc audio classification features,” in The Era of Interactive Media. Springer, 2013, pp. 291–301.