Visualizing Deep Neural Networks for Speech Recognition with Learned Topographic Filter Maps

The uninformative ordering of artificial neurons in Deep Neural Networks complicates visualizing activations in deeper layers. This is one reason why the internal structure of such models is very unintuitive. In neuroscience, activity of real brains can be visualized by highlighting active regions. Inspired by those techniques, we train a convolutional speech recognition model in which filters are arranged in a 2D grid and neighboring filters are similar to each other. We show how those topographic filter maps visualize artificial neuron activations more intuitively. Moreover, we investigate whether this causes phoneme-responsive neurons to be grouped in certain regions of the topographic map.



1 Introduction

Improving the performance of Deep Learning (DL) models is often achieved by increasing their complexity in terms of the number of neurons Szegedy et al. (2015). In this regard, their complexity becomes closer to that of real brains. This inspires utilizing established methods from neuroscience to analyze Artificial Neural Networks (ANNs) Kriegeskorte et al. (2008). In our previous work, we investigated adaptations of the Event-Related Potential (ERP) technique Luck (2005) to analyze neural networks. For a convolutional speech recognizer, we analyzed network responses for words Krug and Stober (2017), predicted graphemes Krug and Stober (2018) and phonemic categories Krug et al. (2018). For deeper layers, we found that neuron activations are not visually interpretable, because the order of neurons within a layer is arbitrary and can be permuted.

Commonly, real brain activity is measured with Electroencephalography (EEG) and the electrode activations are visualized as a topographic map. Such maps are top-view images of the head, which show the electrode activations at their respective location Koles (1991). For areas between different electrode locations, the activation is interpolated. Adapting this technique to visualize ANNs needs a topography of artificial neurons.

In this work, we adopt a regularization strategy by Kavukcuoglu et al. (2009) to learn topographic maps of filters. The authors aimed to automatically learn locally-invariant feature descriptors for sparse coding algorithms in an unsupervised fashion. Their idea was to arrange filters in a 2D grid and encourage similarity of neighboring coefficients. Here, we use this strategy to learn convolutional filters in a 2D grid to visualize neuron activations as topographic maps. We hypothesize that different regions in the 2D grid are active for particular groups of inputs. Specifically, we investigate whether there are distinct regions in the topographic map which are active for particular phonemes.

Figure 1: Close-up views on the regularized filter space in the first layer (A). Network responses to exemplary phonemes in the second layer (B top) and feature visualization for the most responsive region (B bottom).

2 Methods

2.1 Learning topographic filter maps

We learn topographic filter maps by constraining the optimization. First, a grid is defined for every layer. We arrange layers with 256 filters in $16 \times 16$ grids and 2048-filter layers in $64 \times 32$ grids. Filters are encouraged to be similar within an $n \times n$ neighborhood. For each neighborhood that contains a filter, the similarity to the center filter is computed. Those similarities are weighted by the reciprocal Euclidean distance to the center position, and the weights are normalized to sum to 1 per neighborhood. The loss function constraint is the sum of weighted similarities over all filters and all neighborhoods. Here, we use Cosine similarity as the similarity measure in a $3 \times 3$ neighborhood. The Cosine similarity between two vectors $\mathbf{a}$ and $\mathbf{b}$ is defined as $\frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$. As this constraint mainly affects the order of the neurons, it interferes only negligibly with model performance. Figure 1A shows two exemplary neighborhoods in a $16 \times 16$ topographic filter map.
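The neighborhood constraint described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the training code used for the model: the function names are hypothetical, neighborhoods are only evaluated where they fit entirely inside the grid (the text does not specify border handling), and the center filter itself receives weight 0. How the resulting sum enters the training loss (for example with a negative sign, so that minimizing the loss increases similarity) is likewise an assumption.

```python
import numpy as np


def neighborhood_weights(n=3):
    """Reciprocal-Euclidean-distance weights for an n x n neighborhood.

    The center position gets weight 0 (assumption); the remaining
    weights are normalized to sum to 1.
    """
    r = n // 2
    dy, dx = np.mgrid[-r:r + 1, -r:r + 1]
    dist = np.sqrt(dy ** 2 + dx ** 2)
    weights = np.zeros_like(dist, dtype=float)
    weights[dist > 0] = 1.0 / dist[dist > 0]
    return weights / weights.sum()


def cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity a.b / (|a| |b|), with eps to avoid division by zero."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def topographic_similarity(filters, grid_shape, n=3):
    """Sum of weighted cosine similarities over all interior neighborhoods.

    filters: array of shape (num_filters, filter_size), one flattened
    filter per row, laid out row by row on a grid of shape grid_shape.
    """
    rows, cols = grid_shape
    grid = filters.reshape(rows, cols, -1)
    weights = neighborhood_weights(n)
    r = n // 2
    total = 0.0
    # Assumption: only neighborhoods that fit fully inside the grid are used.
    for i in range(r, rows - r):
        for j in range(r, cols - r):
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    if di == 0 and dj == 0:
                        continue
                    sim = cosine_similarity(grid[i, j], grid[i + di, j + dj])
                    total += weights[di + r, dj + r] * sim
    return total


# Example: 256 random filters of length 40 on a 16 x 16 grid.
rng = np.random.default_rng(0)
example_filters = rng.normal(size=(256, 40))
example_value = topographic_similarity(example_filters, (16, 16))
```

Because the weights of each neighborhood sum to 1, a grid of identical filters yields a value equal to the number of evaluated neighborhoods, which gives a simple upper reference point for the quantity.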

2.2 Determining group-specific network responses

To characterize how the network responds to particular groups of inputs, we use Gradient-adjusted Neuron Activation Profiles (GradNAPs; under review). GradNAPs are an extension of our previously described Neuron Activation Profiles (NAPs) Krug et al. (2018). This method is inspired by the ERP technique. The idea is to average neuron activations over all inputs that belong to the same group. For our speech recognizer, we averaged activations over particular phonemes or graphemes. Because the average activations are normalized by subtracting baseline activations, GradNAP values can be negative or positive.
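The group-averaging idea can be illustrated with a short NumPy sketch. It only covers the averaging and baseline subtraction described above; the gradient adjustment that distinguishes GradNAPs is not specified in this text and is omitted. The function name is hypothetical, and using the mean activation over all inputs as the baseline is an assumption.

```python
import numpy as np


def activation_profiles(activations, labels):
    """Group-wise mean activations minus a baseline.

    activations: array-like of shape (num_inputs, num_neurons)
    labels: one group label (e.g. a phoneme) per input
    Assumption: the baseline is the mean activation over all inputs.
    """
    activations = np.asarray(activations, dtype=float)
    labels = np.asarray(labels)
    baseline = activations.mean(axis=0)
    profiles = {}
    for group in sorted(set(labels.tolist())):
        group_mean = activations[labels == group].mean(axis=0)
        profiles[group] = group_mean - baseline
    return profiles


# Example with two neurons and two hypothetical groups "a" and "b".
example_profiles = activation_profiles(
    [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]],
    ["a", "a", "b", "b"],
)
```

In this toy example, the first neuron has a positive profile value for group "a" and a negative one for group "b", which shows how subtracting a baseline produces values of both signs.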

3 Results & Discussion

Figure 1B shows results of our method for 3 exemplary phonemes. The square plots (top half) show time-averaged GradNAP values on the learned topographic map. Plotting the values directly (“on grid”) is hard to interpret visually, because the values do not show smooth transitions. To achieve a visually appealing topographic map, we apply non-strided average pooling in $3 \times 3$ windows (“smoothed”). In the resulting map, we locate the maximum value to identify the $3 \times 3$ region of highest phoneme-responsiveness. For the identified regions, we compute optimal inputs Yosinski et al. (2015) for each filter in the region separately and jointly for the responsive neighborhood. The optimal inputs are shown in the bottom row of Figure 1B, “single-filter” and “joint”, respectively. We observed that only considering the single most responsive region does not reveal phoneme-typical patterns. This indicates that the representations are still distributed widely across the grid. For example, many regions of the grid are strongly activated for phoneme /T/. The most responsive region is therefore likely missing some parts of the distributed representation.
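The smoothing and region selection described above might look like the following NumPy sketch. It assumes stride-1 average pooling without padding, so the smoothed map is smaller than the grid; the padding actually used for the figures is not stated in the text. The function names are hypothetical.

```python
import numpy as np


def smooth_map(grid_values, k=3):
    """Stride-1 k x k average pooling without padding (assumption)."""
    windows = np.lib.stride_tricks.sliding_window_view(grid_values, (k, k))
    return windows.mean(axis=(-2, -1))


def most_responsive_region(grid_values, k=3):
    """Return the top-left grid index of the k x k window with the highest
    mean value, together with the smoothed map."""
    smoothed = smooth_map(grid_values, k)
    i, j = np.unravel_index(np.argmax(smoothed), smoothed.shape)
    return (int(i), int(j)), smoothed


# Example: a 6 x 6 map with one fully active 3 x 3 block.
example_map = np.zeros((6, 6))
example_map[2:5, 3:6] = 1.0
example_corner, example_smoothed = most_responsive_region(example_map)
```

The returned index refers to the top-left corner of the window in the original grid, so the selected filters are those at rows `i` to `i + k - 1` and columns `j` to `j + k - 1`.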

4 Conclusion

Topographic filter maps are a promising way of using well-established methods from neuroscience to visualize Deep Neural Networks. The learned ordering of the neurons allows activations to be shown in a way that is more intuitive for a human observer. However, the current similarity constraint does not sufficiently encourage representational sparsity. Features are still distributed across too many regions in the topographic map. Therefore, optimizing inputs for particular regions in the grid does not yield interpretable patterns. In future work, we will investigate further regularization strategies. This will include adapting the constraint to regularize activations instead of filter weights and incorporating global similarity penalties.


This research has been funded by the Federal Ministry of Education and Research of Germany (BMBF) and supported by the donation of a GeForce GTX TitanX graphics card from the NVIDIA Corporation.


  • K. Kavukcuoglu, R. Fergus, Y. LeCun, et al. (2009) Learning invariant features through topographic filter maps. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1605–1612. Cited by: §1.
  • Z. J. Koles (1991) The quantitative extraction and topographic mapping of the abnormal components in the clinical EEG. Electroencephalography and Clinical Neurophysiology 79 (6), pp. 440–447. External Links: ISBN 0013-4694, ISSN 00134694 Cited by: §1.
  • N. Kriegeskorte, M. Mur, and P. A. Bandettini (2008) Representational similarity analysis-connecting the branches of systems neuroscience. Frontiers in systems neuroscience 2, pp. 4. Cited by: §1.
  • A. Krug, R. Knaebel, and S. Stober (2018) Neuron activation profiles for interpreting convolutional speech recognition models. In Proceedings of the 2018 NeurIPS Workshop IRASL: Interpretability and Robustness for Audio, Speech, and Language, Cited by: §1, §2.2.
  • A. Krug and S. Stober (2017) Adaptation of the event-related potential technique for analyzing artificial neural networks. In Cognitive Computational Neuroscience (CCN), Cited by: §1.
  • A. Krug and S. Stober (2018) Introspection for convolutional automatic speech recognition. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 187–199. Cited by: §1.
  • S. J. Luck (2005) An Introduction to the Event-Related Potential Technique. Monographs of the Society for Research in Child Development 78 (3), pp. 388. External Links: ISBN 0262122774, ISSN 1540-5834 Cited by: §1.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1.
  • J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson (2015) Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579. Cited by: §3.