A Case Study of Deep-Learned Activations via Hand-Crafted Audio Features

07/03/2019 ∙ by Olga Slizovskaia, et al. ∙ 0

The explainability of Convolutional Neural Networks (CNNs) is a particularly challenging task in all areas of application, and it is notably under-researched in the music and audio domains. In this paper, we approach explainability by exploiting the knowledge we have of hand-crafted audio features. Our study focuses on a well-defined MIR task, the recognition of musical instruments from user-generated music recordings. We compute the similarity between a set of traditional audio features and representations learned by CNNs. We also propose a technique for measuring the similarity between activation maps and audio features which are typically presented in the form of a matrix, such as chromagrams or spectrograms. We observe that some neurons' activations correspond to well-known classical audio features. In particular, for shallow layers, we found similarities between activations and the harmonic and percussive components of the spectrum. For deeper layers, we compare chromagrams with high-level activation maps, as well as loudness and onset rate with deep-learned embeddings.




1 Introduction

In this paper, we focus on feature analysis in the music domain. Our goal is to find similar patterns between the features (activations and activation maps) learned by a network and hand-crafted audio features, which are well understood in the literature. For that purpose, we analyse features from a dataset of user-generated recordings of different musical instrument performances. We address musical instrument recognition as it is a well-defined task and it can be objectively evaluated.

For understanding feature attribution, there are two major directions: (1) perturbation-based algorithms, such as LIME (Ribeiro et al., 2016), Axiomatic Attribution (Sundararajan et al., 2017) or Saliency Analysis (Montavon et al., 2017), and (2) gradient-based algorithms, such as Guided Backpropagation (Simonyan et al., 2013; Montavon et al., 2017), Class-Activation Mapping (CAM) (Zhou et al., 2016), and Network Dissection (Bau et al., 2017). In the music domain, the SoundLIME algorithm (Mishra et al., 2017) has been adapted from the original LIME. However, in most cases, the above techniques have limited applicability to spectrograms because, unlike in a typical image, the two dimensions of a spectrogram represent different qualities, namely time and frequency.

Therefore, manual feature exploration remains popular. One could create a playlist which corresponds to a particular neuron, and decide on this neuron's 'specialization' by listening to the playlist. This approach was proposed by Dieleman (2014) and it provides valuable insights. However, it is not scalable because it requires an expert to listen to the playlist and guess the rationale behind it.

Also, we can take advantage of a number of well-established mid-level audio features that have been proposed and studied in the MIR literature (Schedl et al., 2014). We know that CNNs in computer vision learn boundaries in the first layer and more complex concepts in subsequent layers. We hypothesize that audio-based CNNs can occasionally learn some of the hand-crafted features in a similar manner. We try to identify those features in pre-trained neural networks.

2 Methodology

Hand-crafted audio features. We focus our study on a compact set of mid-level features related to different musical facets: onset rate, loudness and Harmonic Pitch Class Profile (HPCP) computed by Essentia (Bogdanov et al., 2013), and Harmonic/Percussive Sound Separation (HPSS) computed by librosa (McFee et al., 2015).
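To make the HPSS feature more concrete, the following is a minimal sketch of the median-filtering idea behind harmonic/percussive separation, written in plain NumPy. It is not the librosa implementation used in the study: the function names `median_filter_1d` and `hpss_sketch`, the kernel length, the mask exponent and the synthetic spectrogram are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_filter_1d(S, size, axis):
    """Median filter of odd length `size` along `axis`, using edge padding."""
    pad = size // 2
    pad_width = [(0, 0)] * S.ndim
    pad_width[axis] = (pad, pad)
    padded = np.pad(S, pad_width, mode="edge")
    # New trailing axis holds the `size` neighbours of every element.
    windows = sliding_window_view(padded, size, axis=axis)
    return np.median(windows, axis=-1)


def hpss_sketch(S, kernel=17, power=2.0, eps=1e-10):
    """Split a non-negative spectrogram S (n_freq, n_frames) into two parts.

    Hypothetical helper: kernel, power and eps are illustrative choices.
    """
    harm = median_filter_1d(S, kernel, axis=1)  # smooth along time
    perc = median_filter_1d(S, kernel, axis=0)  # smooth along frequency
    h_pow, p_pow = harm ** power, perc ** power
    total = h_pow + p_pow + eps
    # Soft masks decide how much of each bin goes to each component.
    return S * (h_pow / total), S * (p_pow / total)


# Synthetic input: one sustained tone (horizontal line) and one click
# (vertical line). Real inputs would be magnitude or mel spectrograms.
S = np.zeros((64, 100))
S[10, :] = 1.0   # sustained partial at frequency bin 10
S[:, 30] = 1.0   # broadband click at frame 30

H, P = hpss_sketch(S)
```

In this toy case the sustained tone ends up in `H` and the click in `P`, while the single bin where they cross is shared between the two components.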

Network Architectures. We explore three state-of-the-art VGG-style architectures: CNN AudioTagger (CNN-AT) (Choi et al., 2016), VGGish (Hershey et al., 2017), and Musically Motivated CNN (MM-CNN) (Pons et al., 2017). All three receive a mel-spectrum as input and consist of blocks of convolutional and max-pooling layers, and dense layers.

The differences between the architectures and their initializations include the filters' shape (square filters in CNN-AT and VGGish, and rectangular filters in MM-CNN), the activation function and the pre-training settings. We trained CNN-AT and MM-CNN on a subset of the FCVID dataset (Jiang et al., 2015). VGGish is initialized with weights provided by the authors. This network has been trained on the large-scale AudioSet dataset (Gemmeke et al., 2017) and potentially has stronger discriminative ability.

Similarity measures: individual activations. For high-level embeddings of a network, we consider each activation as an individual feature and compare them with onset rate and mean loudness. We consider two similarity metrics: (1) Pearson Correlation Coefficient and (2) Euclidean distance over the normalized vectors.
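As an illustration of the two metrics, the sketch below compares every neuron of an embedding layer with one scalar audio feature across a set of recordings. The text does not specify the normalization, so min-max scaling to [0, 1] is assumed for the Euclidean distance; the function names and the synthetic data (a fake "loudness" vector with one planted neuron) are likewise hypothetical.

```python
import numpy as np


def minmax(x, axis=0):
    """Scale values to [0, 1] along `axis` (assumed normalization)."""
    x = np.asarray(x, dtype=float)
    lo = x.min(axis=axis, keepdims=True)
    span = x.max(axis=axis, keepdims=True) - lo
    span = np.where(span == 0, 1.0, span)  # constant vectors map to zeros
    return (x - lo) / span


def activation_feature_similarity(activations, feature):
    """Compare each neuron with one feature over all recordings.

    activations: array of shape (n_recordings, n_neurons)
    feature:     array of shape (n_recordings,)
    Returns (pearson, euclid), each of shape (n_neurons,).
    """
    a = np.asarray(activations, dtype=float)
    f = np.asarray(feature, dtype=float)
    # Pearson correlation: covariance divided by both standard deviations.
    a_c = a - a.mean(axis=0)
    f_c = f - f.mean()
    denom = np.linalg.norm(a_c, axis=0) * np.linalg.norm(f_c)
    pearson = (a_c * f_c[:, None]).sum(axis=0) / denom
    # Euclidean distance between the min-max normalized vectors.
    euclid = np.linalg.norm(minmax(a) - minmax(f)[:, None], axis=0)
    return pearson, euclid


# Synthetic example: neuron 3 is planted to follow the feature closely.
rng = np.random.default_rng(0)
loudness = rng.normal(size=200)
acts = rng.normal(size=(200, 8))
acts[:, 3] = 2.0 * loudness + 0.1 * rng.normal(size=200)

pearson, euclid = activation_feature_similarity(acts, loudness)
best_by_corr = int(np.argmax(np.abs(pearson)))  # strongest correlation
best_by_dist = int(np.argmin(euclid))           # smallest distance
```

Ranking neurons by absolute correlation or by distance then gives candidate correspondences, which would still need a significance test on real data.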

Similarity measures: activation maps. Activations of convolutional layers take the form of a matrix. They are slightly offset from the original input spectrum due to the padding, and proportionally scaled with respect to the input because of max pooling. To some extent, we can think of them as pseudo-spectrograms or as filtered and aggregated spectrograms. In order to compare those activations with HPSS or HPCP, we need a method for fuzzy matrix comparison which is scale- and shift-invariant. We propose a vision-inspired similarity metric based on Scale-Invariant Feature Transform (SIFT) descriptors (Lowe, 2004). SIFT descriptors are among the most recognized features in computer vision and a reasonable choice for similarity measurement (Hua et al., 2012).

To compute similarity between a feature map and an activation map we compute SIFT descriptors and matches between descriptors. An example of matching is shown in Figure 1. Each match is characterized by the matched descriptor indexes and a matching distance.

(a) Top: original log-mel-spectrum. Bottom: log-harmonic component of HPSS, scaled. SIFT matches are connected.
(b) Top: original log-mel-spectrum. Bottom: log-percussive component of HPSS, scaled. SIFT matches are connected.
Figure 1: An example of SIFT matching for the scaled harmonic (a) and percussive (b) parts of HPSS and the shifted spectrum.
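The matching step can be sketched as follows. The example assumes that descriptors have already been extracted from both matrices (for instance with an off-the-shelf SIFT implementation) and are given as plain arrays; it then performs brute-force nearest-neighbour matching with a ratio test in the spirit of Lowe (2004). The function name `match_descriptors`, the ratio threshold and the random descriptors are assumptions, and the text does not state which matcher or filtering rule was actually used.

```python
import numpy as np


def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Brute-force descriptor matching with a nearest/second-nearest ratio test.

    desc_a: array (n_a, d), e.g. descriptors of a hand-crafted feature map
    desc_b: array (n_b, d) with n_b >= 2, e.g. descriptors of an activation map
    Returns a list of (index_a, index_b, distance) tuples.
    """
    desc_a = np.asarray(desc_a, dtype=float)
    desc_b = np.asarray(desc_b, dtype=float)
    # Pairwise Euclidean distances, shape (n_a, n_b).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    matches = []
    for i in range(dists.shape[0]):
        best, second = order[i, 0], order[i, 1]
        # Keep a match only if it is clearly better than the runner-up.
        if dists[i, best] < ratio * dists[i, second]:
            matches.append((i, int(best), float(dists[i, best])))
    return matches


# Synthetic example: the second set is a shuffled, slightly noisy copy.
rng = np.random.default_rng(0)
feature_desc = rng.random((20, 128))
perm = rng.permutation(20)
activation_desc = feature_desc[perm] + 0.01 * rng.normal(size=(20, 128))

matches = match_descriptors(feature_desc, activation_desc)
# One possible (hypothetical) summary of how well the two maps agree.
mean_distance = float(np.mean([d for _, _, d in matches]))
```

Each returned tuple carries the two descriptor indexes and the matching distance, which is the information the similarity metric is built on; how the matches are aggregated into a single score is left open here.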

3 Experiments and Results

High-level embeddings vs. onset rate and loudness. We explored three high-level activation layers of the VGGish model: an embedding layer with 128 neurons and two fully-connected layers with 4096 neurons each. For the embedding layer, we found statistically significant correlations for both onset rate and loudness, and some examples of the corresponding features are shown in Figure 2. In the first fully-connected layer, we discovered that neuron #1964 has an outstanding correlation with loudness (with correlation coefficient ). For CNN-AT, we found that activation #259 corresponds to onset rate.

(a) Loudness/Activation #59.
(b) Onset rate/Activation #127.
(c) Loudness/Activation #117.
(d) Onset rate/Activation #56.
Figure 2: An example of correspondences between VGGish embeddings and mid-level audio features: (a) and (b) are correlation-based correspondences, (c) and (d) are Euclidean-distance-based correspondences.

Low-level feature correspondences. We found a number of interesting activation maps in the first convolutional layer of the VGGish network which look similar to the HPSS decomposition. The histograms of similarity metrics with respect to activation maps can be found in the supplementary materials (high-resolution figures, code and more examples), located at https://goo.gl/jM3jZM. The second convolutional layer of the VGGish network does not have a strong correspondence to the HPSS decomposition, even though some linear combinations of activation maps could be similar.

For the CNN-AT network, we examine the second convolutional layer and observe that the similarity metric histograms for the HPSS decomposition are not consistent, which might be related to a higher false matching rate between decompositions and activation maps. Finally, the first layers of the MM-CNN architecture represent strongly filtered spectrograms, so we presume that the tall rectangular filters of this architecture are similar to band-pass filters.

4 Conclusion

Even though the models we investigate are complex and allow features to be constructed in a very different way than traditional methods, the correspondences between hand-crafted features and activations provide insights for a better understanding of the internal representations of CNNs. We believe that the proposed methodology can be applied to identify important neurons in other tasks and architectures.

5 Acknowledgement

This work has received funding from the Spanish Ministry of Economy and Competitiveness under the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502) and the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant 770376, TROMPA). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X GPU used for this research.


  • Bau et al. (2017) Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017.
  • Bogdanov et al. (2013) Bogdanov, D., Wack, N., Gómez, E., Gulati, S., Herrera, P., Mayor, O., Roma, G., Salamon, J., Zapata, J. R., and Serra, X. ESSENTIA: an Audio Analysis Library for Music Information Retrieval. In International Society for Music Information Retrieval Conference (ISMIR’13), pp. 493–498, 2013.
  • Choi et al. (2016) Choi, K., Fazekas, G., and Sandler, M. Automatic Tagging Using Deep Convolutional Neural Networks. In International Society of Music Information Retrieval Conference. ISMIR, 2016.
  • Dieleman (2014) Dieleman, S. Recommending music on Spotify with deep learning, 2014. URL http://benanne.github.io/2014/08/05/spotify-cnns.html.
  • Gemmeke et al. (2017) Gemmeke, J.F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
  • Hershey et al. (2017) Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R., and Wilson, K. CNN Architectures for Large-Scale Audio Classification. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
  • Hua et al. (2012) Hua, S., Chen, G., Wei, H., and Jiang, Q. Similarity measure for image resizing using SIFT feature. EURASIP Journal on Image and Video Processing, 2012(1):6, 2012.
  • Jiang et al. (2015) Jiang, Y.-G., Wu, Z., Wang, J., Xue, X., and Chang, S.-F. Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks. arXiv preprint arXiv:1502.07209, 2015.
  • Lowe (2004) Lowe, David G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
  • McFee et al. (2015) McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., and Nieto, O. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, pp. 18–25, 2015.
  • Mishra et al. (2017) Mishra, S., Sturm, B., and Dixon, S. Local Interpretable Model-Agnostic Explanations for Music Content Analysis. In International Society of Music Information Retrieval Conference, 2017.
  • Montavon et al. (2017) Montavon, G., Lapuschkin, S., Binder, A., Samek, W., and Müller, K.-R. Explaining nonlinear classification decisions with deep taylor decomposition. Pattern Recognition, 65:211–222, 2017.
  • Pons et al. (2017) Pons, J., Slizovskaia, O., Gong, R., Gómez, E., and Serra, X. Timbre Analysis of Music Audio Signals with Convolutional Neural Networks. In 25th European Signal Processing Conference (EUSIPCO). IEEE, 2017.
  • Ribeiro et al. (2016) Ribeiro, M. T., Singh, S., and Guestrin, C. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. ACM, 2016.
  • Schedl et al. (2014) Schedl, Markus, Gómez, Emilia, Urbano, Julián, et al. Music information retrieval: Recent developments and applications. Foundations and Trends® in Information Retrieval, 8(2-3):127–261, 2014.
  • Simonyan et al. (2013) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328, 2017.
  • Zhou et al. (2016) Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929, 2016.