
Toward Interpretable Music Tagging with Self-Attention

by   Minz Won, et al.
Universitat Pompeu Fabra

Self-attention is an attention mechanism that learns a representation by relating different positions in the sequence. The transformer, which is a sequence model solely based on self-attention, and its variants achieved state-of-the-art results in many natural language processing tasks. Since music composes its semantics based on the relations between components in sparse positions, adopting the self-attention mechanism to solve music information retrieval (MIR) problems can be beneficial. Hence, we propose a self-attention based deep sequence model for music tagging. The proposed architecture consists of shallow convolutional layers followed by stacked Transformer encoders. Compared to conventional approaches using fully convolutional or recurrent neural networks, our model is more interpretable while reporting competitive results. We validate the performance of our model with the MagnaTagATune and the Million Song Dataset. In addition, we demonstrate the interpretability of the proposed architecture with a heat map visualization.





1 Introduction

Following their huge successes in computer vision (CV) and natural language processing (NLP), convolutional neural networks (CNN) and recurrent neural networks (RNN) have demonstrated their versatility in music information retrieval (MIR). Deep architectures using CNN and RNN are now the de facto state of the art in multiple MIR tasks including classification [6, 3, 28, 16], beat detection [18], music transcription [40, 9], and even music generation [32, 13]. While traditional approaches in MIR extract relevant features for the target task based on domain knowledge, especially signal processing, recent works learn the features automatically from voluminous data using deep architectures.

Automatic music tagging is a multi-label classification task to predict music tags in accordance with the music audio contents. Music tags include high-level information, such as genre (rock, jazz), mood (happy, sad), and instrumentation (violin, guitar, piano), which can be utilized for music discovery and recommendation [3]. Since CNN are powerful architectures that facilitate capturing local characteristics, their applications on music tagging could firmly establish the state-of-the-art results [28, 16].

However, we believe that music is sequential and that it composes its high-level semantics from the relations between individual components in long-term sparse positions, not only from local information. With analogous motivations, Choi et al. adopted convolutional recurrent neural networks (CRNN) [5], and Pons et al. described deep architectures in two parts: front-end and back-end [28]. The front-end, which is equivalent to the CNN part of CRNN, learns local features. The back-end, which corresponds to the RNN part of CRNN, captures the structure of the learned local features. Although these models reported remarkable results, they are not suitable for modeling long-term context. To encapsulate long-term context with a CNN back-end, deep stacks of convolutional layers followed by subsampling layers (mostly max-pooling) are required, which ends up blurring the time resolution. An RNN back-end with longer sequence inputs suffers from the demand for huge computational power and from vanishing or exploding gradients [27]. In addition, CNN for MIR are still hard to interpret, although there has been noteworthy previous research on explaining their predictions [4, 24, 25]. One possible reason is that the spectrogram-based 2D CNN models used in that research learn spectro-temporal characteristics in each layer, while music is a temporal sequence of individual audio events.

Self-attention is an attention mechanism that learns a representation by relating different positions in the sequence. It facilitates the model to learn long-term context by relating each pair of positions directly. The transformer [39], which is a sequence model solely based on self-attention, and its variants [8, 31] showed compelling results on extensive NLP tasks. Inspired by this, we propose to adopt the successful architecture to the back-end of music tagging models. By this means, one can expect not only the performance but also the interpretability.

In the following section (Section 2), we review related music tagging models and the self-attention mechanism in detail. Then we depict the architecture of the proposed model (Section 3) and dataset (Section 4). Section 5 includes experimental results, careful ablation studies, and interpretable visualization of attention maps. Finally, we conclude with future works in Section 6.

2 Related work

2.1 Deep Architectures for Music Tagging

Choi et al. proposed to use fully convolutional networks (FCN) for music tagging [3]. This architecture is also called vgg-ish CNN because it uses stacks of convolution filters as proposed in [35]. It was broadly used to solve MIR problems since one can take advantage of time-frequency invariance and its robustness to distortion.

Pons et al. exploited domain knowledge to elaborate musically motivated convolution filter designs for music tagging [28]. Vertical [30] and Horizontal [29] filters were designed to capture timbral and temporal information, respectively, and combinations of both filters could achieve superb results in music tagging. This architecture will be used as one of our baselines that uses spectrogram inputs.

Lee et al. proposed a more end-to-end oriented architecture design which uses raw audio as its input, known as sample-level CNN [20]. This architecture does not need a short-time Fourier transform (STFT) to obtain spectrograms. Sample-level CNN have proven suitable for MIR tasks [20, 16], and they are known to show better results on bigger datasets [28]. We use sample-level CNN as our other baseline, the one that uses raw audio inputs.

To interpret trained models, previous works [4, 24, 25] elaborated to visualize and auralize learned information. However, current visualization and auralization are yet less interpretable. We assume the reason is due to the model architecture, which learns local spectro-temporal information (with stacks of filters) instead of modeling the input as a sequence of individual audio events.

2.2 Self-attention

The self-attention mechanism has become a substitute for RNN to capture long-range structure within sequential data. Unlike RNN, a self-attention module computes the response at a location in a sequence by attending to all locations within the same sequence. Recently, the Transformer [39] has shown that, by solely using self-attention modules without RNN, a model can achieve state-of-the-art performance in the neural machine translation (NMT) task. Similarly, self-attention has achieved successful classification performance with interpretability in video classification [41, 43] and text classification [22] tasks.

Self-attention is also used for generative models such as generative adversarial networks (GAN) [44] and auto-regressive models [26, 13]. In particular, the Music Transformer [13] has shown that self-attention modules could model the long term dependency for musical representations using symbolic data, such as MIDI. And Wave2Midi2Wave [10] expanded the research toward raw audio by adopting the Onsets and Frames [9] to transcribe the raw audio (wave2midi), the Music Transformer [13] to generate MIDI notes, and the Wavenet [38] to generate raw audio from the MIDI notes(midi2wave).

Finally, self-attention also achieved great success in large-scale pre-trained language models such as Google BERT [8] and OpenAI GPT-2 [31].


3 Proposed architecture

In this section, we introduce the main motivation of proposed research and depict the details of front-ends and back-ends that we used.

Convolutional recurrent neural networks (CRNN) were designed to capture local characteristics and their temporal representations using convolutional layers and following recurrent layers, respectively. Motivated from successful applications of CRNN in document classification [37], image classification [45], and music transcription [34], Choi et al. adopted CRNN to automatic music tagging [5].

In the same context, Pons et al. proposed to divide deep neural networks for MIR into two parts: front-end and back-end [28]. Front-end maps input signal to a latent-space and back-end predicts the output based on the obtained representations from the front-end.

In summary, the two aforementioned models both use a CNN front-end, but one uses an RNN back-end [5] and the other uses a CNN back-end [28]. By this means, we can expect the front-end to capture local information (e.g., timbre, pitch, and chord) and the back-end to capture more structural information (e.g., rhythmic patterns, melodic contours, and chord progressions) based on the combination of the captured local components. As we explored in Section 2.2, previous research has already shown that stacked self-attention layers are robust for long-term sequence modeling. Hence, we propose a music tagging model consisting of a CNN front-end and a self-attention back-end. From now on, we name each model as 'Frontend_Backend': e.g., Spec_Att means a model using the spectrogram-based front-end and our attention-based back-end. The following subsections describe the architectures of the front-ends and back-ends that we used.

Figure 1: Spec front-end. B, C, F, and T stand for batch, channel, frequency, and time dimension.
Table 1: Filter shapes of Spec front-end and Raw front/back-end. Dimensions of filters are or .

3.1 Front-end

In accordance with previous work [28], two different front-ends were tested: 2D CNN using spectrogram inputs and 1D CNN using raw audio waveform.

The 2D CNN front-end in our experiment is an architecture that can leverage domain knowledge [28]. To facilitate learning timbral and temporal patterns, vertical and horizontal filter shapes were designed, respectively. Vertical filters [30] capture short-time spectro-temporal features. After the convolution on input spectrograms, the extracted feature maps are max-pooled along the frequency axis. By this means, the appearance of each instrument is captured while pitch-related information is ignored. Horizontal filters [29] capture temporal energy flux patterns in sequences of up to 2.6s. Horizontal filters receive spectrograms that are average-pooled along the frequency axis as their inputs. Since vertical filters have a max-pooling layer after the convolutional layer, and horizontal filters have an average-pooling layer before the convolutional layer, the frequency axis of the tensors can be flattened (see Figure 1). The two flattened feature maps are concatenated along channels. We call this spectrogram-based front-end the Spec front-end. The Spec front-end uses a spectrogram chunk of 256 frames (4.1s) as its input.

Sample-level CNN [20, 16] stack one-dimensional convolution filters with very short lengths to model the music sequence. We call the front-end using sample-level CNN Raw. Strictly speaking, there is no clear boundary between front-end and back-end in the sample-level CNN since it consists of homogeneous 1D convolutional layers. However, to examine our self-attention back-end, we regarded the first five convolutional layers as a front-end, since one frame in the feature map after these five layers covers 15.2ms of audio, which is comparable with one frame of the spectrograms (16ms). Only when we use the self-attention back-end, for a fair comparison, the Raw front-end is followed by one more convolutional layer, since vertical filters of the Spec front-end span up to 7 frames (112ms). The Raw front-end uses a raw audio chunk of 65,610 samples (4.1s). The detailed filter shapes are given in Table 1.
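The frame durations quoted above can be checked with a few lines of arithmetic. The sketch below uses the 16kHz sampling rate and the 512-sample window with 50% overlap from Section 4.3. The time reduction factor of 3 per Raw layer is an assumption (it is not stated in this section), chosen because it reproduces the quoted 15.2ms.

```python
SAMPLE_RATE = 16_000  # Hz, as stated in Section 4.3

# Spec front-end: a 512-sample window with 50% overlap gives a 256-sample hop.
hop = 512 // 2
spec_frame_ms = 1000 * hop / SAMPLE_RATE   # duration covered by one hop
spec_chunk_s = 256 * hop / SAMPLE_RATE     # 256 frames per input chunk
vertical_span_ms = 7 * spec_frame_ms       # vertical filters span up to 7 frames

# Raw front-end. ASSUMPTION: each of the five layers reduces time by a factor of 3.
raw_frame_samples = 3 ** 5
raw_frame_ms = 1000 * raw_frame_samples / SAMPLE_RATE
raw_chunk_s = 65_610 / SAMPLE_RATE
raw_frames_per_chunk = 65_610 // raw_frame_samples

print(f"Spec: {spec_frame_ms:.1f} ms/frame, {spec_chunk_s:.2f} s/chunk, "
      f"vertical span {vertical_span_ms:.0f} ms")
print(f"Raw:  {raw_frame_ms:.1f} ms/frame, {raw_chunk_s:.2f} s/chunk, "
      f"{raw_frames_per_chunk} frames/chunk")
```

Under these assumptions both front-ends hand the back-end a sequence of a few hundred frames for each 4.1s chunk, which is what makes the two inputs comparable.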

Figure 2: CNN back-end.

3.2 Back-end (CNN)

The spectrogram back-end uses stacks of 1D CNN with residual connections [11] (see Figure 2). The channel size is 512 for each layer. We denote this back-end as CNN_P, named after Pons et al. [28].

As we reviewed in the previous subsection, sample-level CNN do not have a clear boundary between front-end and back-end. For convenience, we call the latter five layers the CNN_L back-end, named after Lee et al. [20]. In the end, the Raw_CNN_L model consists of ten 1D convolutional layers as proposed in the original paper [20]. Each layer of both back-ends also uses batch normalization and ReLU non-linearity.

3.3 Back-end (Self-Attention)

Figure 3: Self-attention mechanism.

In the field of NLP, self-attention is used to build higher-level semantics by relating the components that appear in a sequence. From the given query ($Q$), the machine learns the relation between the query and keys ($K$) to compute attention scores, and multiplies the attention scores with the values ($V$). Finally, the sum of the attended values composes the semantics of the given query. For example, take the sentence “I play bass”. With “bass” alone, we don’t know if it is a fish or an instrument. We know it is an instrument from the context, because the sentence contains “play”. When we want to know the semantics of “bass” ($Q$), we calculate the attention score by comparing the distance between “bass” and the other words ($K$) in the sequence: “I”, “play”, and “bass”. In this context, for the given query “bass”, “play” will have a higher attention score than “I”, since “play” is the more important component in making “bass” an instrument. As a result, the third frame of the output feature map, which is the position of “bass”, can carry the context that “bass” is a musical instrument. Since the attention score is computed from the sentence itself, it is called self-attention or intra-attention. The Transformer [39], which is a deep stack of self-attention, uses scaled dot-product attention to compute attention scores. This can be simply described as matrix multiplications:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \tag{1}$$

where $d_k$ is the dimension of the keys and $Q$, $K$, $V$ are matrices.

We applied self-attention to the feature map that we get from the front-end convolution. Suppose a convolution feature map is given after the front-end convolution of Spec or Raw, and let $X \in \mathbb{R}^{C \times T \times F}$ or $X \in \mathbb{R}^{C \times T}$ denote the feature map, where $C$ is channels, $T$ is time, and $F$ is the frequency axis. For simplification, here we only explain with $X \in \mathbb{R}^{C \times T}$, which is a feature map of the Raw front-end. In this case, the $C$-dimensional vector of the feature map at each time bin can be regarded as a word embedding. Hence, $Q$, $K$, and $V$ of the feature map can be denoted as:

$$Q = X^\top W_Q, \quad K = X^\top W_K, \quad V = X^\top W_V \tag{2}$$

where $W_Q$, $W_K$, $W_V$ are learnable transformations. If we omit softmax and the scaling factor from Eqn (1) and apply Eqn (2):

$$(QK^\top)V = (X^\top W_Q)(X^\top W_K)^\top (X^\top W_V) = X^\top W_Q W_K^\top X X^\top W_V \tag{3}$$

which is a simple matrix multiplication form. Figure 3 shows the single self-attention layer that we described.
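As a minimal sketch of the single-head computation described above (scaled dot-product attention applied to learned projections of the front-end feature map), the following NumPy snippet may help. It is not the authors' implementation. The names `self_attention`, `W_q`, `W_k`, `W_v`, the toy sizes, and the random inputs are all illustrative assumptions, and the feature map is stored time-major with shape `(T, C)` so that each row is the vector of one time bin.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (T, C) array, one C-dimensional vector per time bin.
    W_q, W_k, W_v: (C, d_k) projection matrices (learnable in a real model).
    Returns the attended output (T, d_k) and the attention scores (T, T).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # each row sums to 1
    return scores @ V, scores

# Toy example with random values (hypothetical sizes).
rng = np.random.default_rng(0)
T, C, d_k = 6, 8, 4
X = rng.standard_normal((T, C))
W_q, W_k, W_v = (rng.standard_normal((C, d_k)) for _ in range(3))

out, scores = self_attention(X, W_q, W_k, W_v)
print(out.shape, scores.shape)
```

Row `t` of `scores` holds the weights with which time bin `t` attends to every other time bin, which is the quantity visualized later as a heat map.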

While the Transformer [39] has encoder and decoder parts to tackle the machine translation task, the Bidirectional Encoder Representations from Transformers (BERT) [8] only used the encoders of the Transformer. Since our task is to classify, not to generate, we also adopted only the encoder part, as BERT did. As shown in Figure 4, our proposed back-end uses stacks of self-attention to classify the tags of a given sequence. [CLS] is a special token that includes the overall context for the classification. We call our back-end the Att back-end, short for Attention. The self-attention that we used is multi-head attention [39].

Figure 4: Att back-end with two self-attention layers.

In this section, we described two front-ends (Spec and Raw) and three back-ends (CNN_P, CNN_L, and Att). We set Spec_CNN_P and Raw_CNN_L as our baselines, which are equivalent to the original implementations of [28] and [20], respectively. We then evaluated our back-end with the Spec_Att and Raw_Att models.

3.4 Optimization

Careful design of the learning rate schedule is critical to both convergence speed and generalization [12, 33]. ADAM [17], an adaptive optimization method, achieves fast convergence but is generally known to impede the generalization of models [14, 42]. Instead of using conventional stochastic gradient descent (SGD) or ADAM alone, we propose an optimization technique inspired by Switches from Adam to SGD (SWATS) [14].

We first optimize the network using ADAM [17] with learning rate , beta1 0.9, and beta2 0.999. After 60 epochs, we reload the model which achieved the best validation AUROC during the 60 epochs, and switch the optimizer to SGD with momentum 0.9 and Nesterov momentum. We drop the learning rate by 10% at epochs 80 and 100. In Section 5.2, we show that our proposed mixed optimization scheme generalizes better than SGD with manual learning rate scheduling. Note that our proposed method loads the best model weights during training, while SWATS [14] switches the optimizer without changing the weights.
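The control flow of this schedule can be sketched independently of any deep learning framework. The snippet below is only an illustration under stated assumptions: `plan_epoch` and `BestCheckpoint` are hypothetical names, epochs are counted from 0 as the number of completed epochs, and reloading the best weights at the learning rate drops follows the description in Section 5.2. The actual learning rate values are deliberately left out.

```python
def plan_epoch(epoch, switch_epoch=60, lr_drop_epochs=(80, 100)):
    """Describe what happens at the start of `epoch` (number of completed epochs).

    ADAM is used before `switch_epoch`, SGD with Nesterov momentum afterwards.
    The best-so-far weights are reloaded when the optimizer is switched and,
    following Section 5.2, whenever the learning rate is dropped.
    """
    optimizer = "adam" if epoch < switch_epoch else "sgd_nesterov"
    drop_lr = epoch in lr_drop_epochs
    reload_best = epoch == switch_epoch or drop_lr
    return {"optimizer": optimizer, "reload_best": reload_best, "drop_lr": drop_lr}


class BestCheckpoint:
    """Keep the weights that achieved the best validation AUROC so far."""

    def __init__(self):
        self.best_auroc = float("-inf")
        self.weights = None

    def update(self, auroc, weights):
        if auroc > self.best_auroc:
            self.best_auroc, self.weights = auroc, weights


# Print the milestones of the schedule.
for epoch in (0, 59, 60, 80, 100):
    print(epoch, plan_epoch(epoch))
```

In a training loop, `BestCheckpoint.update` would be called after each validation pass, and its stored weights restored whenever `reload_best` is true.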

4 Dataset

4.1 MagnaTagATune

The MagnaTagATune (MTAT) dataset [19] consists of 26k annotated audio clips with 30s duration. We only used top-50 tags as proposed in the previous work [3] and followed the same data split of other research [3, 28, 20, 16]. Although the aforementioned works share the same data split, two recent works [20, 16] only used a refined subset. They removed songs that do not contain any of top-50 tags and 21k songs remained instead of 26k. Since this subset is more informative, we also used this in our experiments.

4.2 Million Song Dataset

For scalable research, we also explored a subset of the Million Song Dataset (MSD) with Last.FM tags [1]. Again, top-50 tags were selected [3] and audio clips shorter than 29.1s were discarded. As a result, 242k songs were available. We followed the data split of [20], [16] and [28].

4.3 Preprocessing

We investigate two different types of input: raw audio and log mel-spectrogram. For comparability, we decided to use a 16kHz sampling rate for both inputs. The Essentia library [2] was used to load and downsample the audio.

To get the log mel-spectrograms, a Hann window of 512 samples with 50% overlap was used and the number of mel-bins was set to 96. The Librosa library [23] was used for this step. We did not normalize the dataset. Instead, the CNN has batch normalization in the first layer.

5 Results

| Front-end | Back-end | MTAT AUROC | MTAT AUPR | MSD AUROC | MSD AUPR |
|---|---|---|---|---|---|
| Raw [20] | CNN [20] | 90.62 | 44.20 | 88.42 | - |
| Raw [20] | Att (Ours) | 90.66 | 44.21 | 88.07 | 29.90 |
| Spec [28] | CNN [28] | 90.89 | 45.03 | 88.75 | 31.24 |
| Spec [28] | Att (Ours) | 90.80 | 44.39 | 88.14 | 30.47 |

Table 2: Comparison of state-of-the-art music tagging models on MTAT and MSD. The results marked with (*) on top are reported values from the reference papers.
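| # heads | # layers | AUROC | AUPR |
|---|---|---|---|
| 1 | 2 | 87.73 | 36.93 |
| 2 | 2 | 89.40 | 41.20 |
| 3 | 2 | 90.23 | 43.23 |
| 4 | 2 | 90.40 | 43.89 |
| 5 | 2 | 90.60 | 43.91 |
| 6 | 2 | 90.61 | 44.39 |
| 7 | 2 | 90.74 | 44.43 |
| 8 | 2 | 90.80 | 44.39 |
| 8 | 1 | 90.54 | 44.12 |
| 8 | 2 | 90.80 | 44.39 |
| 8 | 3 | 90.19 | 43.22 |

Table 3: Impact of the number of attention heads and layers on MTAT.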

5.1 Quantitative Results

Following previous research [28], we report the Area Under the Precision-Recall curve (AUPR) along with the conventional Area Under the Receiver Operating Characteristic curve (AUROC). AUPR is known to be more informative for evaluating an algorithm’s performance on highly skewed datasets [7]. Since we are using user-generated tags (MTAT and MSD), their distributions are skewed by popularity bias. Although we use AUROC to choose the best model, the model with the best AUROC is not always the best in both metrics (see Table 3).

Table 2 shows AUROC and AUPR of the baseline models and our proposed models. Each value in the table is the average of three different runs. As shown in the table, our proposed Att back-end reports competitive results for both datasets.

Figure 5: Comparison of optimizers: ADAM, SGD, and our proposed method.
| Input length | # layers | AUROC | AUPR |
|---|---|---|---|
| 256 | 2 | 90.80 | 44.39 |
| 1024 | 2 | 89.62 | 41.61 |
| 1024 | 3 | 89.85 | 42.25 |
| 1024 | 4 | 89.86 | 41.84 |

Table 4: AUROC and AUPR results on MTAT using proposed Spec_Att models with longer input sequence.

5.2 Ablation Study

Attention Parameters. Choosing an appropriate number of attention layers and heads can be crucial for designing better models. As shown in Table 3, attention layers more than 2 did not show significant improvement and 8 attention heads reported the best performance. Hence, we fixed the number of attention layers and attention heads in our experiments as 2 and 8, respectively. Note that this setup is optimized for 4.1s inputs.

Optimization. As we depicted in Section 3.4, we used our novel optimization method. By adopting ADAM [17] in the beginning, we expected faster convergence than SGD. As shown in Figure 5, ADAM and our optimization method show a steeper learning curve than SGD. However, AUROC and AUPR of ADAM go down after around 100 epochs, which means it failed to generalize the model. Since we switch our model to SGD at 60 epochs, it shows more stable learning curve than ADAM only. Although this switch point is an arbitrary point, our optimization method can generalize the model well because we load the best model during the training when we switch the optimizer or learning rate — we used AUROC to choose the best model.

Longer Sequence. In our main experiment, we only used relatively short audio chunks (4.1s) as our input for a fair comparison, since sample-level CNN used short chunks. However, as we explained in Section 2.2, self-attention is known to be effective at modeling long-term sequences. We evaluated the Spec_Att model on MTAT using 1024 frames (16.4s) and observed slightly lower but comparable results (see Table 4). More stacks of self-attention layers were required to model the longer sequence.

Figure 6: Attention heat maps for the tags (a) Beats, (b) Female, and (c) Quiet. More results are illustrated in the appendix.

Figure 7: Tag-wise contribution heat maps on concatenated spectrograms: (a) Piano + Flute, (b) Techno + Classic, and (c) Quiet + Loud. From the top, concatenated spectrograms, contribution heat maps to the first tags (Piano, Techno, and Quiet, respectively), and contribution heat maps to the second tags (Flute, Classic, and Loud, respectively). We report more results in the appendix.

Although self-attention is a powerful mechanism to model long sequential data, the amount of required memory increases quadratically as the sequence gets longer, because we use dot-product attention.

To secure a larger receptive field, the Image Transformer [26] used local self-attention. The Compact Generalized Non-Local Network (CGNL) [43] approximated the calculation of self-attention via a trilinear equation with Taylor expansion. To capture longer-term context from music audio, we can use these techniques to reduce the complexity of the model efficiently.
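The quadratic growth can be made concrete by counting the entries of the attention score matrices for the two input lengths used in the experiments. This is a back-of-the-envelope sketch: `attention_score_entries` is a hypothetical helper, the special classification token is ignored, and only the score matrices (one `seq_len x seq_len` matrix per head and layer) are counted, not the other activations.

```python
def attention_score_entries(seq_len, num_heads=8, num_layers=2):
    """Number of attention score entries: one seq_len x seq_len matrix per head and layer."""
    return num_layers * num_heads * seq_len ** 2

short = attention_score_entries(256)    # 4.1s input
long = attention_score_entries(1024)    # 16.4s input
print(short, long, long // short)
```

A 4 times longer input needs 16 times as many score entries at the same depth, which is why local or approximated attention becomes attractive for longer music excerpts.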

5.3 Visualization

To interpret the proposed model, we provide two different visualizations: the attention heat map and the tag-wise contribution heat map. While the attention heat map shows where the trained model pays more attention, the tag-wise contribution heat map highlights which part of the input spectrogram is more relevant for predicting the given tag.

Attention Heat Map. To understand the behavior of the model, it is important to know which part of the audio the machine pays more attention to. To this end, we summed up the attention scores from each attention head and visualized them. The attention score of a single attention head can be described as:

$$\text{Attention score} = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) \tag{4}$$

Figure 6 shows log mel-spectrograms and the corresponding attention heat maps. For simplification, we only visualized the attention heat map of the last attention layer. As we can see in Figure 6(a) and Figure 6(b), the model pays more attention to the relevant parts of the spectrograms. However, we discovered one interesting behavior: the model always pays attention to the parts with audio events. For example, in Figure 6(c), the model pays attention to the loud part of the audio although the given spectrogram was classified as “quiet”. We could also observe this behavior for negative tags such as “no vocal”, “no vocals”, and “no voice”. One possible reason is that the model pays attention to the more informative part of the spectrogram. Indeed, negative tags report relatively worse AUROC than other tags. Although attention heat maps can pinpoint where the machine pays attention for the decision, they cannot provide reasons for the classification or tagging.
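Summing attention scores over heads can be sketched as follows. The text does not say how the per-head score matrices are reduced to one curve over time, so this sketch assumes that the row belonging to the classification token (index 0) is taken from each head. `attention_heat_map`, the toy sizes, and the random scores are illustrative only.

```python
import numpy as np

def attention_heat_map(head_scores, query_index=0):
    """Sum the attention paid by one query position across all heads.

    head_scores: (H, T, T) array of per-head attention scores (rows sum to 1).
    query_index: ASSUMPTION, the position of the classification token.
    Returns a (T,) array: one heat value per time bin.
    """
    head_scores = np.asarray(head_scores)
    return head_scores[:, query_index, :].sum(axis=0)

# Toy scores: random logits turned into row-wise softmax distributions.
rng = np.random.default_rng(0)
H, T = 8, 5
logits = rng.standard_normal((H, T, T))
scores = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

heat = attention_heat_map(scores)
print(heat.shape)
```

Because each head contributes a distribution over time bins, the heat values sum to the number of heads and can be normalized before plotting.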

Tag-wise Contribution Heat Map. Understanding which part of the audio is more relevant to each tag is also important for interpreting the model. We manually changed the attention score of the last attention layer. For each time step, we set its attention score to 1 and the scores of all other time steps to 0, so that we can see the contribution of each time bin to each tag. This tag-wise contribution heat map is inspired by the manual attention weight adjustment proposed in [21]. To compare the contributions of different audio, we concatenated two spectrograms and fed them through the network. For instance, Figure 7(a) is a concatenated spectrogram of piano (left half) and flute (right half). The first-row heat map highlights the contribution of each time bin to “piano” and the second row is for “flute”. We repeated this for genre (Figure 7(b)) and mood (Figure 7(c)). As shown in Figure 7(c), the tag-wise contribution heat map can provide more information about the tag-specific part of the audio, which could not be observed from the attention heat map (Figure 6(c)).
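The manipulation described above can be illustrated with a deliberately simplified model: a single attention head whose attended output is fed directly into a linear layer with a sigmoid, one output per tag. The real back-end has multiple heads and additional layers, so this is only a sketch of the idea. `tagwise_contribution`, `W_out`, `b_out`, and all sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tagwise_contribution(V, W_out, b_out):
    """Tag probabilities obtained when attention is forced onto one time bin at a time.

    V: (T, d) value vectors of the last attention layer.
    W_out, b_out: (d, n_tags) weights and (n_tags,) bias of a hypothetical classifier.
    Returns a (T, n_tags) array; column j is the contribution curve of tag j.
    """
    T = V.shape[0]
    contributions = np.empty((T, W_out.shape[1]))
    for t in range(T):
        attn = np.zeros(T)
        attn[t] = 1.0              # attention score 1 at time t, 0 elsewhere
        attended = attn @ V        # with one-hot attention this equals V[t]
        contributions[t] = sigmoid(attended @ W_out + b_out)
    return contributions

# Toy example with random values.
rng = np.random.default_rng(0)
T, d, n_tags = 10, 4, 2
V = rng.standard_normal((T, d))
W_out = rng.standard_normal((d, n_tags))
b_out = np.zeros(n_tags)

heat = tagwise_contribution(V, W_out, b_out)
print(heat.shape)
```

Plotting each column of `heat` against time gives one contribution curve per tag, which corresponds to one row of heat map under the concatenated spectrogram.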

6 Conclusion

In this paper, we proposed a novel deep sequence model for music tagging which can facilitate better interpretability. The proposed model consists of a CNN front-end and a self-attention back-end. Experiments on the MTAT dataset and MSD showed competitive results, and we demonstrated the interpretability of the model by visualizing attention heat maps and tag-wise contribution heat maps. By leveraging the acquired interpretation, one can obtain better intuition for the model design. Since the proposed architecture is not task-specific, it can be extended to a broad range of MIR tasks such as beat detection, rhythm classification, or music transcription.

7 Acknowledgement

This work was funded by the predoctoral grant MDM-2015-0502-17-2 from the Spanish Ministry of Economy and Competitiveness linked to the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502). Also, we acknowledge that the experiments were carried out on the NAVER Smart Machine Learning (NSML) GPU platform [36, 15].


  • [1] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere. The million song dataset. 2011.
  • [2] D. Bogdanov, N. Wack, E. Gómez Gutiérrez, S. Gulati, P. Herrera Boyer, O. Mayor, G. Roma Trepat, J. Salamon, J. R. Zapata González, and X. Serra. Essentia: An audio analysis library for music information retrieval. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2013.
  • [3] K. Choi, G. Fazekas, and M. Sandler. Automatic tagging using deep convolutional neural networks. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016.
  • [4] K. Choi, G. Fazekas, and M. Sandler. Explaining deep convolutional neural networks on music classification. arXiv preprint arXiv:1607.02444, 2016.
  • [5] K. Choi, G. Fazekas, M. Sandler, and K. Cho. Convolutional recurrent neural networks for music classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
  • [6] K. Choi, G. Fazekas, M. Sandler, and K. Cho. Transfer learning for music classification and regression tasks. Proceedings of the International Society of Music Information Retrieval Conference (ISMIR), 2017.
  • [7] J. Davis and M. Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the International Conference on Machine learning (ICML), 2006.
  • [8] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [9] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck. Onsets and frames: Dual-objective piano transcription. arXiv preprint arXiv:1710.11153, 2017.
  • [10] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [12] E. Hoffer, I. Hubara, and D. Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2017.
  • [13] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck. Music transformer. In Proceedings of the International Conference on Learning Representations (ICLR), 2019.
  • [14] N. S. Keskar and R. Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.
  • [15] H. Kim, M. Kim, D. Seo, J. Kim, H. Park, S. Park, H. Jo, K. Kim, Y. Yang, Y. Kim, et al. NSML: Meet the mlaas platform with a real-world case study. arXiv preprint arXiv:1810.09957, 2018.
  • [16] T. Kim, J. Lee, and J. Nam. Sample-level cnn architectures for music auto-tagging using raw waveforms. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
  • [17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [18] F. Krebs, S. Böck, M. Dorfer, and G. Widmer. Downbeat tracking using beat synchronous features with recurrent neural networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2016.
  • [19] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie. Evaluation of algorithms using games: The case of music tagging. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2009.
  • [20] J. Lee, J. Park, K. L. Kim, and J. Nam. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. Proceedings of the Sound and Music Computing Conference (SMC), 2017.
  • [21] J. Lee, J.-H. Shin, and J.-S. Kim. Interactive visualization and manipulation of attention-based neural machine translation. In Proceedings of the conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, 2017.
  • [22] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • [23] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the python in science conference, 2015.
  • [24] S. Mishra, B. L. Sturm, and S. Dixon. Local interpretable model-agnostic explanations for music content analysis. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017.
  • [25] S. Mishra, B. L. Sturm, and S. Dixon. Understanding a deep machine listening model through feature inversion. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2018.
  • [26] N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image transformer. Proceedings of the International Conference on Machine learning (ICML), 2018.
  • [27] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine learning (ICML), 2013.
  • [28] J. Pons, O. Nieto, M. Prockup, E. Schmidt, A. Ehmann, and X. Serra. End-to-end learning for music audio tagging at scale. Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2018.
  • [29] J. Pons and X. Serra. Designing efficient architectures for modeling temporal features with convolutional neural networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
  • [30] J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra. Timbre analysis of music audio signals with convolutional neural networks. In Proceedings of the European Signal Processing Conference (EUSIPCO), 2017.
  • [31] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2018.
  • [32] A. Roberts, J. Engel, and D. Eck. Hierarchical variational autoencoders for music. In NIPS Workshop on Machine Learning for Creativity and Design, 2017.
  • [33] S. Seong, Y. Lee, Y. Kee, D. Han, and J. Kim. Towards flatter loss surface via nonmonotonic learning rate scheduling. In Conference on Uncertainty in Artificial Intelligence (UAI), 2018.
  • [34] S. Sigtia, E. Benetos, and S. Dixon. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
  • [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. Proceedings of the International Conference on Learning Representations (ICLR), 2015.
  • [36] N. Sung, M. Kim, H. Jo, Y. Yang, J. Kim, L. Lausen, Y. Kim, G. Lee, D. Kwak, J.-W. Ha, et al. NSML: A machine learning platform that enables you to focus on your models. arXiv preprint arXiv:1712.05902, 2017.
  • [37] D. Tang, B. Qin, and T. Liu. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
  • [38] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. Speech Synthesis Workshop (SSW), 2016.
  • [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), pages 5998–6008, 2017.
  • [40] R. Vogl, M. Dorfer, G. Widmer, and P. Knees. Drum transcription via joint beat and drum modeling using convolutional recurrent neural networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2017.
  • [41] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [42] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2017.
  • [43] K. Yue, M. Sun, Y. Yuan, F. Zhou, E. Ding, and F. Xu. Compact generalized non-local network. Proceedings of the Advances in Neural Information Processing Systems (NIPS), 2018.
  • [44] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. Proceedings of the International Conference on Machine learning (ICML), 2019.
  • [45] Z. Zuo, B. Shuai, G. Wang, X. Liu, X. Wang, B. Wang, and Y. Chen. Convolutional recurrent neural networks: Learning spatial dependencies for image representation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Appendix A Per tag AUROC on MTAT

Table 5 reports the per-tag AUROC on MTAT in descending order. Note that our model performs worst on the negative tags ‘no voice’, ‘no vocal’, and ‘no vocals’, which have the three lowest scores (67.88, 71.86, and 70.22, respectively).

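| Tag | AUROC | Tag | AUROC | Tag | AUROC | Tag | AUROC | Tag | AUROC |
|---|---|---|---|---|---|---|---|---|---|
| metal | 98.79 | choral | 98.68 | choir | 98.65 | rock | 98.52 | opera | 98.34 |
| flute | 97.91 | harpsichord | 97.85 | cello | 96.61 | techno | 96.51 | dance | 96.08 |
| ambient | 95.69 | piano | 95.69 | harp | 95.12 | country | 94.05 | pop | 94.01 |
| sitar | 93.99 | man | 93.70 | woman | 93.53 | female | 93.41 | beat | 93.23 |
| female vocal | 92.99 | male | 92.91 | violin | 92.62 | male vocal | 92.34 | beats | 92.25 |
| classical | 92.15 | loud | 91.30 | female voice | 91.05 | guitar | 90.98 | quiet | 90.89 |
| solo | 90.36 | drums | 89.99 | indian | 89.99 | male voice | 89.79 | singing | 89.77 |
| electronic | 89.37 | fast | 89.15 | vocal | 88.81 | new age | 88.73 | classic | 88.66 |
| strings | 88.59 | vocals | 88.11 | synth | 86.17 | voice | 84.60 | slow | 84.09 |
| soft | 83.63 | weird | 81.68 | no vocal | 71.86 | no vocals | 70.22 | no voice | 67.88 |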
Table 5: Per tag AUROC on MTAT
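
As an illustration of how a per-tag ranking like this can be produced, the following is a minimal sketch and not the authors' evaluation code. It computes AUROC independently for each tag with the rank-sum (Mann-Whitney U) identity and sorts the tags in descending order. The function names `auroc` and `per_tag_auroc` and the toy labels and scores are hypothetical; the values in Table 5 appear to be AUROC scaled by 100, which the sketch mirrors when printing.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank shared by the ties
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC is undefined without both classes")
    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def per_tag_auroc(tags, y_true, y_score):
    """Return (tag, AUROC) pairs sorted in descending order of AUROC."""
    results = []
    for t, tag in enumerate(tags):
        labels = [row[t] for row in y_true]
        scores = [row[t] for row in y_score]
        results.append((tag, auroc(labels, scores)))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy data: 4 clips x 2 tags (hypothetical values, not from the paper).
tags = ["rock", "no vocal"]
y_true = [[1, 0], [0, 1], [1, 0], [0, 1]]
y_score = [[0.9, 0.6], [0.2, 0.4], [0.8, 0.3], [0.1, 0.7]]
ranking = per_tag_auroc(tags, y_true, y_score)
for tag, value in ranking:
    print(f"{tag}: {100 * value:.2f}")
```

Computing the metric per tag, rather than only averaging over all tags, is what exposes weak spots such as the negative tags at the bottom of the table.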

Appendix B More Results on Attention Heat Maps

We report additional attention heat maps for several tag categories: voice (Figure 8), mood (Figure 9), instrument (Figure 10), and genre (Figure 11).

Appendix C Tag-wise Contribution Heat Maps

Additional tag-wise contribution heat maps are shown in Figure 12.

Figure 8: Attention heat maps for voice tags. Panels: (a) Male, (b) Female, (c) Vocal, (d) No Vocal.

Figure 9: Attention heat maps for mood tags. Panels: (a) Quiet, (b) Loud, (c) Slow, (d) Fast, (e) Soft, (f) Weird.

Figure 10: Attention heat maps for instrument tags. Panels: (a) Cello, (b) Sitar, (c) Harp, (d) Piano, (e) Strings, (f) Flute, (g) Drums, (h) Violin, (i) Guitar, (j) Synth.

Figure 11: Attention heat maps for genre tags. Panels: (a) Classic, (b) Country, (c) Opera, (d) New Age, (e) Rock, (f) Metal, (g) Pop, (h) Dance, (i) Electronic, (j) Techno.

Figure 12: Tag-wise contribution heat maps. Panels: (a) Female + Male, (b) Classic + Metal, (c) Vocal + No Vocals, (d) No Voice + Choir, (e) Slow + Fast, (f) Drums + Harp.