Current state-of-the-art systems for music auto-tagging from audio are based on deep learning, in particular convolutional neural networks (CNNs), and follow two different approaches: one uses the audio directly as input (end-to-end models) and the other uses spectrograms as input [9, 4]. Previous works suggest that the two approaches can achieve comparable performance when applied to large datasets.
We can distinguish two architectures among the spectrogram-based CNN solutions, depending on whether they use multiple convolutional layers of small filters [6, 3] or multiple filter shapes [17, 16, 15]. The former is borrowed from the computer vision field (VGG) and performs well without prior domain knowledge, while the latter is based on such knowledge and employs filters designed to capture information relevant for music auto-tagging, such as timbre or rhythm. Mel-spectrograms are commonly used with such architectures, although constant-Q [13, 4] and raw short-time Fourier transform (STFT) representations can also be applied.
In this paper, we compare the performance of two state-of-the-art CNN approaches to music auto-tagging [3, 15] using different mel-spectrogram representations as input. We study how reducing the size of the input spectrograms, in terms of both fewer frequency bands and lower frame rates (larger hop sizes), affects the performance. We show that by reducing the frequency and time resolution we can train the network faster with only a small decrease in performance. The results of this study can help to build faster CNN models and to reduce the amount of data to be stored and transferred, optimizing resources when handling large collections of music.
2 Related work
Only a few previous studies have compared different spectrogram representations for CNN architectures. Instead, it is common to focus on tuning model hyper-parameters with a fixed input. The spectrogram input is chosen empirically and often follows approaches previously reported in the literature. Very little information comparing different inputs is available, as authors tend to report only the most successful approaches. Also, as the existing studies on music auto-tagging focus on optimizing accuracy metrics, few works aim to simplify networks and their inputs for computational efficiency or consider the practical aspects of storing spectrogram representations efficiently.
To the best of our knowledge, there is no systematic comparison of mel-spectrogram representations. The only work we are aware of in this direction has been done by Choi et al., where the authors compare model performance under different pre-processing strategies such as scaling, log-compression, and frequency weighting. The same authors also provide an overview of different inputs that can be used for the auto-tagging task. In relation to mel-spectrograms, they suggest that one can optimize the input to the network by changing some of the signal processing parameters, such as the sampling rate, window size, hop size or number of mel bins. These optimizations can help to minimize data size and train the networks more efficiently; however, no quantitative evaluations are provided.
Researchers in music auto-tagging commonly use the MagnaTagATune dataset to evaluate multiple settings and then repeat some settings on the Million Song Dataset to validate differences in performance on a larger scale [9, 3, 15]. It is important to note that both datasets contain unbalanced and noisy and/or weakly-labeled annotations and are therefore challenging to work with, as the reliability of the conducted evaluations may be affected. Still, these are the two most commonly used datasets for benchmarking due to the availability of audio.
3.1 MagnaTagATune (MTAT)
The MagnaTagATune dataset contains multi-label annotations of genre, mood and instrumentation for 25,877 audio segments. Each segment is 30 seconds long, and the dataset contains multiple segments per song. All the audio is in MP3 format with a 32 Kbps bitrate and a 16 KHz sample rate. The dataset is split into 16 folders, and researchers commonly use the first 12 folders for training, the 13th for validation and the last three for testing. Also, only the 50 most frequent tags are typically used for evaluation. These tags include genre and instrumentation labels, as well as eras (e.g., ’80s’ and ’90s’) and moods.
3.2 Million Song Dataset (MSD)
The MSD is a large dataset of audio features, expanded by the MIR community with additional information including tags, lyrics and other annotations. It also contains a subset mapped by researchers to 30-second audio previews available at 7digital and collaborative tags from Last.fm. This subset contains 241,904 annotated track fragments and is commonly used as another, larger-scale benchmark for music auto-tagging systems. The tags cover genre, instrumentation, moods and eras. Audio fragments vary in quality: they are encoded as MP3 with bitrates ranging from 64 to 128 Kbps and sample rates of 22 KHz or 44 KHz.
4 Baseline architectures
In this work, we reproduce two CNN architectures, applying them to mel-spectrograms with reduced frequency and time resolution. These architectures are among the best performing according to existing evaluations on the MTAT and MSD datasets:
VGG applied for music (VGG-CNN). This architecture contains multiple layers of small 2D filters, as it has been adapted from the computer vision field. It is a fully-convolutional network consisting of four convolutional layers with small 3×3 filters (number of mel bands × number of frames) and the max pooling (MP) settings presented in Table 1. The network operates on 96-band mel-spectrograms for 29.1 s audio segments, with a 12 KHz sample rate, a frame size of 512 samples and a hop size of 256 samples.
Musically-motivated CNN (MUSICNN). This architecture contains more filters of different shapes in the first layer, designed to capture musically relevant information such as timbre (38×1, 38×3, 38×7, 86×1, 86×3, 86×7) and temporal patterns (1×32, 1×64, 1×128, 1×165). The convolution results are concatenated and passed to three additional convolutional layers including residual connections (we refer the readers to the original paper for all architecture details). The original network operates on 96-band mel-spectrograms computed on shorter 15 s audio segments with a 16 KHz sample rate, a frame size of 512 samples and a hop size of 256 samples (frame and hop size settings were confirmed in personal communication with the author). It then averages tag activation scores across multiple segments of the same audio input.
For evaluation on MTAT and MSD, we use batch normalization, Adam as the optimization method with a learning rate of 0.001, and binary cross-entropy as the loss function for both architectures, following their authors.
| Input | Mel-spectrogram (96×1366×1) |
| Layer 1 | Conv 3×3×128 |
| | MP (2, 4) (output: 48×341×128) |
| Layer 2 | Conv 3×3×384 |
| | MP (4, 5) (output: 24×85×384) |
| Layer 3 | Conv 3×3×768 |
| | MP (3, 8) (output: 12×21×768) |
| Layer 4 | Conv 3×3×2048 |
| | MP (4, 8) (output: 1×1×2048) |
We computed mel-spectrograms using typical settings for the MTAT dataset in the state of the art [3, 15]. The most common settings are a 12 KHz or 16 KHz sample rate, frame and hop sizes of 512 and 256 samples, respectively, and a Hann window function. Commonly, 96 or 128 mel bands are used, covering the full frequency range below Nyquist (6 KHz and 8 KHz, respectively) and computed using Slaney’s mel scale implementation. To normalize the mel-spectrograms we considered two log-compression alternatives, denoted “dB” and “log”.
|sample rate||# mel||hop size||log type|
|12 KHz||128||log, dB|
|12 KHz||96||log, dB|
|12 KHz||48||log, dB|
|12 KHz||32||log, dB|
|12 KHz||24||log, dB|
|12 KHz||16||log, dB|
|12 KHz||8||log, dB|
|16 KHz||128||log, dB|
|16 KHz||96||log, dB|
|16 KHz||48||log, dB|
|16 KHz||32||log, dB|
|16 KHz||24||log, dB|
|16 KHz||16||log, dB|
|16 KHz||8||log, dB|
Starting with these settings, we then considered different variations in frequency and time resolution (fewer mel bands and larger hop sizes). Table 2 shows all the spectrogram configurations that we evaluated on the MTAT dataset. Each configuration results in a different dimension of the feature matrix (the number of mel bands × the number of frames). An audio segment of 29.1 seconds corresponds to 1366 and 1820 frames for the 12 KHz and 16 KHz sample rates, respectively, when no temporal reduction is applied (×1 hop size). In turn, the maximum reduction we considered (×10 hop size) results in 137 and 182 frames.
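The frame counts quoted above can be approximated with a small calculation. The sketch below assumes that the number of frames shrinks in proportion to the hop-size multiplier and is rounded up; the helper name `reduced_frames` and the rounding rule are illustrative assumptions, not taken from the original implementation.

```python
import math

# Frame counts at the reference hop size of 256 samples, as stated in the text.
BASE_FRAMES = {"12 KHz": 1366, "16 KHz": 1820}

def reduced_frames(base_frames: int, hop_multiplier: int) -> int:
    """Approximate frame count when the hop size is multiplied by hop_multiplier.

    Assumption: frames scale inversely with the hop size and are rounded up.
    """
    return math.ceil(base_frames / hop_multiplier)

for rate, base in BASE_FRAMES.items():
    counts = {m: reduced_frames(base, m) for m in (1, 2, 3, 4, 5, 10)}
    print(rate, counts)
```

Under this assumption, the largest multiplier reproduces the 137 and 182 frames reported for the maximum reduction.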
All spectrograms were computed using the Essentia music audio analysis library (https://essentia.upf.edu). It was configured to reproduce the mel-spectrograms of LibROSA (https://librosa.github.io), another analysis library used by the state of the art, for compatibility. To give a better understanding of what information these spectrograms are able to capture, we provide a number of examples sonifying the resulting mel-spectrograms for all considered frequency and time resolutions online (https://andrebola.github.io/ICASSP2020/demos).
6 Baseline architecture adjustments
In this section we explain the changes introduced to the original model architectures presented in Section 4.
We try to preserve the original architecture in terms of the size and number of filters in each layer, but we need to adjust the max pooling settings since we are reducing the dimensions of the mel-spectrogram input. We report all such modifications for the VGG-CNN architecture in Table 3. It reports the sizes of the max-pooling windows in each layer, selected according to the number of mel bands and the hop size. We prioritize changes to max pooling in the later layers when possible. We adjust the pooling size to match the input dimensions when possible; otherwise padding is applied. For the 16 KHz sample rate, more adjustments to VGG-CNN are necessary because, with a fixed reference hop size of 256 samples, the higher sample rate implies finer temporal resolution and therefore larger mel-spectrograms (1820 frames).
It is important to note that if we change the resolution of the input, the 3×3 filters in VGG-CNN capture different ranges of frequency and temporal information. For example, they cover twice the mel-frequency range and a doubled time interval when using 48 mel bands and a ×2 hop size. This can be an advantage, because it reduces the amount of information that the network needs to learn.
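The doubling described above can be illustrated with simple arithmetic. The sketch below is an approximation: it measures the time covered by a filter in hops only and ignores the 512-sample analysis window, and the function name `filter_coverage` is a hypothetical helper introduced for illustration.

```python
def filter_coverage(n_mels, hop_size, sample_rate, bins=3, frames=3):
    """Return (fraction of the mel axis, approximate seconds) covered by one filter.

    Assumption: the time extent is counted in hops and the window length is ignored.
    """
    mel_fraction = bins / n_mels
    seconds = frames * hop_size / sample_rate
    return mel_fraction, seconds

# Reference input: 96 mel bands, hop size of 256 samples, 12 KHz sample rate.
base = filter_coverage(96, 256, 12000)
# Reduced input: 48 mel bands and a doubled hop size (512 samples).
reduced = filter_coverage(48, 512, 12000)

print("mel coverage ratio:", reduced[0] / base[0])
print("time coverage ratio:", reduced[1] / base[1])
```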
| hop size | max-pooling size (time), 12 KHz | max-pooling size (time), 16 KHz |
| ×1 | 4, 5, 8, 8 | 4, 5, 9, 10 |
| ×2 | 4, 5, 8, 4 | 4, 5, 9, 5 |
| ×3 | 4, 5, 8, 2 | 4, 5, 9, 3 |
| ×4 | 4, 5, 8, 2 | 4, 5, 9, 2 |
| ×5 | 4, 5, 8, 1 | 4, 5, 9, 2 |
| ×10 | 4, 5, 4, 1 | 4, 5, 9, 1 |

| # mel | max-pooling size (frequency) |
| 128 | 2, 4, 4, 4 |
| 96 | 2, 4, 3, 4 |
| 48 | 2, 4, 3, 2 |
| 32 | 2, 2, 3, 2 |
| 24 | 2, 2, 3, 2 |
| 16 | 2, 2, 2, 2 |
| 8 | 2, 2, 2, 1 |
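As a sanity check on the pooling settings in Table 3, the following sketch applies each sequence of pooling sizes to the corresponding input dimension. It assumes that convolutions preserve the input size, that pooling uses floor division (no padding), and that frame counts for larger hop sizes are obtained by rounding up; these are simplifying assumptions and may differ from the actual implementation, which applies padding in some cases.

```python
import math

BASE_FRAMES = {12000: 1366, 16000: 1820}

# Max-pooling sizes along time, indexed by sample rate and hop-size multiplier.
TIME_POOLS = {
    12000: {1: (4, 5, 8, 8), 2: (4, 5, 8, 4), 3: (4, 5, 8, 2),
            4: (4, 5, 8, 2), 5: (4, 5, 8, 1), 10: (4, 5, 4, 1)},
    16000: {1: (4, 5, 9, 10), 2: (4, 5, 9, 5), 3: (4, 5, 9, 3),
            4: (4, 5, 9, 2), 5: (4, 5, 9, 2), 10: (4, 5, 9, 1)},
}

# Max-pooling sizes along frequency, indexed by number of mel bands.
FREQ_POOLS = {
    128: (2, 4, 4, 4), 96: (2, 4, 3, 4), 48: (2, 4, 3, 2), 32: (2, 2, 3, 2),
    24: (2, 2, 3, 2), 16: (2, 2, 2, 2), 8: (2, 2, 2, 1),
}

def pooled_size(size, pools):
    """Apply successive max-pooling steps (floor division) to one dimension."""
    for p in pools:
        size //= p
    return size

time_out = {
    (rate, mult): pooled_size(math.ceil(BASE_FRAMES[rate] / mult), pools)
    for rate, table in TIME_POOLS.items()
    for mult, pools in table.items()
}
freq_out = {n_mels: pooled_size(n_mels, pools) for n_mels, pools in FREQ_POOLS.items()}

print(time_out)
print(freq_out)
```

Under these assumptions every configuration collapses to a single value per channel after the fourth layer, which is the output shape the fully-convolutional design expects.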
In the original model, the sizes of the timbre filters in frequency are computed relative to the number of mel bands (90% and 40%). We preserve the same relation when we change this number. In our implementation we modified the segment size to 3 seconds, as we obtained slightly better results in our preliminary evaluation (similar to suggestions by other researchers reproducing this model). We keep the temporal dimension of the filters (the number of frames) intact for all considered mel-spectrogram settings.
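The relation between the number of mel bands and the timbre filter heights can be sketched as follows. Rounding down to an integer number of bins is an assumption made here for illustration, and `timbre_filter_heights` is a hypothetical helper name.

```python
def timbre_filter_heights(n_mels):
    """Return the (90%, 40%) timbre filter heights in mel bins.

    Assumption: heights are rounded down; integer arithmetic avoids float error.
    """
    return (n_mels * 9) // 10, (n_mels * 4) // 10

for n_mels in (128, 96, 48):
    print(n_mels, timbre_filter_heights(n_mels))
```

For 96 mel bands this gives heights of 86 and 38 bins, and for 48 mel bands 43 and 19 bins.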
7 Evaluation Metrics
CNN models for auto-tagging output continuous activation values within [0, 1] for each tag, and therefore we can study the performance of binary classification under different activation thresholds. To this end, following previous works [14, 15, 3], we use the Receiver Operating Characteristic Area Under Curve (ROC AUC) averaged across tags as our performance metric. We also report the Precision-Recall Area Under Curve (PR AUC), because previous studies have shown that ROC AUC can give over-optimistic scores when the data is unbalanced, which is our case. Both ROC AUC and PR AUC are single-value measures characterizing the overall performance, which makes it easy to compare multiple systems.
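To make the two metrics concrete, the sketch below computes tag-averaged ROC AUC and a step-wise approximation of PR AUC (average precision) in plain Python. The toy labels and scores are made up for illustration, and the function names are hypothetical; in practice a library implementation would normally be used instead.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise approximation of PR AUC; assumes no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

def macro_average(metric, label_matrix, score_matrix):
    """Average a per-tag metric across tags (columns).

    Each tag needs at least one positive and one negative example.
    """
    n_tags = len(label_matrix[0])
    per_tag = [
        metric([row[t] for row in label_matrix], [row[t] for row in score_matrix])
        for t in range(n_tags)
    ]
    return sum(per_tag) / n_tags

# Toy example (made-up values): 4 tracks, 2 tags.
labels = [[1, 0], [0, 0], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.8, 0.1], [0.7, 0.6], [0.1, 0.3]]

print("ROC AUC:", macro_average(roc_auc, labels, scores))
print("PR AUC (average precision):", macro_average(average_precision, labels, scores))
```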
To measure the computational cost of training and inference, we use an estimate of the number of multiply-accumulate (MAC) operations required by a network to process one batch (1 GMAC equals one billion MAC operations). This metric is related to the time a model requires for training and inference. We use an online tool (https://dgschwend.github.io/netscope/quickstart.html) to compute approximate MAC values for our architectures.
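As a rough illustration of how the MAC count depends on the input size, the sketch below estimates the MACs of the four convolutional layers of VGG-CNN for a single example, using the standard formula (output height × output width × kernel area × input channels × output channels). It assumes size-preserving convolutions and floor-division pooling with the pooling sizes listed for the 12 KHz configurations; it is not the output of the online tool mentioned above, and the function names are hypothetical.

```python
def conv_macs(height, width, c_in, c_out, kernel=3):
    """MACs of one size-preserving 2D convolution with a square kernel."""
    return height * width * kernel * kernel * c_in * c_out

def vgg_cnn_macs(n_mels, n_frames, freq_pools, time_pools,
                 channels=(128, 384, 768, 2048)):
    """Estimate MACs for one example, counting only the convolutional layers."""
    height, width, c_in, total = n_mels, n_frames, 1, 0
    for c_out, pool_f, pool_t in zip(channels, freq_pools, time_pools):
        total += conv_macs(height, width, c_in, c_out)
        # Max pooling after each convolution (floor division, no padding).
        height, width, c_in = height // pool_f, width // pool_t, c_out
    return total

baseline = vgg_cnn_macs(96, 1366, (2, 4, 3, 4), (4, 5, 8, 8))
half_mels = vgg_cnn_macs(48, 1366, (2, 4, 3, 2), (4, 5, 8, 8))
half_time = vgg_cnn_macs(96, 683, (2, 4, 3, 4), (4, 5, 8, 4))

print("baseline GMAC per example:", baseline / 1e9)
print("48 mel bands / baseline:", half_mels / baseline)
print("x2 hop size / baseline:", half_time / baseline)
```

Under these assumptions, halving either the number of mel bands or the number of frames roughly halves the estimated cost.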
We evaluated the considered mel-spectrogram settings on the adjusted CNN models. Full results for all evaluated configurations are available online (https://andrebola.github.io/ICASSP2020/results). In Figure 1 we show the results of the evaluation for VGG-CNN on the MTAT dataset, repeated three times for each configuration. The first two plots show the ROC AUC results for the 12 KHz and 16 KHz sample rates using the log and dB scaling. Similarly, the third and fourth plots show the PR AUC results under the same conditions. The last plot shows GMAC.
The results show that, with some of the settings, we can reduce the size of the input in frequency and time without much effect on the performance of VGG-CNN on the MTAT dataset. For example, if we reduce the frequency resolution from 96 to 48 mel bands, we can reduce the MAC operations by nearly 50% without affecting the performance in all configurations. Similarly, we can also reduce the time resolution by 50% without affecting performance, and in this case we also reduce the MAC operations by 50% in all configurations. We can further reduce the number of operations at the cost of some decrease in performance. This can be especially useful for applications requiring lightweight models, as we can get a model that is 10× faster by sacrificing between 1.4% and 1.8% of the performance, depending on the configuration. Interestingly, both ROC AUC and PR AUC slightly improve when using 48 mel bands compared to 96 bands in most cases; however, no statistically significant difference was found for any of the corresponding configurations in an independent-samples t-test.
For the MUSICNN model, we have tested some of the configurations, reported in Table 4. We only considered the frequency resolution reduction to 48 mel bands and no hop size increments due to the significantly slower training time (see Section 4). The results show comparable performance for 96- and 48-band mel-spectrograms and are consistent with the above-mentioned findings for the VGG-CNN model. Overall, using 128 mel bands provided the best performance. Also, according to the results, the MUSICNN architecture outperforms VGG-CNN, which is consistent with the reports from the authors.
To check how our findings scale, we selected a number of configurations and re-evaluated the models on the MSD dataset. The results are reported in Table 5. In the case of VGG-CNN, the performance of the baseline architectures is slightly superior to that of the models working with lower-resolution mel-spectrograms, which comes at the cost of a significantly larger computational effort. For example, for the 12 KHz sample rate, ×1 hop size and dB compression settings, reducing the number of mel bands from 96 to 48 results in a decrease of 0.16% in ROC AUC and a 50% reduction in GMACs. For a similar 16 KHz/dB case, the reduced model has the same performance at half the computational cost.
In the case of the MUSICNN architecture, we see a performance reduction of 0.19% when comparing 96 with 48 mel bands at the 12 KHz sample rate, and of 0.11% at 16 KHz.
|# mels||sample rate||ROC AUC||PR AUC|
In this paper we have studied how different mel-spectrogram representations affect the performance of CNN architectures for music auto-tagging. We have compared the performance of two state-of-the-art models when reducing the mel-spectrogram resolution in terms of the number of frequency bands and the frame rate. We used the MagnaTagATune dataset for comprehensive performance comparisons and then compared selected configurations on the larger Million Song Dataset. The results suggest that it is possible to preserve similar performance while reducing the size of the input. They can help researchers and practitioners to make trade-off decisions between model accuracy, data storage size, and training and inference time, which may be crucial in a number of applications.
As future work, other approaches for reducing the dimensionality and size of the input data will be considered, for example quantization of mel-spectrogram values. The conducted evaluation can also be extended to other state-of-the-art architectures. It is also promising to conduct a similar evaluation on other audio auto-tagging tasks. All the code to reproduce this study is open-source and available online (https://andrebola.github.io/ICASSP2020/).
[1] (2011) The million song dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR).
[2] (2013) ESSENTIA: an audio analysis library for music information retrieval. In Proceedings of the 14th International Society for Music Information Retrieval Conference (ISMIR), pp. 493–498.
[3] (2016) Automatic tagging using deep convolutional neural networks. In Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR), pp. 805–811.
[4] (2017) A tutorial on deep learning for music information retrieval. arXiv preprint arXiv:1709.04396.
[5] (2018) The effects of noisy labels on deep convolutional neural networks for music tagging. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2), pp. 139–149.
[6] Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2392–2396.
[7] (2018) A comparison of audio signal preprocessing methods for deep neural networks on music tagging. In 2018 26th European Signal Processing Conference (EUSIPCO), pp. 1870–1874.
[8] The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240.
[9] (2014) End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968.
[10] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[11] (2009) Evaluation of algorithms using games: the case of music tagging. In Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), pp. 387–392.
[12] (2017) Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. In Proceedings of the 14th Sound and Music Computing Conference (SMC).
[13] (2018) Multimodal deep learning for music genre classification. Transactions of the International Society for Music Information Retrieval 1 (1), pp. 4–21.
[14] Multi-label music genre classification from audio, text and images using deep features. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), pp. 23–30.
[15] (2018) End-to-end learning for music audio tagging at scale. In Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), pp. 637–644.
[16] (2017) Designing efficient architectures for modeling temporal features with convolutional neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2472–2476.
[17] (2017) Timbre analysis of music audio signals with convolutional neural networks. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 2744–2748.
[18] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[19] (1998) Auditory toolbox. Interval Research Corporation, Tech. Rep. 10.
[20] (2012) A survey of evaluation in music genre recognition. In International Workshop on Adaptive Multimedia Retrieval, pp. 29–66.