Music classification and style tagging is a traditional problem in music information retrieval (MIR) that entails predicting tags of a song such as genre, emotion, era, and instrumentation. The two main types of neural networks used in music classification are convolutional neural networks (CNNs) and recurrent neural networks (RNNs), both of which were originally designed for problems in image and language processing. These techniques have nevertheless proved robust enough to transfer successfully to other applications such as music tagging.
State-of-the-art computer vision algorithms use a technique called the feature pyramid network (FPN), which carries low-level features from lower layers to higher layers in a neural network. Concatenating these multi-scale features leads to improved accuracy (Kirillov et al., 2019; Lin et al., 2017; Seferbekov et al., 2018; He et al., 2017). Furthermore, U-Net is another architecture for image segmentation that consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. With a Mel-spectrogram as input to a CNN, music classification resembles the image classification task. We propose to use a multi-scale feature extraction structure in music classification, and we show that it improves the accuracy of the model.
Many music tagging algorithms use spectrograms as input (Choi et al., 2016; Pons et al., 2017b; Pons & Serra, 2017; Choi et al., 2017), while others use raw audio files and extract relevant features on their own (Dieleman & Schrauwen, 2014; Lee et al., 2017; Zhu et al., 2016). The successful performance of CNNs in computer vision tasks led audio researchers to combine Mel-spectrograms with CNNs. Dieleman et al. compared the use of spectrograms and raw audio and reported better performance with spectrograms (Dieleman & Schrauwen, 2014). In this work, we focus on spectrograms as input to a CNN and build a new model upon currently available architectures to achieve improved performance.
2 Development of the architecture
In this study, we developed our architecture based on the fully convolutional networks (FCN) proposed in (Choi et al., 2016). We aim to improve their model by adding intermediate connections to the structure. In our proposed model, we add connections between every two convolutional layers in the FCN architecture, which allows the features from these layers to be transferred to deeper layers, combining low-level features with high-level features.
2.1 Fully convolutional network
Choi et al. (Choi et al., 2016) proposed FCNs with different numbers of layers for music tagging, and Figure 1 shows the five-layer version (FCN-5) of their model. We selected this version because it showed the best results when applied to both the MagnaTagATune and Million Song datasets. It consists of five convolutional and max-pooling layers. The input to this model is a Mel-spectrogram; more details are shown in Figure 1. The pooling size increases at each level to grow the receptive field gradually (code: https://github.com/keunwoochoi/music-auto_tagging-keras).
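As a rough illustration of how stacked pooling grows the receptive field, the cumulative pooling stride can be traced through the layers. The per-layer pool sizes below are assumed for illustration only; the actual FCN-5 settings are shown in Figure 1.

```python
# Trace the cumulative stride (receptive field along one axis) through a
# stack of conv + max-pool layers with increasing pool sizes.
# Pool sizes are illustrative assumptions, not the paper's exact values.
pool_sizes = [2, 2, 2, 3, 4]  # assumed per-layer pooling factor (one axis)

stride = 1  # how many input frames one output unit spans so far
for i, p in enumerate(pool_sizes, start=1):
    stride *= p
    print(f"after layer {i}: one unit spans {stride} input frames")
```

With these assumed factors, a single unit at the deepest level spans the product of all pooling factors (here 96 input frames), which is why later layers respond to long-term structure rather than frame-level timbral detail.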
2.2 Multi-scale Embedded CNN (MsE-CNN)
We propose that, by utilizing intermediate connections in our CNN architecture, we can carry important multi-scale features to the last layer for improved classification. We claim that such an approach will allow the model to learn low-level features such as musical texture and timbre as well as high-level temporal characteristics to improve the classification.
Figure 2 demonstrates our proposed architecture. At each level, we preserve the features before passing them through a convolutional layer, then pass them through a max-pooling operation and concatenate them with the output of the convolution. At the output level, we therefore have almost eight times as many features as FCN-5, while the increase in time complexity is negligible.
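The channel arithmetic of this concatenation scheme can be sketched as follows. The per-level convolution widths used here are assumed purely for illustration and are not taken from the paper; the point is only that the pooled skip features accumulate across levels instead of being discarded.

```python
# Back-of-the-envelope channel count for the concatenation scheme: at
# each level, the max-pooled skip features are concatenated with the
# conv output before moving to the next level.
# The conv widths below are illustrative assumptions.
conv_channels = [128, 128, 128, 128, 128]  # assumed width of each conv layer

channels = 1  # the input Mel-spectrogram has a single channel
for c in conv_channels:
    # pooled skip features (channels) + new conv output (c) are stacked
    channels = channels + c
print(f"feature channels reaching the output level: {channels}")
```

Under these assumptions the classifier sees the sum of all level widths rather than only the last layer's output, which is the mechanism behind the feature-count increase reported above.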
3 Experiment description
In this study we used the MagnaTagATune (MTT) dataset (Law et al., 2009) to tag songs and evaluated the performance of the model using both the Area Under the Receiver Operating Characteristic curve (ROC-AUC) and the Area Under the Precision-Recall Curve (PR-AUC). These metrics have been used extensively in the literature (Nam et al., 2015; Dieleman & Schrauwen, 2014; Van Den Oord et al., 2014; Dieleman & Schrauwen, 2013; Hamel et al., 2011) and allow a fair comparison of our model, MsE-CNN, with other architectures.
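Both metrics can be computed directly from per-tag scores. A minimal sketch on a made-up toy tag, using the rank (Mann-Whitney) form of ROC-AUC and average precision as a simple stand-in for PR-AUC:

```python
# Toy computation of the two evaluation metrics for a single tag.
# Labels and scores below are invented for illustration.
def roc_auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # probability that a random positive outranks a random negative
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for i, (_, l) in enumerate(ranked, start=1):
        if l == 1:
            hits += 1
            total += hits / i  # precision at each recall point
    return total / hits

labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(roc_auc(labels, scores), average_precision(labels, scores))
```

In practice the metric is computed per tag and averaged across tags; library implementations (e.g. in scikit-learn) give the same quantities.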
| Model | ROC-AUC | PR-AUC |
| FCN-4 (a, b) | 0.894 | 0.376 |
| End-to-end learning (a) | 0.904 | 0.381 |
| Timbre CNN (a, b) | 0.893 | 0.349 |
| Bag of features and RBM (a) | 0.888 | - |
| 1D convolutions (a) | 0.882 | - |
| Transfer learning (a) | 0.88 | - |
| Multi-scale approach (a) | 0.898 | - |
| Pooling MFCC (a) | 0.861 | - |

(a) Results taken from (Choi et al., 2016; Pons et al., 2017a, b; Nam et al., 2015; Dieleman & Schrauwen, 2014; Van Den Oord et al., 2014; Dieleman & Schrauwen, 2013; Hamel et al., 2011).
(b) The PR-AUC results are based on a reproduced version of the algorithm in (Pons et al., 2017a).
Using spectral representations of raw audio for music tagging yields significant improvement when combined with CNN models. However, because the timbral characteristics of sounds change over time in complex ways, audio classification is considered different from image classification. The change of timbral quality over time affects our perception of music, which is critical to understanding genre, mood, and instrumentation. We propose that in a spectral representation of music, timbre is equivalent to texture and color in an image, while long-term temporal structures are equivalent to shapes such as an eye, a hand, or a nose. As in a vision task, learning musical textures and timbre as well as long-term characteristics is crucial in audio classification.
We believe a traditional CNN model learns long-term structures but forgets timbral features by the final classification step. Similar to the intermediate connections in U-Net and FPN, one can improve music classification by transferring low-level characteristics via such connections. Our model, MsE-CNN, is an experiment supporting the idea that timbral fluctuation over time is disregarded in the deeper layers of a CNN. Because of the larger receptive field in later layers, the model starts to forget low-level features that carry textural details, which are crucial for audio classification.
Our experiment is primarily a feasibility test supporting our assumption; hence, it can be studied in greater detail to find the optimal architecture. Such an approach can be extended to a variety of applications, including timbre analysis, instrument classification, audio clustering, and automatic music structure segmentation. In future work, we aim to address a variety of MIR tasks to propose an optimal framework for the proposed training procedure.
- Choi et al. (2016) Choi, K., Fazekas, G., and Sandler, M. Automatic tagging using deep convolutional neural networks. arXiv preprint arXiv:1606.00298, 2016.
- Choi et al. (2017) Choi, K., Fazekas, G., Sandler, M., and Cho, K. Convolutional recurrent neural networks for music classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2392–2396. IEEE, 2017.
- Dieleman & Schrauwen (2013) Dieleman, S. and Schrauwen, B. Multiscale approaches to music audio feature learning. In 14th International Society for Music Information Retrieval Conference (ISMIR-2013), pp. 116–121. Pontifícia Universidade Católica do Paraná, 2013.
- Dieleman & Schrauwen (2014) Dieleman, S. and Schrauwen, B. End-to-end learning for music audio. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968. IEEE, 2014.
- Hamel et al. (2011) Hamel, P., Lemieux, S., Bengio, Y., and Eck, D. Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In ISMIR, pp. 729–734, 2011.
- He et al. (2017) He, K., Gkioxari, G., Dollár, P., and Girshick, R. Mask r-cnn. ICCV, 2017.
- Kirillov et al. (2019) Kirillov, A., Girshick, R., He, K., and Dollár, P. Panoptic feature pyramid networks. CVPR, 2019.
- Law et al. (2009) Law, E., West, K., Mandel, M. I., Bay, M., and Downie, J. S. Evaluation of algorithms using games: The case of music tagging. In ISMIR, pp. 387–392, 2009.
- Lee et al. (2017) Lee, J., Park, J., Kim, K. L., and Nam, J. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. arXiv preprint arXiv:1703.01789, 2017.
- Lin et al. (2017) Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125, 2017.
- Nam et al. (2015) Nam, J., Herrera, J., and Lee, K. A deep bag-of-features model for music auto-tagging. arXiv preprint arXiv:1508.04999, 2015.
- Pons & Serra (2017) Pons, J. and Serra, X. Designing efficient architectures for modeling temporal features with convolutional neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2472–2476. IEEE, 2017.
- Pons et al. (2017a) Pons, J., Nieto, O., Prockup, M., Schmidt, E., Ehmann, A., and Serra, X. End-to-end learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520, 2017a.
- Pons et al. (2017b) Pons, J., Slizovskaia, O., Gong, R., Gómez, E., and Serra, X. Timbre analysis of music audio signals with convolutional neural networks. In 2017 25th European Signal Processing Conference (EUSIPCO), pp. 2744–2748. IEEE, 2017b.
- Seferbekov et al. (2018) Seferbekov, S. S., Iglovikov, V. I., Buslaev, A. V., and Shvets, A. A. Feature pyramid network for multi-class land segmentation. CVPR, 2018.
- Van Den Oord et al. (2014) Van Den Oord, A., Dieleman, S., and Schrauwen, B. Transfer learning by supervised pre-training for audio-based music classification. In Conference of the International Society for Music Information Retrieval (ISMIR 2014), 2014.
- Zhu et al. (2016) Zhu, Z., Engel, J. H., and Hannun, A. Learning multiscale features directly from waveforms. arXiv preprint arXiv:1603.09509, 2016.