Tag words that describe music audio differ in their level of abstraction. Taking this into account, we propose a music classification approach that aggregates multi-level and multi-scale features using pre-trained feature extractors. In particular, the feature extractors are sample-level deep convolutional neural networks trained on raw waveforms. We show that this approach achieves state-of-the-art results on several music classification datasets.
Learning hierarchical audio representations for music classification in an end-to-end manner is a challenge due to the diversity of music description words. In this study, we combine two previously proposed methods to tackle the problem.
Music classification tasks, particularly music auto-tagging, involve a wide variety of labels describing genre, mood, instruments, and other song characteristics. To address the different levels of abstraction that these labels carry, we recently proposed an approach that aggregates audio features extracted in a multi-level and multi-scale manner (Lee & Nam, 2017). The method consists of three steps: feature extraction using pre-trained convolutional neural networks (CNNs), feature aggregation, and song-level classification. The CNNs are trained in a supervised manner with the tag labels, taking input frames of different sizes. The feature aggregation step extracts features at multiple levels using the pre-trained CNNs and summarizes them into a single song-level feature vector. The last step predicts the tags from the aggregated features using a fully-connected neural network. This multi-step architecture has the advantage of capturing both local and global characteristics of a song, and it fits well with transfer learning. However, our previous approach used mel-frequency spectrograms as input, which build in prior knowledge of pitch perception.
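The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `fake_cnn` stands in for a pre-trained CNN, mean/std pooling stands in for the aggregation step, and a single sigmoid layer stands in for the tag-prediction network; all dimensions are hypothetical.

```python
import numpy as np

def extract_features(segments, cnn):
    """Step 1: run a (pre-trained) CNN over fixed-size input frames.
    `cnn` is a stand-in callable mapping a frame to a feature vector."""
    return np.stack([cnn(s) for s in segments])

def aggregate(features):
    """Step 2: summarize frame-level features into one song-level vector
    (mean and standard deviation pooling shown as one common choice)."""
    return np.concatenate([features.mean(axis=0), features.std(axis=0)])

def song_level_classifier(x, W, b):
    """Step 3: a single fully-connected layer with sigmoid outputs,
    a minimal stand-in for the final tag-prediction network."""
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

# Toy run: 10 frames of 128-dim CNN features, 50 tags (illustrative sizes).
rng = np.random.default_rng(0)
fake_cnn = lambda frame: rng.standard_normal(128)   # placeholder "pre-trained CNN"
segments = [np.zeros(512) for _ in range(10)]       # placeholder audio frames
song_vec = aggregate(extract_features(segments, fake_cnn))  # shape (256,)
probs = song_level_classifier(song_vec,
                              rng.standard_normal((50, 256)), np.zeros(50))
```

Each tag probability is predicted independently, matching the multi-label nature of auto-tagging.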
Table 1:

| Methods | Configuration | GTZAN (acc.) | MTAT (AUC) | TAGTRAUM (acc.) | MSD (AUC) |
|---|---|---|---|---|---|
| (Lee & Nam, 2017) | | | | | |
| (Lee et al., 2017) | Sample-level DCNN ( model) | - | 0.9055 | - | 0.8812 |
| Proposed work (features from sample-level DCNN, model) | -3 layer (pre-trained with MSD) | 0.778 | 0.8988 | 0.760 | 0.8831 |
| | -2 layer (pre-trained with MSD) | 0.811 | 0.8998 | 0.768 | 0.8838 |
| | -1 layer (pre-trained with MSD) | 0.821 | 0.8976 | 0.768 | 0.8842 |
| | last 3 layers (pre-trained with MSD) | 0.805 | 0.9018 | 0.768 | 0.8842 |

Table 2 (MTAT only):

| Models combined | MTAT (AUC) |
|---|---|
| , , and models | 0.9061 |
| , , , , , , and models | 0.9064 |
We recently investigated the possibility of using raw waveforms as input to deep convolutional neural networks (DCNNs) for music auto-tagging (Lee et al., 2017). The networks were configured to take very small grains of the waveform, as few as 2 or 3 samples, in the bottom-level filters. The results show that this "sample-level" representation learning works well and that the learned filters become sensitive to log-scaled frequency as the layers go deeper, much like a mel-frequency spectrogram.
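The sample-level front end can be sketched with a plain strided 1-D convolution: filters only a few samples wide are applied directly to the raw waveform and repeatedly downsampled by the stride. This is a hedged illustration, not the paper's exact configuration; the filter widths, strides, and channel counts below are made up for the example.

```python
import numpy as np

def conv1d(x, W, stride):
    """Valid strided 1-D convolution with ReLU over a multi-channel signal.
    x: (channels_in, length); W: (channels_out, channels_in, width)."""
    c_out, c_in, width = W.shape
    n = (x.shape[1] - width) // stride + 1
    out = np.empty((c_out, n))
    for i in range(n):
        patch = x[:, i * stride : i * stride + width]
        # contract over input channels and filter width, then ReLU
        out[:, i] = np.maximum(
            np.tensordot(W, patch, axes=([1, 2], [0, 1])), 0.0)
    return out

# Filters only 3 samples wide applied directly to raw audio,
# with stride-3 downsampling at every layer (illustrative sizes).
rng = np.random.default_rng(0)
waveform = rng.standard_normal((1, 3**5))                 # 243 raw samples
h = conv1d(waveform, 0.1 * rng.standard_normal((16, 1, 3)), stride=3)
h = conv1d(h, 0.1 * rng.standard_normal((32, 16, 3)), stride=3)
h = conv1d(h, 0.1 * rng.standard_normal((64, 32, 3)), stride=3)
```

Each stride-3 layer shrinks the time axis by a factor of 3 (243 → 81 → 27 → 9), which is how a deep stack of tiny filters covers a long waveform context.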
In this study, we combine the two methods to take advantage of both. As illustrated in Figure 1, we use the top three hidden layers of the sample-level DCNNs for multi-level feature extraction, and the DCNNs take inputs of different sizes to cover multiple time scales.
We validate the effectiveness of the proposed method on datasets of different sizes for genre classification and auto-tagging. The details are as follows (splits available at https://github.com/jongpillee/music_dataset_split):
MagnaTagATune (MTAT) (Law et al., 2009): 21105 songs, auto-tagging (50 tags)
Million Song Dataset with Tagtraum genre annotations (TAGTRAUM, stratified split with 80% training data of CD2C version) (Schreiber, 2015): 189189 songs, genre classification (15 genres)
Million Song Dataset with Last.FM tag annotations (MSD) (Bertin-Mahieux et al., 2011): 241889 songs, auto-tagging (50 tags)
We report the average of 10 runs for each experiment. As Table 1 shows, although the proposed method does not outperform the best previous results on MSD and MTAT, the multi-level and multi-scale aggregation generally improves performance, and the improvement is particularly large on GTZAN. In Table 2, where only MTAT is used, the proposed method is superior to both previous works. Furthermore, we visualize the features at different levels for selected tags in Figure 2. Songs with a genre tag (Techno) cluster more tightly in the higher layer (-1 layer), whereas songs with an instrument tag (Piano) cluster more tightly in the lower layer (-3 layer). This suggests that the optimal layer of feature representation may differ depending on the type of label. Overall, these results show that the proposed feature aggregation method is also effective with sample-level DCNNs.