Automatic music genre classification (AMGC) is the task of labelling music tracks according to their genre (e.g., rock, reggae, and classical) [tzan2002]. Audio-based music genre classifiers have received many contributions in the last two decades, including the development of new texture descriptors [tao2003, peeters2004, lidy2005, manaris2008] and novel neural network architectures [choi2017, yandre2017, yu2019]. They all rely on the assumption that each music genre can be characterised by its use of specific techniques and instruments, which lead to genre-specific sets of auditory textures [tzan2002]. For this reason, most music genre classifiers map each musical track from a collection to a point within a feature-based vector space whose topology can represent the perceptual similarities between tracks.
Earlier work on AMGC modelled track similarity using features inspired by domain knowledge. Such handcrafted feature sets have been used as the basis for a wide range of automatic music genre classifiers [tzan2002, tao2003, peeters2004, lidy2005, manaris2008, shin19, kobayashi18]. However, their development involves considerable effort in modelling perceptual or musical characteristics of audio signals.
This drawback has been addressed in the last few years by contributions towards feature learning using deep neural networks [nanni2016, choi2016, yandre2017, oramas17, choi17b]. These methods rely on learning features that optimally correlate with the target labels. This process demands a large amount of computational power, which leads to higher hardware requirements.
In this paper, we evaluate a different paradigm for generating low-level features, namely the use of random projections of Mel-spectrograms. To this end, we use a simple classification pipeline that begins with low-level feature extraction, proceeds to feature aggregation, and ends with vector classification. This pipeline was executed over five publicly available datasets, showing that, in commercial music datasets, random projections result in performance comparable to that of learned features and outperform both handcrafted features and features obtained by transfer learning techniques.
These results indicate that random projections can be used in media organisation problems related to user customisation. In these problems, it can be infeasible to execute feature learning techniques due to processing power and energy constraints. These constraints can be overcome by using random projections, leading to consistent results while requiring significantly lower processing power.
Random projections have been used in previous work on automatic genre classification, specifically in Extreme Learning Machines (ELMs) during the vector classification stage [Scardapane2014, loh2006, baniya2015]. Also, work by Chang et al. [chang2010] has proposed using random projections of handcrafted features as the basis for automatic music genre classification. Chang et al.’s [chang2010] work was later commented on by Sturm [sturm2013], who showed that the random projections do not add classification power to the handcrafted features themselves.
A similar idea, studied by Choi [choi17b], is to set the weights of a convolutional neural network to random numbers. This made it possible to highlight a small performance gain obtained by learning features from other domains.
This work differs from both ELMs [Scardapane2014, loh2006, baniya2015] and Chang et al.’s [chang2010] work because it proposes using random projections to generate the low-level features that are later fed to a vector classification pipeline. Also, unlike Choi’s [choi17b] work, it is not bound to a specific classification algorithm, so it can be immediately applied in future research.
The results obtained in this work show that the low-level random projection features lead to better classification results than the original Mel-Spectrogram. This indicates that this usage of random projections is not harmed by the effects detected by Sturm [sturm2013].
Finally, we evaluated using learned features in a cross-dataset scenario. In these tests, the results were highly degraded, and the proposed random projections outperformed the learned features.
The remainder of this paper is organised as follows. Section 2 presents a brief overview of the random projection theory applied to classification. Section 3 presents our approach to evaluate random projections of Mel-Spectrograms as features for AMGC. Section 4 presents the datasets used in the experiments, along with some remarks regarding fold creation procedures. Section 5 presents results and discussions on the performance of random features in five different datasets and how they compare to other feature sets. Finally, Section 6 presents some closing remarks.
2 Random Projections
The effectiveness of random projections for dimensionality reduction is well known in the machine learning literature [baraniuk2010]. The Johnson-Lindenstrauss (JL) lemma [johnson1984] states that a random matrix $\Phi \in \mathbb{R}^{M \times N}$, where $M < N$, projects a matrix $X \in \mathbb{R}^{N \times Q}$ into a stable embedding $Y = \Phi X$ with high probability if $M = O(\epsilon^{-2} \log Q)$ [johnson1984]. Figure 1 shows this transformation. In a machine learning scenario, $Q$ is analogous to the number of feature vectors, $N$ is the original dimensionality, and $M$ is the dimensionality of the embedding. The constant $\epsilon$ controls the distortion introduced by the random transformation. It follows that the more distortion (higher $\epsilon$) is allowed, the smaller $M$ can be.
When data is projected into a stable embedding, the distances among the points in the embedding are preserved in relation to the data in the original domain [baraniuk2010]. Figure 1 shows how the relative distances among all points are preserved. This property enables vector-based classification to be performed in the embedding domain [davenport2007]. In addition, the dimension reduction caused by the projection (when $M < N$) reduces the computational effort and alleviates the curse of dimensionality.
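The distance-preserving behaviour described above can be checked numerically. The following Python sketch is illustrative only and is not part of the experiments reported in this paper: it draws 50 synthetic points in 1000 dimensions, projects them to 300 dimensions with a Gaussian random matrix, and compares pairwise distances before and after the projection. The sizes, the seed, the function names, and the 1/sqrt(target_dim) scaling of the matrix are assumptions made for this example.

```python
import numpy as np


def random_projection(points, target_dim, rng):
    """Project the rows of `points` (one point per row) to `target_dim` dimensions."""
    n_features = points.shape[1]
    # Gaussian entries; the 1/sqrt(target_dim) scaling (an assumption of this
    # sketch) keeps squared distances unbiased instead of inflating them.
    phi = rng.standard_normal((target_dim, n_features)) / np.sqrt(target_dim)
    return points @ phi.T


def pairwise_distances(points):
    """Euclidean distances between all distinct pairs of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)
    return dists[upper]


rng = np.random.default_rng(0)
original = rng.standard_normal((50, 1000))         # 50 points, 1000 dimensions
embedded = random_projection(original, 300, rng)   # 300-dimensional embedding

# Ratios close to 1 mean that pairwise distances survived the projection.
ratios = pairwise_distances(embedded) / pairwise_distances(original)
print(ratios.min(), ratios.max())
```

Without the scaling factor, all distances are stretched by roughly the same constant, so the relative distances that matter for vector-based classification are affected in the same way.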
Baraniuk et al. [Baraniuk2008] have shown a link between the JL lemma and the Restricted Isometry Property (RIP) of compressive sensing theory. They state that random matrices satisfy the RIP as a consequence of the JL lemma. Also, the RIP guarantees that a sparse signal can be recovered from its sampled form using fewer components than the lower bound introduced by the JL lemma [Calderbank2009, candes2008].
Mel-Spectrograms can be considered sparse, because many components of each frame are close to zero. Hence, Mel-Spectrograms are fit for compression into a lower-dimensional space using random projections. Also, since the final objective is classification rather than signal reconstruction, it can be expected [lohit2015] that fewer dimensions can be used in the embedding domain than the lower bound presented by Candes and Wakin [candes2008].
3 Proposed Method
The classification method used in our work consists of mapping each track to a vector representation. This representation aims at preserving auditory similarities, that is, tracks that sound similar should be mapped to vectors that are close to each other in feature space. To this end, we used four different low-level feature sets.
All feature sets were calculated using a ms STFT with a Hanning window and % overlap between subsequent windows. The first- and second-order differentials of each feature were also calculated. Then, each feature was aggregated into texture-level frames using the mean and variance calculated over a s-long sliding window. Finally, the mean and variance of these means and variances were used as features for classification.
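The aggregation steps above can be sketched in Python as follows. This is a minimal illustration rather than the code used in the experiments: the function name, the use of simple first differences as differentials, the one-frame hop between texture windows, and the window length of 40 frames are all assumptions made for the example. It shows why a frame-level feature set with D dimensions yields a track-level vector with 12 D dimensions.

```python
import numpy as np


def track_vector(frames, texture_len):
    """Aggregate frame-level features (T x D) into one track-level vector (12 * D)."""
    # First- and second-order differentials along time (assumed to be simple
    # first differences), padded so that T frames are kept.
    delta = np.diff(frames, axis=0, prepend=frames[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    stacked = np.hstack([frames, delta, delta2])                 # T x 3D

    # Mean and variance over each sliding texture window (hop of one frame).
    windows = np.lib.stride_tricks.sliding_window_view(stacked, texture_len, axis=0)
    textures = np.hstack([windows.mean(axis=-1), windows.var(axis=-1)])  # W x 6D

    # Mean and variance of the texture-level means and variances.
    return np.concatenate([textures.mean(axis=0), textures.var(axis=0)])  # 12D


# Synthetic stand-in for the frame-level features of one track.
frames = np.random.default_rng(0).standard_normal((200, 20))
vec = track_vector(frames, texture_len=40)
print(vec.shape)
```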
The first feature set comprised a subset of the MARSYAS features [tzan2002], which included Energy, Spectral Centroid, Spectral Rolloff, Spectral Flatness, Spectral Flux, Zero-Crossing Rate, and the first 20 MFCCs. These features were handcrafted based on domain-specific knowledge.
We also evaluated features learned using an auto-encoder over the Mel-scale spectrogram (MEL-AE), taking the activations of the bottleneck layer as features. The architecture consisted of an input layer with one input for each MEL-SPEC bin, followed by a fully-connected layer with units using ReLU as the activation function. The final layer is a fully-connected layer with a linear activation function, containing one unit for each MEL-SPEC bin. The auto-encoder network was trained using Nesterov momentum with learning rate and momentum , samples per batch for a maximum of epochs, using an early stopping criterion of epochs with no improvement. is a parameter which we tested using units.
This procedure was also used to generate features used in transfer learning settings. In these settings, we trained the auto-encoder using a dataset and then performed classification experiments using a different dataset. The training and testing procedures comprised the GTZAN and LMD datasets, because they contain tracks with non-overlapping labels.
Our proposal, MEL-RP, consists of using a random projection of the Mel-scale spectrogram as a feature set. The projection matrix was drawn element-wise from a Gaussian distribution with zero mean and unit variance. The target dimensionality was tested for .
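A minimal Python sketch of this projection is given below, assuming the Mel-spectrogram is stored as a matrix with one frame per row. The function name, the seed handling, the synthetic input, and the target dimensionality of 32 are assumptions made for illustration and are not taken from the experiments.

```python
import numpy as np


def mel_rp(mel_spectrogram, target_dim, seed=0):
    """Project each frame of a (T x n_mels) Mel-spectrogram to target_dim dimensions."""
    rng = np.random.default_rng(seed)
    n_mels = mel_spectrogram.shape[1]
    # Entries drawn independently from a Gaussian with zero mean and unit variance.
    projection = rng.standard_normal((n_mels, target_dim))
    return mel_spectrogram @ projection


# Synthetic stand-in for a 128-bin Mel-spectrogram with 500 frames.
mel = np.abs(np.random.default_rng(1).standard_normal((500, 128)))
features = mel_rp(mel, target_dim=32)
print(features.shape)
```

Fixing the seed makes every track go through the same projection matrix, which is required for the resulting features to be comparable across tracks.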
For baseline purposes, we used the 128-bin Mel-Spectrogram (MEL-SPEC) itself as a frame-level feature set. Last, we used a PCA-based feature reduction of the Mel-Spectrogram (MEL-PCA). The target dimensionality for the PCA-based feature reduction was, the same set as the experiments done with the random projection.
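For comparison, a PCA-based reduction of the same kind of input can be sketched as follows. This is an illustrative Python example, not the implementation used in the experiments: the function names, the choice of fitting the PCA on training frames only, the synthetic data, and the target dimensionality of 16 are assumptions.

```python
import numpy as np


def fit_pca(train_frames, target_dim):
    """Estimate the mean and the top principal directions from training frames."""
    mean = train_frames.mean(axis=0)
    # Rows of vt are principal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(train_frames - mean, full_matrices=False)
    return mean, vt[:target_dim].T


def apply_pca(frames, mean, components):
    """Centre the frames with the training mean and project them."""
    return (frames - mean) @ components


rng = np.random.default_rng(0)
# Synthetic stand-in for 128-bin Mel-spectrogram frames.
train = rng.standard_normal((400, 128)) * np.linspace(1.0, 5.0, 128)
mean, components = fit_pca(train, target_dim=16)
reduced = apply_pca(train, mean, components)
print(reduced.shape)
```

Unlike the random projection, this matrix depends on the data, so it has to be estimated before any features can be extracted.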
For classification, we tested a Support-Vector Machine (SVM) and a K-Nearest Neighbors (KNN) classifier. Their hyper-parameters were adjusted with an 80-20 train/validation scheme in the training set. The SVM used an RBF kernel, C was optimised over, and gamma was set to . The KNN had its K parameter optimised over . After the hyper-parameter estimation, the whole training set was used to train the highest-performing model for each classifier.
Before training the classifier, the training set was normalised so that all features have zero mean and unit variance. Test samples were normalised using the parameters estimated on the training set.
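The normalisation step can be sketched in Python as shown below. The function names, the guard against constant features, and the synthetic data are assumptions of this example; the point it illustrates is that the mean and standard deviation are estimated on the training set only and then reused for the test samples.

```python
import numpy as np


def fit_standardiser(train):
    """Estimate per-feature mean and standard deviation on the training set."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # assumption: leave constant features unscaled
    return mean, std


def standardise(samples, mean, std):
    """Apply previously estimated normalisation parameters."""
    return (samples - mean) / std


rng = np.random.default_rng(0)
train = rng.normal(5.0, 3.0, size=(200, 4))
test = rng.normal(5.0, 3.0, size=(50, 4))

mean, std = fit_standardiser(train)
train_n = standardise(train, mean, std)
test_n = standardise(test, mean, std)  # reuses the training-set parameters
```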
The datasets used in the experiments are shown in Table 1. All datasets were resampled to 44100 Hz and mixed into monaural tracks by averaging the stereo signals. The experiments were conducted using specific train-test splits for each dataset, allowing comparison with previous work using the same datasets. Since many datasets have repeated songs from the same artist, we applied an artist filter [pampalk2005] when creating the cross-validation splits for every dataset to prevent modelling artist-specific (instead of genre-specific) characteristics.
|Dataset||Tracks||Classes||Balance||Clip Len.||CV Splits|
Problems with the GTZAN [tzan2002] dataset are well known to the MIR community [sturm2013b]. In this work, we followed the instructions proposed by Sturm [sturm2013b] to minimise these problems. These instructions include correcting mislabelled songs and maximally using the artist filter. The resulting folds are available online at https://github.com/julianofoleiss/gtzan_sturm_filter_3folds_stratified.git to allow scientific reproducibility.
The LMD [silla2008] dataset is also known to have problems regarding artist repetition and the use of entire albums in the repertoire. We used a subset of LMD that addresses these problems by applying the artist filter and an album filter, which prevents songs with the same production characteristics from being both in the training and test sets.
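The idea behind the artist filter can be illustrated with the Python sketch below, which assigns every track of an artist to the same fold. It is not the procedure used to build the folds in this work: the function name, the greedy balancing strategy, and the toy track list are assumptions, and the sketch does not stratify by genre. An album filter can be obtained in the same way by grouping tracks by album instead of by artist.

```python
from collections import defaultdict


def artist_filtered_folds(track_artists, n_folds):
    """Map each track id to a fold so that no artist appears in two folds."""
    by_artist = defaultdict(list)
    for track_id, artist in track_artists.items():
        by_artist[artist].append(track_id)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Place the artists with most tracks first, always into the smallest fold.
    for artist in sorted(by_artist, key=lambda a: (-len(by_artist[a]), a)):
        target = fold_sizes.index(min(fold_sizes))
        for track_id in by_artist[artist]:
            fold_of[track_id] = target
        fold_sizes[target] += len(by_artist[artist])
    return fold_of


# Hypothetical track ids and artist names.
tracks = {"t1": "A", "t2": "A", "t3": "B", "t4": "C", "t5": "C", "t6": "D"}
folds = artist_filtered_folds(tracks, n_folds=3)
print(folds)
```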
We did not use a cross-validation protocol with the ISMIR [ismir2004] dataset because it was published with a train/test split that makes comparison to other works straightforward. We also used the HOMBURG [homburg2005] dataset, which is made up of short clips and is also known for being difficult. Finally, the Extended BALLROOM [marchand2016] dataset was used to test the feature sets with a larger dataset containing subsets of genres that are more similar in relation to their perceptual characteristics.
5 Results and Discussion
The experimental results shown in this section allow comparing the classification performance of random features to learned and handcrafted ones. Also, they highlight the impact of changing the number of features in both the random projections and auto-encoder learning settings. Last, the results regarding transfer learning settings allow comparing random features to transferred features.
Table 2 shows the best results obtained for all datasets and feature sets, which were consistently obtained using the SVM. All the results presented in this paper are weighted F1-scores, that is, the weighted average of per-class F1-scores. It can be seen that, in general, MEL-RP and MEL-AE features perform better than MARSYAS features for all datasets, except in EXBALLROOM.
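For reference, the weighted F1-score can be computed as in the Python sketch below. It assumes that each class is weighted by its number of true instances, which is the usual convention; the function name and the toy labels are illustrative.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1-scores, weighted by the support of each class."""
    support = Counter(y_true)
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        # F1 is defined as zero when the class has no true positives.
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += n_true * f1
    return total / len(y_true)


y_true = ["rock", "rock", "rock", "jazz"]
y_pred = ["rock", "rock", "jazz", "jazz"]
score = weighted_f1(y_true, y_pred)
print(score)
```

In this toy case the per-class F1-scores are 0.8 for rock and 2/3 for jazz, so the weighted score is (3 × 0.8 + 1 × 2/3) / 4.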
Random projections cannot incorporate information from other domains. However, the performance improvement when comparing MEL-SPEC to MEL-RP is consistent. This means that the trade-off between the dimension reduction and the projection distortion was positive for the classification process. Results show that this trade-off could not be achieved by the PCA projection.
Both MEL-AE and MEL-RP features are not necessarily related to musical or auditory characteristics. However, since they were used in a simple, similar classification pipeline, these results reflect their frame-level descriptive capabilities from a machine learning perspective.
|GTZAN||0.49 ± 0.06||0.62 ± 0.05||0.59 ± 0.05||0.68 ± 0.06||0.22 ± 0.07|
|LMD||0.42 ± 0.02||0.77 ± 0.01||0.66 ± 0.03||0.77 ± 0.02||0.26 ± 0.03|
|HOMBURG||0.43 ± 0.03||0.49 ± 0.02||0.41 ± 0.02||0.53 ± 0.04||0.24 ± 0.02|
|EXBALLROOM||0.35 ± 0.03||0.54 ± 0.03||0.67 ± 0.02||0.55 ± 0.03||0.21 ± 0.02|
Even though there is a clear improvement when using MEL-RP over MEL-SPEC in EXBALLROOM, the best results were achieved with MARSYAS features. This can be related to the fact that EXBALLROOM was built using subsets of tracks with a high similarity in their timbre characteristics. Within the dataset, genre is only distinguishable with respect to time-aware descriptions such as rhythm and tempo-related features. Because MEL-RP, MEL-SPEC and MEL-AE features are based solely on timbre characteristics, they are not able to describe time-dependent features.
The impact of the number of features on the classification performance was also measured, as shown in Figure 2. The number of features shown is different from the number of features that were passed to the classifiers, which is 12 times greater (because of the differentials, means and variances).
Figure 2 shows that, in general, performance increases as the number of features rises for both MEL-AE and MEL-RP using SVM and KNN. However, this improvement saturates at around 50 to 100 features.
Except for Extended Ballroom, MEL-AE features achieve the best results in all datasets. However, MEL-RP features lead to comparable results. Also, changing from SVM to KNN in the machine learning pipeline consistently shows a greater impact on the results than changing the feature set from MEL-AE to MEL-RP. Interestingly, for ISMIR and LMD, MEL-RP with KNN performs even better than MEL-SPEC with SVM, which further highlights the relevance of the random projection.
The results regarding transfer learning settings are shown in Table 3. It can be seen that learned features lead to a significant performance drop when used in a shallow-learning scenario. This indicates that in this case features learned from a dataset are not necessarily relevant in other datasets. Also, it can be seen that the performance drop caused by using learned features is larger than the drop related to using random features (as shown in Table 2).
|Target / Source||GTZAN||LMD||GTZAN+LMD|
|GTZAN||0.68 ± 0.06||0.51 ± 0.05||0.53 ± 0.05|
|LMD||0.66 ± 0.04||0.77 ± 0.02||0.66 ± 0.04|
In this work we have introduced random projection of Mel-Spectrograms (MEL-RP) as a feature set in the context of automatic music genre classification. Our results show that MEL-RP achieves results comparable to those obtained using a feature learning approach. MEL-RP, however, has the advantage of not requiring feature learning mechanisms. This reduces computing requirements during training, because the generation of a suitable random matrix is straightforward.
Compared with handcrafted features, MEL-RP has the advantage of requiring less domain-specific knowledge. Additionally, MEL-RP outperforms MARSYAS features in most datasets. However, it leads to worse results in the EXBALLROOM dataset, whose genre classification is highly linked to rhythm properties.
Also, MEL-RP has been shown to perform better than features obtained by transfer learning. This can be due to the fact that switching datasets can lead to a change in the texture distribution. Such a change harms assumptions related to building an auto-encoder, but has no impact on the separation properties of random projections. This indicates that in shallow-learning systems MEL-RP is more suitable for customisation applications than using features transferred from different datasets.