Texture Selection for Automatic Music Genre Classification

by Juliano H. Foleiss, et al.
University of Campinas

Music Genre Classification is the problem of associating genre-related labels to digitized music tracks. It has applications in the organization of commercial and personal music collections. Often, music tracks are described as a set of timbre-inspired sound textures. In shallow-learning systems, the total number of sound textures per track is usually too high, and texture downsampling is necessary to make training tractable. Although previous work has solved this by linear downsampling, no extensive work has been done to evaluate how texture selection benefits genre classification in the context of bag of frames track descriptions. In this paper, we evaluate the impact of frame selection on automatic music genre classification in a bag of frames scenario. We also present a novel texture selector based on K-Means aimed at identifying diverse sound textures within each track. We evaluated texture selection on diverse datasets and four different feature sets, as well as its relationship to a univariate feature selection strategy. The results show that frame selection leads to significant improvement over the single vector baseline on datasets consisting of full-length tracks, regardless of the feature set. Results also indicate that the K-Means texture selector achieves significant improvements over the baseline, using fewer textures per track than the commonly used linear downsampling. The results also suggest that texture selection is complementary to the feature selection strategy evaluated. Our qualitative analysis indicates that texture variety within classes benefits model generalization. Our analysis shows that selecting specific audio excerpts can improve classification performance, and it can be done automatically.






1 Introduction

Genre is a descriptive tag commonly associated with music tracks. It relates to the social groups that participate in the process of producing, marketing and consuming particular music pieces or styles. Music genre also relates to the instruments and techniques used for musical composition, performance, and perception, and this reflects on the spectral patterns that are present in its digital audio recordings. Such patterns can be used to devise mathematical models for sound perception that allow automatically associating genre tags to digital music, that is, automatic Music Genre Classification (MGC) [1]. This task is also referred to as automatic Music Genre Recognition (MGR).

Many MGC systems assume that the spectral patterns that allow predicting genre are typically a few (1 to 5) seconds long. These patterns were described by Tzanetakis and Cook [1] as textures. They were modelled with low-order statistics of low-level features calculated in short-time (around 20 ms) frames [1].

One possible method to use the sequence of textures in a track consists of summarizing them into a single vector [1]. This method, which we call Full Track Statistics (FTS), results in a compact encoding, but can discard texture information that can be useful for genre classification.

Another possibility is to use a collection of textures to represent each track. This allows for richer descriptions of the underlying genre tags because the collections comprise distinct sound textures that are present in each musical piece. Many MGC approaches use collections of textures and disregard their time-domain ordering. These approaches can be seen as variants of the Bag of Frames (BoF) [2]. There are many approaches for combining the textures in classification, including statistical modelling [2], dictionary learning and quantization [3, 4, 5], and independent texture labelling combined by voting procedures [6, 7, 8, 9, 10]. Using diverse textures for classification avoids the problem of discarding information, but increases the computational cost of dataset processing and storage.

In this work, we argue that selecting specific, typical textures from each track is an effective way to reduce the computational workload in storage and processing, while preserving classification accuracy. To this end, we propose selecting texture prototypes from each track using K-Means clustering centroids. These centroids are used in the subsequent classification steps.

We compare the results of our texture selection approach to linear downsampling, which is widely used in previous research [6, 7, 8, 9, 10]. We also investigate how texture selection compares with feature selection. Our results show that, in the datasets employed in this work, K-Means centroid downsampling leads to higher classification accuracy using less data than linear downsampling, and that texture selection in time has a greater impact on results than feature selection. Also, feature-space analysis indicates that this improvement depends on the presence of texture diversity within the tracks, which leads to worse results in datasets containing short (10s-long or 30s-long) excerpts.

Another concern related to texture selection regards the feature set used to represent each texture. There are two typical approaches: using handcrafted features [1] and using automatically obtained, data-driven features [10]. In this work, we evaluate both possibilities and, additionally, we evaluate using random projections of Mel-spectrograms as features. Our results show that K-Means texture selection has a greater impact on accuracy than changing the feature sets in the evaluated datasets.

This paper is organized as follows: Section 2 presents some relevant related research. Section 3 details our proposed texture selection approach. Section 4 details our experimental setup and results. Section 5 presents a qualitative discussion on how our approach contrasts with the commonly used linear downsampling.

2 Related Work

A significant portion of research in MGC uses a single feature vector to represent each track in the dataset. Work by Tzanetakis and Cook [1] uses the mean value of each feature as dimensions for their feature vector. Banyia et al. [11] summarize the textures using higher-order feature statistics and a feature covariance matrix. More recently, work by Choi et al. [12] uses a CNN to extract features from mel-spectrograms and summarizes the output of each layer using average pooling, leading to a single vector representation.

Other studies represent tracks using a collection of textures. Yandre et al. [10] represent each track by slicing its spectrogram into a set of 50 non-overlapping patches that comprise the middle 60s of each track. Each patch is treated independently during both training and testing phases. A final classification is decided by summing the classification probabilities for each patch. This linear slicing setup along with the sum voting rule implicitly assumes that each patch is equally relevant for genre classification. Other related works have also used variants of this linearly-spaced texture selection method [6, 7, 8, 9].

Aucouturier and Pachet [13] presented a representation in which each track is described by a Gaussian Mixture Model (GMM) that models the probability of any frame being associated with the track. Unlike the linear-selection procedures, this representation models the relevance difference between the textures within tracks. Later, Aucouturier et al. [2] popularized the term Bag of Frames, highlighting that this description takes into account that collections of textures characterize audio tags, while disregarding long-term temporal behavior.

Later, Marques et al. [3] evaluated various descriptors based on the dictionary learning approach. In this work, tracks are represented as a histogram of occurrences of dictionary elements. As a baseline, they build the dictionary from sampling uniformly-distributed random frames. Their results show that using random frames as dictionary entries achieves at least the same classification performance as selecting the most representative frames.

Lopes et al. [14] observe that Bag of Frames approaches implicitly assume that all textures in a given track convey relevant information for genre classification. They argue that not every texture is useful for determining the decision boundaries among genres. To solve this problem, they propose a technique for selecting discriminative textures. The proposed method resembles the ideas behind wrapper feature selection strategies [15]. Their results for texture selection are not significantly better than the baseline with no selection. Nevertheless, their work raised an interesting question regarding the relevance of individual textures to represent music genres.

Bag of Frames approaches acknowledge the fact that music tracks do not contain homogeneous textures. On the contrary, each music genre can be related to several typical textures. For example, we are likely to hear the sound of electric guitar solos and fast drum rolls in heavy metal music, but not in baroque pieces.

We propose to identify these typical textures within each track by clustering their vector representations with a K-Means clustering algorithm. This allows identifying the variety of textures within a track regardless of their frequency of occurrence. The K-Means centroids are then yielded to a machine-learning algorithm that performs texture-level classification. After that, the track is classified using a majority-voting procedure along the textures.

This proposal contrasts with codebook approaches [5, 4, 3], in which K-Means is used to learn dictionaries for quantization and tracks are represented by a histogram of codewords. The codebook approach represents texture variety using occurrence ratios. On the contrary, we propose to directly use the centroids yielded by K-Means as texture representations. This allows typical sounds to be directly represented in the feature space, and the machine learning algorithm groups them into genre-typical textures.

Our method is described in the next section.

3 Proposed Method

This section describes the proposed texture selection method. Section 3.1 describes the baseline texture linear downsampling procedure. After that, Section 3.2 describes the proposed K-Means texture selection method. These methods assume that textures are represented as vectors spanned by perceptually-related features. The feature sets used in this work are described along with the experimental setup in Section 4.

3.1 Linear Texture Downsampling

The main concern presented by previous research regarding using collections of textures to describe music tracks is the amount of computing power needed to train models. Many learning algorithms do not scale well as the number of training vectors increases. Therefore, it is necessary to reduce the number of textures used for training, especially in systems that do not rely on neural networks.

A number of previous studies solve this problem by downsampling the textures of each track. Variations of this method have been used by many authors [10, 6, 9, 8, 7].

A common strategy is to pick linearly-spaced textures along the time axis. This procedure is shown in Figure 1. We call this strategy LINSPACE Texture Selection. Mathematically, a track texture matrix $T \in \mathbb{R}^{N \times D}$, with $N$ textures and $D$ features, is summarized by LINSPACE yielding a matrix $S \in \mathbb{R}^{k \times D}$ such that the rows of $S$ are $k$ rows of $T$ taken at linearly spaced positions along the time axis, where $k \leq N$ is a parameter that controls the granularity of the downsampling procedure.

Figure 1: LINSPACE Texture Selection

LINSPACE does not use content to select representative textures for genre classification. In contrast, the proposed K-Means Texture Selection aims to select appropriate textures based on the premise that there are typical sounds that are likely to appear in tracks of the corresponding genres.
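The LINSPACE procedure can be illustrated with a minimal sketch. It assumes NumPy, an (N, D) texture matrix per track, and that the first and last textures are included in the selection; the function name `linspace_select` and the rounding rule are illustrative choices, not details taken from the text.

```python
import numpy as np


def linspace_select(textures: np.ndarray, k: int) -> np.ndarray:
    """Return k rows of `textures` taken at evenly spaced time positions."""
    n = textures.shape[0]
    if k >= n:
        # Nothing to downsample: keep every texture.
        return textures.copy()
    # Evenly spaced positions over [0, n - 1], rounded to valid row indices
    # (including both endpoints is an assumption of this sketch).
    idx = np.round(np.linspace(0, n - 1, num=k)).astype(int)
    return textures[idx]


# Toy track: 100 textures with 4 features; each row stores its own time index.
track = np.repeat(np.arange(100)[:, None], 4, axis=1).astype(float)
selected = linspace_select(track, k=5)
print(selected[:, 0])  # time indices of the selected textures
```

Note that the selection depends only on the time index, never on the texture values, which is the property the K-Means selector is meant to improve on.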

3.2 K-Means Texture Selection

K-Means is a well-known clustering algorithm introduced in [16]. It works by iteratively estimating points, known as centroids, that characterize different trends in a dataset. These points can then be used to query the dataset for other points following the trend.

As stated before, each audio texture is represented by a vector in a perceptually-inspired space $\mathbb{R}^{D}$, where $D$ is the number of features that span the space. Thus, a music track can be described by a texture matrix $T \in \mathbb{R}^{N \times D}$, where $N$ is the number of textures in the track and $D$ is the number of features. When K-Means is applied to $T$, it estimates a centroid matrix $C \in \mathbb{R}^{k \times D}$, where $k$ is the number of centroids. These centroids can be interpreted as vector representations of the acoustic trends in track $T$, and $k$ is the number of trends that are extracted from the audio track. At the end of this process, $T$ is summarized into $C$, and $C$ is yielded to further classification steps. Figure 2 depicts this procedure.

Figure 2: K-Means Texture Selection

We call this method K-Means Texture Selection. Because K-Means tends to yield centroids that are distant from each other, we can expect that the centroid matrix contains information that tends to be diverse. As centroids are more likely to be located around denser point clouds, they are also likely to represent typical textures. As a result, the centroid matrix highlights diverse typical sounds within the track. To the best of our knowledge, K-Means has not been used before to select representative textures for genre classification.
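A minimal sketch of K-Means Texture Selection follows, assuming NumPy and a plain Lloyd-style iteration with random initialization. The function name `kmeans_select`, the iteration budget, and the initialization scheme are assumptions made for illustration; the text does not specify which K-Means implementation or settings were used.

```python
import numpy as np


def kmeans_select(textures: np.ndarray, k: int, n_iter: int = 50,
                  seed: int = 0) -> np.ndarray:
    """Summarize an (N, D) texture matrix into a (k, D) centroid matrix."""
    rng = np.random.default_rng(seed)
    n = textures.shape[0]
    if k >= n:
        return textures.copy()
    # Initialize centroids with k distinct textures picked at random.
    centroids = textures[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every texture to every centroid.
        diff = textures[:, None, :] - centroids[None, :, :]
        labels = (diff ** 2).sum(axis=2).argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = textures[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids


# Toy track with one frequent texture and one rare texture.
rng = np.random.default_rng(1)
track = np.vstack([
    rng.normal(0.0, 0.1, size=(200, 3)),  # frequent texture
    rng.normal(5.0, 0.1, size=(20, 3)),   # rare texture
])
centroids = kmeans_select(track, k=2)
print(centroids.round(2))
```

In this toy track the rare texture covers under 10% of the textures, yet it receives its own centroid, which illustrates the point that variety is captured regardless of frequency of occurrence.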

K-Means Texture Selection is compared to Linear Texture Downsampling in music genre classification experiments, as discussed next.

4 Experiments and Results

In this section we present experiments that aim to investigate how texture selection impacts music genre classification performance. Specifically, we are interested in the effectiveness of selecting representative textures to describe a track with K-Means clustering. To this end, we evaluated four texture selection methods: K-Means clustering (KMEANSC), described in Section 3.2; LINSPACE downsampling, described in Section 3.1; an FTS vector for each track, as proposed by Tzanetakis and Cook [1]; and using all textures with no summarization or downsampling (ALL). Our main objective was to evaluate how texture selection impacts classification performance in systems where tracks are represented by collections of textures.

To evaluate the effectiveness of texture selection in different scenarios, we designed classification systems taking into consideration two components that are commonly evaluated in MGC systems: the feature sets, and the use of feature selection. Four feature sets of varying abstraction levels and a univariate correlation filter for feature selection were evaluated along with the texture selection methods.

Our experimental setup is thoroughly described in Section 4.1. We note that our goal with these experiments is to investigate the effects of texture selection in a variety of MGC scenarios. In other words, we aimed at generating insight into the underlying mechanisms behind the result differences. To this end, we executed MGC experiments using different texture selection methods, as discussed in Section 4.2, diverse feature sets, as shown in Section 4.3, and datasets with different characteristics, as discussed in Section 4.4. After that, Section 4.5 presents the parameterization of the systems evaluated, and Section 4.6 shows classification results.

4.1 Experimental Setup

The system architecture is shown in Figure 6. It consists of three stages: Texture Calculation (Figure 6(a)), Model Training (Figure 6(b)), and Model Testing (Figure 6(c)). Texture Calculation begins by extracting features over short-time frames from the sampled music tracks. For each track, a feature matrix (A) is calculated, consisting of a feature vector for every frame. The features and their first and second-order deltas are calculated and concatenated into a single vector per frame. Textures are then calculated by aggregating 216 consecutive frames, resulting in textures that each cover a few seconds of the track. Successive textures are calculated every 10 frames, resulting in a 10x downsample and around 95% overlap between textures. We used both average and standard deviation as aggregation functions over every feature. The resulting texture matrix (T) is calculated for every track in the dataset.
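The aggregation step can be sketched as follows, assuming NumPy and a per-track feature matrix with one row per frame. The window of 216 frames and the step of 10 frames come from the text; the function name `compute_textures` is hypothetical, and the delta computation is left out for brevity.

```python
import numpy as np


def compute_textures(frames: np.ndarray, win: int = 216, hop: int = 10) -> np.ndarray:
    """Aggregate an (n_frames, n_features) matrix into overlapping textures.

    Each texture concatenates the mean and the standard deviation of every
    feature over `win` consecutive frames; windows start every `hop` frames.
    """
    n_frames = frames.shape[0]
    if n_frames < win:
        raise ValueError("track is shorter than one texture window")
    textures = []
    for start in range(0, n_frames - win + 1, hop):
        window = frames[start:start + win]
        textures.append(np.concatenate([window.mean(axis=0), window.std(axis=0)]))
    return np.vstack(textures)


# Toy feature matrix: 1000 frames, 6 features per frame.
frames = np.random.default_rng(0).normal(size=(1000, 6))
textures = compute_textures(frames)
print(textures.shape)  # (79, 12)
```

Consecutive windows share 206 of their 216 frames, which is where the figure of roughly 95% overlap comes from.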

The results of the texture calculation are yielded to a machine-learning algorithm. It relies on two different stages: model training and model testing. The model training stage uses a part of the dataset to estimate its latent parameters. The model testing stage uses another, held-out part of the dataset to evaluate the classification performance.

The input to the Model Training stage (Figure 6(b)) is the training set, which is the set of texture matrices corresponding to the training tracks. First, textures are standardised feature-wise to mean 0 and standard deviation 1. The standardisation parameters are saved to be applied later to the testing set. Texture selection is then performed on every texture matrix of the standardised training set independently. The number of desired textures per track is a parameter, which we call $k$. The texture selector outputs a training Feature Matrix (F), which concatenates all selected textures from every track in the training set. Assuming all selected textures are representative of the genre, every texture of a track is assigned the label of that track. A classifier is trained with the samples in F, along with the corresponding labels. The classifier can be seen as a function that maps each texture to a label.
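The standardisation step, with parameters estimated on the training set only and reused later, might look like the sketch below. It assumes NumPy and a list of per-track texture matrices; `fit_standardizer` and `standardize` are hypothetical helper names, and the guard for constant features is an addition of this sketch.

```python
import numpy as np


def fit_standardizer(train_tracks):
    """Estimate per-feature mean and standard deviation over all training textures."""
    stacked = np.vstack(train_tracks)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std


def standardize(tracks, mean, std):
    """Apply previously fitted parameters to every texture matrix in `tracks`."""
    return [(t - mean) / std for t in tracks]


rng = np.random.default_rng(0)
train = [rng.normal(3.0, 2.0, size=(40, 5)) for _ in range(3)]
test = [rng.normal(3.0, 2.0, size=(40, 5))]

mean, std = fit_standardizer(train)          # fitted on training data only
train_std = standardize(train, mean, std)
test_std = standardize(test, mean, std)      # same parameters reused for testing
```

Fitting on the training set alone keeps information about the held-out tracks from leaking into the model.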

The Model Testing stage (Figure 6(c)) receives a testing set as input, which is the set of texture matrices corresponding to the testing tracks. The same standardisation parameters applied to the training set are applied to the testing set, resulting in a standardised testing set.

As with the training samples, texture selection is performed independently on each texture matrix in the testing set. The same number of textures per track is used for both training and testing. The texture selector outputs a testing matrix, with every selected texture of each track in the testing set. Then, the classifier is applied row-wise to the testing matrix, resulting in a Texture Label Matrix, with a label prediction for every selected texture of every track in the testing set. A final classification is decided for each track by majority voting. The output for the entire testing set is the prediction matrix.
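The final voting step can be sketched with the standard library alone. The function name `majority_vote` and the toy labels are illustrative; ties are resolved in favour of the label seen first, which is an assumption of this sketch because the text does not state a tie-breaking rule.

```python
from collections import Counter


def majority_vote(texture_labels):
    """Return the most frequent label among a track's texture predictions."""
    counts = Counter(texture_labels)
    # most_common keeps first-seen order among equally frequent labels.
    return counts.most_common(1)[0][0]


# Texture-level predictions for two hypothetical test tracks.
texture_predictions = {
    "track_a": ["rock", "metal", "rock", "rock", "blues"],
    "track_b": ["jazz", "jazz", "classical"],
}
track_predictions = {
    track: majority_vote(labels) for track, labels in texture_predictions.items()
}
print(track_predictions)  # {'track_a': 'rock', 'track_b': 'jazz'}
```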

Figure 6: System Architecture. (a) Texture Calculation; (b) Model Training; (c) Model Testing.

4.2 Texture Selection

We used four different texture selection approaches. Both KMEANSC and LINSPACE were described in Section 3. As stated, they assume that a track is best described by a set of textures, instead of aggregating the description into a single texture. To test this hypothesis, we evaluated the classification performance using FTS as well. In this approach, a single texture is calculated for each track by aggregating all textures with average and variance.
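For contrast with the selection methods, the FTS aggregation can be written in a few lines. This is a sketch assuming NumPy and an (N, D) texture matrix; the function name `full_track_statistics` is hypothetical.

```python
import numpy as np


def full_track_statistics(textures: np.ndarray) -> np.ndarray:
    """Collapse an (N, D) texture matrix into one vector of 2 * D statistics."""
    # Average and variance of every feature across all textures of the track.
    return np.concatenate([textures.mean(axis=0), textures.var(axis=0)])


track = np.random.default_rng(0).normal(size=(940, 12))
fts = full_track_statistics(track)
print(fts.shape)  # (24,)
```

Whatever the track length, the result is a single vector, which is what makes FTS compact and also what makes it discard the variety among textures.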

We also evaluated how selecting relevant textures impacts classification performance. For this, we evaluated classification performance using ALL textures to represent a track. Naturally, the higher the number of textures used to describe tracks, the higher the cost of training and testing classification systems. Thus, by comparing the results of texture selection with ALL, we are able to determine the relationship between the number of textures per track and classification performance, as well as how selecting a subset of the textures affects performance.

Because the four texture selection techniques presented have distinct computation requirements, evaluating the resulting classification performance is important. This analysis can lead to guidelines for choosing the most appropriate texture selector given the task size and computing capabilities.

4.3 Feature Sets

We evaluated texture selection performance along with four different feature sets with different underlying assumptions. This allows us to evaluate whether the change in performance due to texture selection is dependent on the feature set describing the textures. It also allows us to draw conclusions on key factors for improving classification performance, in particular: how relevant are the low-level descriptions for music genre classification, and what is the relevance of using more textures to train classification models.

Mel-Scale Spectrograms

Two feature sets derive directly from audio data, and were used as “raw” features: MEL-SPEC and MEL-RP. The MEL-SPEC feature extraction is shown in Figure 9(a). It starts with a track signal in the time domain. Then, an STFT is calculated over fixed-length frames with 50% overlap. The absolute value of the STFT is calculated, yielding a magnitude matrix. Then, a 128-bin Mel filterbank matrix is applied to transform the magnitude matrix into a Mel-Scale spectrogram (MEL-SPEC). This transformation relies on previous research, which has shown that Mel-Scale spectrograms highlight important aspects of the audio spectrum relevant to genre classification [17].

Figure 9: Low-level Feature Sets. (a) MEL-SPEC Feature Set; (b) MEL-RP Feature Set.

Many systems that rely on feature learning have been proposed to work with raw inputs such as MEL-SPEC [12, 17, 18]. We used it to evaluate how such texture descriptions can be used in a scenario where MEL-SPEC is used directly as a feature set, not as input to a feature learning mechanism. We assume that such inputs roughly model our timbre perception by highlighting the magnitude of spectral contents. Thus, it is expected that the Euclidean distance of such texture vectors relates to music content similarity.

Mel-Scale Spectrogram Random Projection

Another feature set derived directly from data is the random projection of Mel-Scale spectrograms (MEL-RP). MEL-RP feature extraction is shown in Figure 9(b). The first step is to calculate the Mel-Scale spectrogram for the music track, as shown in Figure 9(a). Then, a linear transformation, given by a projection matrix, is applied to the Mel-Scale spectrogram. The projected matrix has fewer columns than the Mel-Scale spectrogram, which results in a projection into a lower-dimensional space. Because the projection matrix is a random matrix sampled from a Gaussian distribution, we call the result the Mel-Scale Spectrogram Random Projection (MEL-RP). MEL-RP aims at transforming each texture into a stable embedding, preserving the underlying distance topology. This allows texture classification [19] in the projected space, instead of the Mel-Scale spectrogram space.

The effectiveness of random projections for dimensionality reduction is well-known in machine learning literature [20]. The Johnson-Lindenstrauss (JL) lemma [21] states that a random matrix $\Phi \in \mathbb{R}^{M \times N}$, when $M < N$, projects the points in a set $X$ into a stable embedding with high probability if $M = O(\epsilon^{-2} \log Q)$, $0 < \epsilon < 1$. Here, $N$ is the original dimensionality of the data, $M$ is the target dimensionality, and $Q$ is the number of points in $X$. The constant $\epsilon$ quantifies the distortion introduced by the random transformation. It follows that the more distortion is allowed, the smaller $M$ can be. Thus, $M$ is a parameter that can be implicitly adjusted according to the application by optimizing it in relation to a classification system performance measure.
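The distance-preserving behaviour can be checked empirically with a small sketch. It assumes NumPy, textures stored as rows, and Gaussian entries with zero mean and variance 1/M; that scaling is a common choice and an assumption here, as are the toy dimensions and the function name `random_projection`.

```python
import numpy as np


def random_projection(textures: np.ndarray, m: int, seed: int = 0) -> np.ndarray:
    """Project (Q, N) textures into (Q, m) with a Gaussian random matrix."""
    rng = np.random.default_rng(seed)
    n = textures.shape[1]
    # Entries drawn from N(0, 1/m): an assumed scaling under which squared
    # distances are preserved in expectation.
    phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
    return textures @ phi


def pairwise_distances(x: np.ndarray) -> np.ndarray:
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


rng = np.random.default_rng(1)
textures = rng.normal(size=(50, 256))     # 50 points in a 256-dimensional space
projected = random_projection(textures, m=64)

# Ratio between projected and original distances for every distinct pair.
upper = np.triu_indices(50, k=1)
ratio = pairwise_distances(projected)[upper] / pairwise_distances(textures)[upper]
print(ratio.mean(), ratio.min(), ratio.max())
```

Ratios close to 1 indicate that neighbourhoods in the original space are largely kept after projection, and lowering `m` widens the spread, which mirrors the trade-off between distortion and target dimensionality.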

Since the embedding domain has a lower dimensionality, the computational effort needed for training a machine learning model in the projected space is expected to be smaller than for a model in the original space. Furthermore, for suitable classifiers, the curse of dimensionality can be alleviated by transforming data points into a lower-dimensional space.

To the best of our knowledge, no previous research has employed MEL-RP as a feature set for genre classification. A similar idea, proposed by Choi [12], consists of setting the weights of a convolutional neural network to random values. The classification results are used as a baseline for other approaches. The main difference is that the architecture is made up of convolution operations followed by non-linear activation functions. Choi has reported satisfactory results with this method. Furthermore, other authors have used random projections as a dimensionality reduction tool in related works [22, 23, 24].

Handcrafted Features

Classification systems with handcrafted features were also evaluated. For simplicity's sake, we call the chosen subset HANDCRAFTED. The features used are well-known in the music information retrieval literature and some of them are known as discriminative features for some tasks. Handcrafted features are based on specialist knowledge. Therefore, they can be seen as features at a higher abstraction level than the “raw” feature sets presented earlier. In the context of our work, we are interested in how handcrafted features work in tandem with texture selection.

The handcrafted features selected for the experiments are calculated for every frame from the STFT magnitude spectrum. First, the audio input signal, sampled at 44 kHz, is normalized to mean 0 and variance 1. Then, an STFT is calculated with overlap between consecutive frames. A Hamming window is applied to prevent high-frequency artifacts from slicing.

The following definitions describe the features in our HANDCRAFTED feature set. $X[k]$ is the magnitude of the FFT of a given frame at frequency bin $k$, and $K$ is the total number of frequency bins.

Tzanetakis and Cook [1] describe the Spectral Centroid as a measure of spectral brightness. It is calculated as the magnitude-weighted mean of the frequency bins.

Spectral Rolloff is a measure of spectral shape [1]. It corresponds to the frequency bin below which a given fraction of the total spectral magnitude is concentrated.

The Spectral Flux, which is a measure of spectral variation [1], is calculated from the squared differences between the magnitude spectra of successive frames.

Energy is a measure of signal strength.

Spectral Flatness is a measure of noise in an audio signal [25], and it is calculated as the ratio between the geometric mean and the arithmetic mean of the spectrum.

Zero Crossing Rate is another measure of noise in an audio signal. It is calculated over the time-domain signal by counting how often consecutive samples change sign.
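For reference, the features above are commonly written as follows in the music information retrieval literature. These are textbook forms given as a sketch, not necessarily the exact expressions used in this work: the symbols $X_t[k]$ (magnitude at bin $k$ of frame $t$), $K$ (number of bins), $x[n]$ (time-domain frame of $L$ samples), the rolloff fraction $\gamma$ (0.85 is a frequent choice), the use of squared magnitudes for energy, and the normalization of the zero crossing rate are all assumptions.

```latex
\begin{align}
\text{Centroid}_t &= \frac{\sum_{k=1}^{K} k \, X_t[k]}{\sum_{k=1}^{K} X_t[k]} \\
\text{Rolloff}_t &= R_t \quad \text{such that} \quad
  \sum_{k=1}^{R_t} X_t[k] = \gamma \sum_{k=1}^{K} X_t[k] \\
\text{Flux}_t &= \sum_{k=1}^{K} \left( X_t[k] - X_{t-1}[k] \right)^2 \\
\text{Energy}_t &= \sum_{k=1}^{K} X_t[k]^2 \\
\text{Flatness}_t &= \frac{\left( \prod_{k=1}^{K} X_t[k] \right)^{1/K}}
  {\frac{1}{K} \sum_{k=1}^{K} X_t[k]} \\
\text{ZCR}_t &= \frac{1}{L-1} \sum_{n=1}^{L-1} \left| s(x[n]) - s(x[n-1]) \right|,
  \qquad s(a) = \begin{cases} 1 & \text{if } a \geq 0 \\ 0 & \text{otherwise} \end{cases}
\end{align}
```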

MFCCs (Mel-Frequency Cepstral Coefficients) [26] are also part of the HANDCRAFTED feature set. MFCCs are widely used by the speech recognition and music information retrieval research communities. In MIR, they are commonly used as timbre descriptors.

In total, there are 26 features in the HANDCRAFTED feature set: Spectral Centroid, Spectral Rolloff, Spectral Flux, Spectral Flatness, Energy, Zero Crossing Rate and the first 20 MFCC coefficients. We are interested in how well these rather simple features work along with our proposed texture selection. Specifically, we want to verify how the classification results compare to more sophisticated feature sets, such as an autoencoder representation, discussed next.

Mel-Scale Autoencoder

An autoencoder is a neural network that learns a mapping from the input to the input itself [27]. A well-known architecture is based on a series of fully connected neuron layers. The middle layer, known as the bottleneck or the latent representation, usually has a lower dimensionality than the other layers. The idea is that the bottleneck represents a compressed form of the input signal, hence transforming the input into a lower-dimensional vector.

In contrast to the other feature sets presented above, autoencoders provide features learned directly from data. Figure 10 shows the architecture of the autoencoder used in this paper. Mel Spectrograms are the input to the autoencoder, hence its target output as well. The number of neurons in the bottleneck is a parameter. A non-linear ReLU activation function is used in the bottleneck layer, forcing a non-linear projection of the input. The output layer uses a linear activation function, thus forcing a linear reconstruction of the input signal in the output space. This is an attempt to promote linear separability among different clusters of similar points. The bottleneck layer activations are used as features. We call this feature set MEL-Autoencoder (MEL-AE).

Figure 10: Autoencoder Architecture

In preliminary tests we experimented with different autoencoder architectures. The final mean squared error was evaluated for different architectures and the results differed within small error margins. Thus, by Occam’s principle, we chose the simplest architecture in terms of number of neurons and number of layers.

4.4 Datasets

All classification systems were evaluated with four different publicly available datasets: GTZAN [1], ISMIR [28], LMD [29], and HOMBURG [30]. We chose these datasets because each one presents particular challenges. Evaluating system performance on datasets with different challenges leads to a better understanding of how effective the proposed methods are under different situations. They are also well-known in the music information retrieval community. Table 1 summarizes the datasets used in the experiments and shows their diversity regarding genres, number of tracks, track length and number of textures.

Dataset      Genres  Tracks  Textures  Track Length  Avg. Textures per Track
GTZAN        10      1000    107K      30s           107
ISMIR        6       1458    1.5M      FULL          1046
ISMIR (10s)  6       1458    31K       10s           20
LMD          10      1300    1.2M      FULL          940
LMD (10s)    10      1300    27K       10s           21
HOMBURG      9       1886    40K       10s           21
Table 1: Datasets used in the experiments

GTZAN [1] was one of the first widely available datasets for MGC and it is widely known in the research community. This dataset consists of 1000 music clips of 10 western music genres. GTZAN is balanced among its 10 genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, rock), with 100 clips each. Each music clip is 30s long. Its flaws are well-known by the research community. The main faults are: repeated tracks, multiple tracks from the same artist, cover tracks, mislabelings and extreme distortions [31]. To address these faults, we created a 3-fold dataset split following all the recommendations in [31]. We call this split GTZAN-ARTF in the results section. Although Sturm shows that there are still some unidentified excerpts in GTZAN, the proposed changes fix most of the identified faults. The ARTF split we used is available online at https://github.com/julianofoleiss/gtzan_sturm_filter_3folds_stratified/.

This 3-fold split is artist-filtered, which means that tracks from the same artist and genre do not appear in both training and testing sets. It has been shown that artist-filtered splits make the genre classification problem significantly harder [32, 33]. As a consequence, the classifier performances are lower. One of the reasons for this is that, without artist filtering, the classifier can learn patterns that relate to specific artists, such as the voice timbre of the lead singer, in contrast to patterns relating to genre, regardless of artist-specific characteristics. Often, an artist works solely on a single genre, thus all of their tracks are labelled with the same genre. Hence, a classification system may correctly predict the genre of a track based on artist-specific patterns, instead of genre-defining patterns. This causes songs from the same artist in the test set to be easier to classify, but this does not mean that the classifier has generalized well to other artists performing in the same genre.
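The idea of an artist filter can be sketched with a greedy fold assignment that keeps every artist inside a single fold. This is only an illustration with hypothetical track tuples and a hypothetical function name; it balances fold sizes but does not stratify by genre, and it is not the procedure used to build the published splits.

```python
from collections import defaultdict


def artist_filtered_folds(tracks, n_folds=3):
    """Assign tracks to folds so that no artist is spread over two folds.

    `tracks` is a list of (track_id, artist, genre) tuples.
    """
    by_artist = defaultdict(list)
    for track_id, artist, _genre in tracks:
        by_artist[artist].append(track_id)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place larger artists first, always into the fold
    # that currently holds the fewest tracks.
    for artist in sorted(by_artist, key=lambda a: (-len(by_artist[a]), a)):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(by_artist[artist])
    return folds


tracks = [
    ("t1", "A", "rock"), ("t2", "A", "rock"), ("t3", "B", "rock"),
    ("t4", "C", "jazz"), ("t5", "C", "jazz"), ("t6", "D", "jazz"),
    ("t7", "E", "pop"),
]
folds = artist_filtered_folds(tracks)
print(folds)
```

With such folds, any test fold contains only artists that the classifier never saw during training, so artist-specific cues cannot inflate the reported accuracy.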

GTZAN is a widely used dataset for evaluating genre classification systems [34]. Often, a random 10-fold split is used, hence ignoring the artist filtering problem. For comparison purposes, we also included the results with this random split. We call this split GTZAN-RANDOM.

LMD (Latin Music Dataset) [29] is another well-known dataset that consists of popular Latin-American music genres. This dataset was originally labelled according to dancing patterns, especially for deciding between genres that are similar regarding timbre, such as axé/pagode and gaúcha/sertaneja. We used an artist-filtered subset of LMD in our experiments, which avoids the same problems present in the GTZAN dataset and allows a balanced 3-fold split. We used the same split as in [35, 10]. We call this split LMD-ARTF. This subset consists of 1300 full-length music tracks across 10 different genres, each with 130 tracks. The LMD-ARTF split we used is available online at https://github.com/julianofoleiss/lmd_3f_stratified_artist_filter/.

The ISMIR 2004 [28] dataset consists of 1458 full-length music tracks across 6 western genres. We used the train/test split supplied on the ISMIR 2004 challenge homepage, with 729 tracks in each of the training and testing sets. No artist filter was applied to this dataset because artist information is only available for tracks in the training set. This dataset is not balanced, thus the number of tracks is not the same for all classes. Classical is the largest class in the training set (320 tracks); the remaining classes are electronic (115), jazz_blues (26), metal_punk (45), rock_pop (101) and world (122). The number of tracks per genre in the testing set is similar to the training set.

HOMBURG [30] is a lesser known dataset that also consists of western music genres. It comprises 1886 music clips across 9 genres. This dataset is interesting for evaluating texture selection techniques, since the tracks are only 10s long. We used a standard 10-fold random split for cross validation. We did not apply an artist filter to HOMBURG because the number of tracks per artist is too low, as shown in Table 2.

Genre Tracks Artists Avg. Tracks per Artist
Alternative 145 121 1.20
Blues 120 80 1.50
Electronic 113 97 1.16
Folk/Country 222 177 1.25
Funk/Soul/RnB 47 39 1.21
Jazz 319 214 1.49
Pop 116 106 1.09
Raphiphop 300 210 1.43
Rock 504 447 1.13
TOTAL 1886 1491
Weighted Avg. 1.28
Table 2: Number of Tracks and Artists in the HOMBURG dataset

4.5 Configurations

As presented earlier, Figure 6 shows the general architecture of the classification systems evaluated. The highlighted Feature Extractor, Texture Selector, and Classifier Training boxes represent the components for which we experimented with different hyperparameter settings.

The Feature Extractor parameters evaluated are shown in Table 3. The four feature sets evaluated were presented in Section 4.3. A single feature set was evaluated at a time. The MEL-RP and MEL-AE input spectrograms are calculated with the same parameters as MEL-SPEC. HANDCRAFTED features are calculated from the STFT instead.

Four target dimension sizes for MEL-RP were chosen linearly between 25 and 100. 26 was used instead of 25 because it allows us to compare the results in terms of dimensionality directly to HANDCRAFTED features, which also have 26 features. To get an idea of how the system performed at a very low dimensionality, we also evaluated the system with only 8 random projection features. For MEL-AE, five different bottleneck sizes are evaluated. Relative to the 128 Mel bins of the input, they represent compression rates of 8, 4, 2, 1 and 0.5 times, respectively.

The number of features in each feature set is shown in Table 4. Column Texture Size shows the actual number of features in the texture vector after the aggregation procedures. As described earlier, frame-level features are aggregated with both first and second-order deltas, as well as the original features. Then, frame-level features are aggregated into textures using both the average and standard deviation. This aggregation strategy increases the number of features by a factor of 6.
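The factor of 6 follows from stacking three frame-level streams (original, first-order delta, second-order delta) and then summarizing each with two statistics (mean and standard deviation). The sketch below illustrates this bookkeeping with NumPy. The window and hop lengths (`win`, `hop`) and the simple difference used for the deltas are assumptions made for illustration only; the paper's actual texture window settings are defined in an earlier section.

```python
import numpy as np

def delta(x):
    """First-order difference along time, padded so the frame count is kept."""
    return np.diff(x, axis=0, prepend=x[:1])

def textures(frames, win=4, hop=2):
    """Aggregate frame-level features (n_frames, d) into textures (n_textures, 6 * d)."""
    d1 = delta(frames)
    d2 = delta(d1)
    stacked = np.hstack([frames, d1, d2])          # (n_frames, 3 * d)
    out = []
    for start in range(0, len(stacked) - win + 1, hop):
        block = stacked[start:start + win]
        # Mean and standard deviation double the dimensionality: 3 * d -> 6 * d.
        out.append(np.concatenate([block.mean(axis=0), block.std(axis=0)]))
    return np.array(out)

frames = np.random.default_rng(0).normal(size=(20, 26))   # toy input with 26 features per frame
tex = textures(frames)                                     # shape (9, 156)
```

With 26 frame-level features the texture vector has 156 entries, and with 128 features it has 768, which matches the Texture Size column of Table 4.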

Feature Sets Parameters Values
MEL-SPEC Mel Bins 128
FFT Size 2048
Hop Size 1024
Window Hanning
MEL-RP Target Dim (M) {8, 26, 51, 75, 100}
MEL-AE Bottleneck Dim (H) {16, 32, 64, 128, 256}
Hop Size 1024
MFCC Coeff. 20
Window Hanning
Table 3: Feature Extraction Parameters

Feature Sets Features Texture Size
MEL-RP 8 48
MEL-RP 26 156
MEL-RP 51 306
MEL-RP 75 450
MEL-RP 100 600
MEL-SPEC 128 768
MEL-AE 16 96
MEL-AE 32 192
MEL-AE 64 384
MEL-AE 128 768
MEL-AE 256 1536
Table 4: Number of Features and Texture Size per Feature Set

The Texture Selector parameters are shown in Table 5. Two baselines are used: FTS vectors and ALL textures. Since FTS implicitly assumes that the music track can be described by a roughly homogeneous texture, we expect it to be the lower baseline for all experiments. We also present the results using ALL textures to describe each track. This elevates computing costs, especially for the training phase of the SVM learning algorithm. In these cases, we used the ThunderSVM implementation [36] to train the SVM using a GPU. The ALL baseline gives us an idea of how effective our other two texture selection strategies are. However, we do not consider ALL to be the ceiling for texture selection classification performance. This is related to how some textures within a track can be outliers.

A single parameter is evaluated for both KMEANSC and LINSPACE: the number of textures used to describe each track. A low number such as 5 textures is used to evaluate which selector achieves good classification performance with lower computing power requirements. A good classification with few textures also gives us an idea of the relevance of the selected textures with respect to the target genre. We also evaluate the systems with an increasing number of textures. This allows us to determine the importance of the number of textures for genre classification.

Texture Selectors Parameters Values
KMEANSC # of Textures (K) {5, 20, 40}
FTS Aggregation MEAN+STDEV
Table 5: Texture Selection Parameters
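To make the two selectors concrete, the sketch below shows one plausible reading of each, using only NumPy. LINSPACE keeps textures at evenly spaced positions in the track. For KMEANSC, the sketch assumes that the cluster centroids themselves are used as the selected textures; whether the original implementation returns centroids or the nearest real textures, and how it initializes and stops K-Means, are assumptions here. The function names, `n_iter` and `seed` are hypothetical.

```python
import numpy as np

def linspace_select(textures, k):
    """Keep k textures at evenly spaced positions along the track."""
    idx = np.linspace(0, len(textures) - 1, num=k).round().astype(int)
    return textures[idx]

def kmeans_select(textures, k, n_iter=50, seed=0):
    """Cluster a track's textures with plain Lloyd iterations and return the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = textures[rng.choice(len(textures), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every texture to every centroid.
        dist = ((textures[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = textures[labels == j]
            if len(members):            # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

track = np.random.default_rng(1).normal(size=(200, 12))   # toy track: 200 textures
lin5 = linspace_select(track, 5)
km5 = kmeans_select(track, 5)
```

The key difference is that `linspace_select` ignores the audio content, whereas `kmeans_select` depends on how the textures are distributed in feature space.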

The Classifier Training procedure consists of feature selection and training the classifier. We chose to use a simple univariate correlation filter for feature selection. A Pearson correlation coefficient is calculated for each feature in relation to the corresponding training labels. Only the features corresponding to the highest scoring coefficients are used for training. This is known as ANOVA feature selection. The number of features to keep is a parameter that should be evaluated. We chose to keep a fraction of the total number of features. The fractions evaluated are shown in Table 6. The same features are then used for model testing later. It is important to keep in mind that our goal is to understand the effects of texture selection, not necessarily to achieve the best possible results. With that in perspective, this feature selection filter is used because it is intuitive and well known, thus simplifying analysis.

Parameters Values
Fraction to Keep {0.2, 0.4, 0.6, 0.8}
Table 6: ANOVA Feature Selection Parameters
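A univariate filter of this kind scores each feature independently against the labels and keeps the top fraction. The sketch below assumes the one-way ANOVA F statistic as the score, which is what the name "ANOVA feature selection" commonly refers to (scikit-learn's `f_classif` computes the same quantity); it is not claimed to reproduce the exact scoring used in the experiments. The function names and the toy data are hypothetical.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic of each feature column with respect to class labels."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    df_between = len(classes) - 1
    df_within = len(X) - len(classes)
    return (between / df_between) / (within / df_within)

def keep_top_fraction(scores, fraction):
    """Indices of the highest-scoring features, e.g. fraction=0.2 keeps 20% of them."""
    n_keep = max(1, int(round(fraction * len(scores))))
    return np.sort(np.argsort(scores)[::-1][:n_keep])

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 10))
X[:, 3] += y * 3.0          # make feature 3 clearly class-dependent
selected = keep_top_fraction(anova_f_scores(X, y), 0.2)   # keeps 2 of 10 features
```

The indices in `selected` would be computed on the training textures only and then reused unchanged on the test textures.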

Two learning algorithms are used for mapping textures into genres. The Support Vector Machine (SVM) is a well-known learning algorithm used in various machine learning music applications. It relies on mapping points into a higher-dimensional space where they become linearly separable by a hyperplane. Thus, by design, SVM is not only robust to high-dimensional data, but projecting data into high-dimensional spaces is part of its strategy to solve the pattern classification problem. Two regularization parameters responsible for softening the decision boundary are usually optimized, namely C and γ. Table 7 shows the SVM parameters evaluated in the experiments. The main drawback of the SVM is the time complexity of its training algorithm, which depends on the number of training samples, the dimensionality of the training vectors and the type of kernel.

Learning Algorithm Parameter Values
SVM C {1, 10, 100, 1000}
Kernel rbf
KNN K {1, 3, 5, 7, 9}
Table 7: Learning Algorithm Parameters
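The grids in Tables 6 and 7 can be enumerated jointly, pairing every classifier setting with every feature selection fraction. The sketch below only counts configurations from the values listed in those tables; parameters not listed there, and runs without feature selection, are deliberately left out. The dictionary layout and the `configurations` helper are hypothetical.

```python
from itertools import product

svm_grid = {"C": [1, 10, 100, 1000], "kernel": ["rbf"]}
knn_grid = {"K": [1, 3, 5, 7, 9]}
keep_fractions = [0.2, 0.4, 0.6, 0.8]

def configurations(classifier_grid, fractions):
    """Every pairing of one classifier setting with one feature-selection fraction."""
    names = list(classifier_grid)
    for values in product(*(classifier_grid[n] for n in names)):
        for fraction in fractions:
            yield dict(zip(names, values), fraction_to_keep=fraction)

svm_configs = list(configurations(svm_grid, keep_fractions))   # 4 x 1 x 4 = 16
knn_configs = list(configurations(knn_grid, keep_fractions))   # 5 x 4 = 20
```

Under these assumptions, each texture selector and feature set pairing would require 16 SVM models and 20 KNN models per fold.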

The K-Nearest Neighbors (KNN) learning algorithm is another well-known learning algorithm. It uses a distance metric to measure similarity between points, in which similarity is inversely proportional to distance. Assuming that the feature sets capture relevant information, similar-sounding music excerpts yield points that are close in feature space. A commonly used distance metric is the Euclidean distance, which assumes that the weight of each feature is the same. KNN classifies an unknown point by finding its K nearest points in the training set. Then, a majority vote with the corresponding labels is used to make the final decision. Table 7 shows the values for K we evaluate in our experiments.

The main advantage of KNN is that it can be implemented efficiently. A distance tree can lower the nearest neighbor computation to logarithmic time with respect to the number of training samples, thus making predictions very efficient. Training then consists of building the distance tree, which can also be done efficiently. Another advantage is related to its interpretability, because Euclidean distance can be immediately related to similarity. However, interpretability is conditioned on the interpretability of the individual features of the dataset. A drawback of KNN is that it suffers from the curse of dimensionality, resulting in degraded classification performance in high-dimensional spaces.
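The decision rule itself is short enough to write out. The sketch below is a brute-force version that computes every distance explicitly, so it is linear rather than logarithmic in the number of training samples; a distance tree would replace the sorting step. The function name and the toy points are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    dist = np.linalg.norm(train_X - query, axis=1)    # Euclidean distances
    nearest = np.argsort(dist)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

train_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
train_y = ["classical", "classical", "classical", "metal", "metal"]
label = knn_predict(train_X, train_y, np.array([4.8, 5.2]), k=3)   # "metal"
```

Here two of the three nearest neighbors are labelled "metal", so the vote goes to that class even though one "classical" point is included.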

4.6 Results

The results presented in this section are F1-scores weighted by class support. We report average F1-scores and standard deviations across folds. The number of folds varies depending on the dataset, as explained in Section 4.4. For each fold, 80% of the training set textures are randomly selected for model training and the remaining 20% are used for validation of hyperparameters. The parameters evaluated for feature selection and for each learning algorithm are shown in Tables 6 and 7, respectively. Both the learning algorithm parameters and feature selection parameters were optimized at the same time. For both SVM and KNN, one model was evaluated for every pairing of a classifier parameter combination with a feature selection parameter combination, for each (texture selector, feature set) combination.
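For reference, a support-weighted F1-score is the per-class F1 averaged with weights proportional to the number of true examples of each class. The following sketch computes it from scratch; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_cls in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n_cls - tp
        f1 = 2 * tp / (2 * tp + fp + fn)     # denominator is at least n_cls > 0
        total += n_cls * f1
    return total / len(y_true)

y_true = ["jazz", "jazz", "jazz", "rock"]
y_pred = ["jazz", "jazz", "rock", "rock"]
score = weighted_f1(y_true, y_pred)   # (3 * 0.8 + 1 * 2/3) / 4
```

Weighting by support matters for unbalanced datasets such as ISMIR 2004, where large classes contribute more to the final score than small ones.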

To increase readability, we use the word “significant” exclusively to mean “statistically significant”. Unless otherwise stated, we used a two-tailed paired Student’s T-test to test for statistical significance between any two sample measurements. We used the p-value threshold of 5% for rejecting the null-hypothesis. Also regarding statistical wording, we use the word “comparable” to mean “a difference lower than two standard deviations”.
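As an illustration of the test described above, the sketch below computes the paired t statistic for two systems evaluated on the same folds. The per-fold scores are invented for the example. Obtaining a p-value requires the Student's t distribution with n - 1 degrees of freedom; `scipy.stats.ttest_rel` returns both the statistic and the p-value.

```python
import math

def paired_t_statistic(a, b):
    """t statistic of a paired Student's t-test for two equally long score lists."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

# Hypothetical per-fold F1-scores of two systems on the same three folds.
system_a = [0.80, 0.82, 0.84]
system_b = [0.70, 0.74, 0.72]
t = paired_t_statistic(system_a, system_b)   # n - 1 = 2 degrees of freedom
```

With only three folds, the two-tailed 5% critical value is about 4.303, so differences must be large and consistent across folds to be declared significant.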

Autoencoders were trained for each fold independently. This was done to keep a consistent separation between training and testing data, just as training and testing data are separated to evaluate the classification system as a whole. The number of features is a parameter in both MEL-RP and MEL-AE. For brevity's sake we present the results for and . In most cases, results improve as the number of features increases up until and . After that, performance saturates and does not significantly increase any further.

Considering the stochastic behavior in the KMEANSC texture selector resulting from the randomness of the initial centroid candidates, all the experiments were performed three times. The classification performance difference was not statistically significant in any case, thus we report the median results among all three runs.

To increase readability, we use the St notation to refer to a texture selector S selecting t textures per track. For instance LINSPACE5 indicates that LINSPACE is being used to select 5 textures. Another example is KMEANSC20+, which refers to the texture selection setups with KMEANSC as the texture selector with 20 textures or more.

Preliminary experiments showed us that there are considerable improvements in classification performance for full-length music tracks with texture selection. To determine whether the length of the tracks is relevant for improvement with texture selection regardless of the dataset, we also experimented on 10s samples with both ISMIR and LMD datasets, which are available as full-length tracks. This is the same length as the tracks in the notoriously difficult HOMBURG dataset. The 10s samples for both ISMIR and LMD are taken from the middle of each song. We call the sliced datasets ISMIR-10s and LMD-10s. The folds used to evaluate these datasets are the same as the folds used to evaluate their full-length equivalents, making comparison straightforward. For clarity, we present the results using full-length tracks first, followed by the results on short-length tracks.
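Taking a 10s sample from the middle of a song amounts to computing a centred window over the audio samples. The sketch below shows one way to do this; the function name, the sample rate and the track duration are illustrative assumptions, and tracks shorter than the requested excerpt are returned whole.

```python
def middle_excerpt_bounds(n_samples, sample_rate, seconds=10.0):
    """Start and end sample indices of a `seconds`-long excerpt centred in the track."""
    length = min(n_samples, int(round(seconds * sample_rate)))
    start = (n_samples - length) // 2
    return start, start + length

# A 3-minute track sampled at 44.1 kHz (hypothetical values).
start, end = middle_excerpt_bounds(n_samples=180 * 44100, sample_rate=44100)
```

For this example the excerpt spans seconds 85 to 95 of the track, and the audio would then be sliced as `audio[start:end]` before feature extraction.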

Full-length track Datasets

The results with SVM for the datasets with full-length tracks, that is, LMD and ISMIR, are shown in Figures 13 and 16, respectively. For both LMD and ISMIR, the results for KMEANSC and LINSPACE using more than 5 textures per track are significantly better than using FTS. This result applies for all feature sets, and the best results are similar across feature sets.

For KMEANSC5, the results are comparable to the results with more than 5 textures. On the other hand, the results with LINSPACE5 are significantly worse than FTS in almost all cases. This indicates that using more textures per track alone is not enough to improve classification performance. We discuss some possible reasons for this in Section 5.

(a) SVM
Figure 13: Average F1-score with SVM across 3 Artist-Filtered Folds for the LMD Dataset
(a) SVM
Figure 16: F1-score with SVM for the ISMIR Dataset (ISMIR2004 contest train/test split)

Figures 13 and 16 also show the results with ALL textures. For both KMEANSC20+ and LINSPACE20+, results are statistically the same as for ALL textures. Because training an SVM model with all textures is very expensive in terms of computing power, obtaining the same result with a fraction of the textures is desirable. This also shows that it is possible to undersample tracks while preserving essential information for classification.

The best results shown for both LMD and ISMIR are comparable to the state-of-the-art [38, 39, 40, 10, 35], suggesting that the results reached a glass ceiling. We could not observe statistically significant differences between the results yielded by KMEANSC20+, LINSPACE20+, and ALL.

It is worth highlighting that no significant difference was observed between the results yielded by KMEANSC20+ and LINSPACE20+ in both full-length datasets. Moreover, performance improves as the number of textures per track increases in both cases. However, there is a clear trade-off between computing power required for training and classification performance. In most cases, KMEANSC5 achieves results comparable to ALL in both datasets and for every feature set. Thus, it is reasonable to highlight its performance tradeoff advantage when compared to KMEANSC20+ or LINSPACE20+.

Feature selection is commonly used in MGC systems. Therefore, we also evaluated the relationship between feature selection and texture selection regarding genre classification performance. Figures 13a and 16a show the results for no feature selection (using all features), while Figures 13b and 16b present the results with ANOVA feature selection. For LMD, feature selection seems to improve results for all texture selection strategies and feature sets. Specifically, feature selection significantly improves results for LINSPACE5, although they are still statistically worse than the results obtained with KMEANSC20+, LINSPACE20+ or ALL. Results for FTS are also improved in some cases. However, the feature selection improvements on FTS are not statistically significant. This suggests that the result improvement due to texture selection is greater than that obtained by using ANOVA feature selection in the LMD dataset.

For ISMIR, feature selection led to positive improvements with LINSPACE5. However, none of the LINSPACE5 results are comparable to the ones achieved with the remaining texture selectors. This behavior is observed both for the results with and without feature selection.

Even in the worst cases, feature selection does not significantly decrease performance compared to the cases where no feature selection takes place. There are also some cases where feature selection improves results along with texture selection. This suggests that texture and feature selection are complementary towards classification in both full-length datasets evaluated.

Results with the KNN learning algorithm for both LMD and ISMIR are shown in Figures 52 and 61, respectively. As with SVM, results improve as the number of textures per track increases for both KMEANSC and LINSPACE with all four feature sets. LINSPACE5 does not improve results when compared to FTS, as with SVM.

Without feature selection (Figure 52a), in contrast to SVM, the results with KMEANSC20+ for LMD are better than using ALL textures. Specifically, KMEANSC20+ is significantly better than ALL with the MEL-SPEC-AE feature set. The KNN algorithm is known for being less robust to outliers than SVM. As argued before, using ALL textures may include textures that are not typical of the genre. These untypical textures can be considered outliers, degrading KNN performance. We explore this outlier problem further in Section 5.

Overall, there is a clear improvement in all results with KNN and ANOVA feature selection for the LMD dataset (Figure 52b). Apart from selecting features that are more linearly correlated to the output of the training set, ANOVA also lowers the texture dimensionality. The KNN algorithm is known for performing poorly with high dimensional data. In high dimensionality, the density of point clouds becomes too low, with many points becoming equidistant, thus degenerating euclidean distance. Thus, lowering texture dimensionality with a feature selection algorithm such as ANOVA can improve KNN performance.
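The loss of contrast between near and far points can be observed numerically. The sketch below draws uniformly distributed random points and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. It is a generic illustration of distance concentration on synthetic data, not an analysis of the texture features used in this work; the function name and sampling setup are assumptions.

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dist = np.linalg.norm(points - query, axis=1)
    return (dist.max() - dist.min()) / dist.min()

low = relative_contrast(dim=2)
high = relative_contrast(dim=768)    # same size as the MEL-SPEC texture vector in Table 4
```

In two dimensions the farthest point is many times farther than the nearest one, whereas in 768 dimensions the two distances differ by only a small fraction, which makes nearest-neighbor votes less reliable.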

For the ISMIR dataset, performance improvements due to feature selection are more modest than with LMD. There is a clear improvement in LINSPACE5, although it still does not achieve the results of the other texture selectors. However, the results with feature selection are all comparable to the results with no feature selection. Thus, feature selection had no significant effect for this dataset.

In general, the results with SVM are better than with KNN in both LMD and ISMIR. More importantly, the improvement patterns related to the number of textures and texture selectors are the same for both classifiers. In other words, for both KNN and SVM, the results improve as the number of textures used to describe each track increases, except for LINSPACE5+. The results for KMEANSC5+ are superior to LINSPACE5+ for both learning algorithms. Lastly, the results for both LINSPACE20+ and KMEANSC20+ are better or statistically the same as using ALL. This suggests that the improvement from texture selection in full-length datasets is not tied to a particular classifier and can also happen with other classifiers.

Short-length track Datasets

Results with both KNN and SVM for the HOMBURG dataset are shown in Figure 33. In contrast to ISMIR and LMD, which are full-length tracks, the tracks in HOMBURG are only 10s long. The results show that there is no improvement with multiple textures when compared to FTS. Also, we could not observe significant result differences between LINSPACE and KMEANSC in any case, for all feature sets evaluated. We could also observe no significant improvement resulting from feature selection. Therefore, the trends in classification improvement due to texture selection in the full-length datasets are not present in the HOMBURG dataset.

Because the tracks in the HOMBURG dataset are only 10s long, the corresponding textures are more likely to be homogeneous throughout each track. One of the main assumptions with bag of frames approaches is that music tracks are made of a variety of heterogeneous textures. Multiple textures per track are able to describe the varied nature of music. Furthermore, genre models inferred from a greater variety of textures are more likely to represent a more accurate collection of typical textures per genre. With short tracks this heterogeneity assumption is less likely to be true.

We tested this hypothesis by building two additional datasets from ISMIR and LMD, in which we used only 10s from the middle of each track. We called the resulting datasets ISMIR-10s and LMD-10s. The results for these shortened datasets allow us to determine whether track length is a relevant factor that deters improvement with multiple textures. Because HOMBURG is only available in the 10s form, we are not able to evaluate it with full tracks.

Figures 52e-h and 61e-h show the results for LMD-10s and ISMIR-10s, respectively. When compared to the cases where the full-length tracks were used, shown again in Figures 52a-d and 61a-d, the results for both LMD-10s and ISMIR-10s are significantly lower. Furthermore, with SVM in both LMD-10s and ISMIR-10s there is no improvement in F1-score as the number of textures increases. There is also no clear improvement with ANOVA feature selection, except for KNN in LMD-10s. This improvement is most likely related to dimensionality reduction. The results with SVM are significantly better than KNN in all cases for both LMD-10s and ISMIR-10s.

Overall, there was no improvement due to texture selection for any of the three datasets with short tracks. There is also no improvement due to ANOVA feature selection. In fact, none of the improvements shown with full-length tracks are present in the shortened datasets. This suggests that the improvement due to texture selection is dependent on track length. The fact that we evaluated texture selection systems with full-length and 10s tracks of both LMD and ISMIR further provides evidence that texture selection works best with longer tracks.

We also evaluated texture selection on the GTZAN dataset, which consists of 30s tracks. The results for GTZAN without an artist filter (GTZAN-RANDOM) are shown in Figure 38. Similar to the case where full-length tracks were used, the results with SVM (Figure 38c) improve as the number of textures increases. This is true for all feature sets, and all feature sets reach comparable results. In contrast to the full-length tracks, however, the difference between LINSPACE5 and KMEANSC5 is no longer significant in every case, although the average result is still better with KMEANSC5.

The results for GTZAN-RANDOM with ANOVA and SVM are shown in Figure 38d. ANOVA seems to significantly improve results, mostly for FTS and LINSPACE5. Specifically, LINSPACE5 became significantly better than FTS for both MEL-SPEC and MEL-SPEC-AE. In most cases FTS performs significantly worse than the remaining texture selectors, except for HANDCRAFTED and MEL-SPEC-RP.

The results for GTZAN-RANDOM with KNN are shown in Figures 38a,b. In general the results are significantly worse than SVM for every experiment. However, the same trend of performance increasing with the number of textures is also present. Once again, this suggests that the effect of texture selection may not be dependent on the classifier. Furthermore, the effects of feature selection are more pronounced with KNN. This, along with the improvement seen in some cases with SVM, shows that texture selection can be used along with feature selection and may bring further improvements to classification performance.

As stated previously, we have also evaluated texture selection with an artist-filtered version of the GTZAN dataset (the GTZAN-ARTF split, referred to here as GTZAN-ART). Figure 43 shows the results for GTZAN-ART. The F1-scores are lower than GTZAN-RANDOM in all cases for both SVM and KNN. The trend of improvement in classification performance as the number of textures increases is still present, as shown in Figure 43c. However, the results with texture selection are not significantly higher than the FTS baseline. A higher variance compared to GTZAN-RANDOM across folds in all experiments shows that the artist filter makes generalization more difficult in GTZAN. This higher variance makes the texture selection results in GTZAN-ART not significantly better than the baseline, although the average results are better.

The results for GTZAN-ART with KNN are shown in Figures 43a,b. Just as with GTZAN-RANDOM, the results with SVM are superior to KNN. Again, the classification performance improvements are also present as the number of textures increases. Feature selection improves KNN results more than SVM, although the improvements are not significant in most cases.

The positive result for texture selection in both GTZAN-RANDOM and GTZAN-ART is interesting because it shows that it may lead to gains in cases where the entire track is not available. In the case of GTZAN-RANDOM, the gains were significant compared to the FTS representation.

The presented results show that the effectiveness of texture selection is dependent on the length of the tracks in the training dataset. When track representations are heterogeneous with respect to their respective textures, they are more likely to capture different variations of typical genre sounds. This can lead to classification models that are capable of greater generalization.

Texture selection had the largest positive impact on datasets made up of full-length tracks. The improvement due to texture selection increases as the number of textures per track increases. This improvement was observed in four different feature sets of varying complexity. In most cases, KMEANSC5 led to results that are comparable to, or not significantly different from, the results obtained with more than 5 textures. Furthermore, we also showed that in most cases LINSPACE5 does not match the performance of KMEANSC5 and does not improve results over the FTS baseline. These two factors suggest that our proposed K-Means based texture selection technique is a promising candidate for texture selection. It offers a good compromise between the additional computing power needed for training models and classification improvement.

We have also shown the interaction between texture selection and univariate feature selection based on ANOVA. Feature selection and texture selection worked in a complementary fashion in most cases. This suggests that using both techniques together can improve results in other full-length datasets.

Considering the positive results achieved with full-length track datasets, we propose evaluating texture selection in future music genre classification systems. Just as feature selection has been shown to greatly improve music classification systems, texture selection may also provide classification improvement at a relatively low extra cost during system training and testing.

In the next section we present qualitative results and discussions aimed at a better understanding of the effects of texture selection for music genre classification.

5 Qualitative analysis

This section analyzes the texture selection processes in an example audio excerpt. First, we use Principal Component Analysis (PCA) to identify the effects of different texture selection processes in the representation of a single track, as shown in Section 5.1. After that, we use t-SNE unsupervised manifold learning to identify the effects of texture selection in full datasets, as discussed in Section 5.2.

5.1 Single Track Texture Selection

In this section, we show the effects of texture selection in the representation of single tracks. To this end, we used PCA projections of textures calculated from the GTZAN dataset using the handcrafted feature set. This allows visualizing the projections related to a single track over a genre-related point cloud, as shown in Figure 21.
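A two-dimensional PCA projection of this kind can be obtained from a singular value decomposition of the centred texture matrix. The sketch below uses random data as a stand-in for real textures; the function name is hypothetical, and the preprocessing applied before the projections in Figure 21 (for example, feature scaling) is not specified here and is therefore not reproduced.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto their first principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

textures = np.random.default_rng(0).normal(size=(300, 156))   # toy stand-in for real textures
points_2d = pca_project(textures)
```

To overlay a single track on a genre point cloud, the mean and principal directions would be estimated once on the reference textures and then reused to project the textures of the track.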

The discussions conducted here use the track “Wisdom of the Kings”, as played by the Italian Heavy Metal band Rhapsody. This track was chosen because it has both typical Heavy Metal parts, with fast drums and guitars, and interludes composed in Classical style. As shown in Figure 21a, the textures of the track span over both the Metal and the Classical clouds.

The FTS representation shown in Figure 21b correctly positions the track close to the Metal genre. However, it discards the information related to the Classical parts. This indicates that FTS representations can fail to represent genre fusion within musical tracks, regardless of their effectiveness for the classification task.

This genre fusion information was preserved by both the K-Means selection (Figure 21c) and the linear downsampling (Figure 21d) representations. It is interesting to note that the linear downsampling selected more of the textures that are used most often in the track, whereas the K-Means selection selected textures that are more widely distributed across the space. This indicates that K-Means texture selection is more effective at representing the texture palette used in each track.

(a) ALL
(b) FTS
Figure 21: PCA projections of the Texture Selection Approaches over the GTZAN dataset (HANDCRAFTED features)

This means that K-Means texture selection can be used to summarize songs. To do so, it is possible to select one texture from each of the K clusters and then concatenate them into a single track. This resulted in Audio Sample 1 (http://bit.ly/2Xabeob), presented as supplementary material. It is also possible to use linear downsampling to generate another track summary, as shown in Audio Sample 2 (http://bit.ly/2Qsgl0m). However, this result is generated independently of audio content. This means that the chosen textures are probably those that are used more often in the track, not those that best represent the track diversity.
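The summarization idea can be sketched as two steps: pick one representative texture per cluster, then concatenate the audio covered by the chosen textures. In the sketch below the representative is the texture closest to its cluster mean, which is an assumption, since the text only states that one texture per cluster is selected. The cluster labels are taken as given, and the function names and the mapping from texture index to audio samples (`texture_hop`, `texture_len`) are hypothetical.

```python
import numpy as np

def summary_indices(textures, labels):
    """For each cluster, pick the index of the texture closest to the cluster mean."""
    chosen = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        center = textures[members].mean(axis=0)
        dist = np.linalg.norm(textures[members] - center, axis=1)
        chosen.append(members[dist.argmin()])
    return sorted(int(i) for i in chosen)     # keep the original temporal order

def concatenate_segments(audio, indices, texture_hop, texture_len):
    """Join the audio samples covered by the chosen textures into one summary signal."""
    return np.concatenate(
        [audio[i * texture_hop: i * texture_hop + texture_len] for i in indices]
    )

textures = np.array([[0.0], [0.2], [0.1], [5.0], [5.4], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])          # cluster assignment, e.g. from K-Means
audio = np.arange(60)                          # toy "signal": 6 textures x 10 samples
picked = summary_indices(textures, labels)     # [2, 5]
summary = concatenate_segments(audio, picked, texture_hop=10, texture_len=10)
```

In this toy case the summary consists of the audio belonging to textures 2 and 5, one from each cluster, in their original order.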

The next section discusses the effects of using different texture selection methods in the whole dataset.

5.2 Dataset Texture Selection

This section discusses the effects of texture selection in the representation of datasets. To this end, we used t-SNE projections of textures calculated from the ISMIR dataset using the handcrafted feature set. We visualize the projection related to the Classical and Electronic genres, as shown in Figure 28.

A comparison between the spaces spanned by selecting 5 textures with K-Means (Figure 28a) and with linear downsampling (Figure 28b) shows that the K-Means selection leads to a larger number of observable clusters. This means that the selected textures are more diverse, which suggests that they can be more informative when training classifiers. This difference disappears when 10s excerpts are used instead of full tracks. This behavior, shown in Figures 28c and 28d, indicates that the texture selection method is less relevant when the audio tracks contain less information. As a result, there is no observable difference between the results yielded by each method in datasets comprising short excerpts, as discussed in Section 4.6.

When more textures are selected from each track, the corresponding clusters tend to grow and, eventually, merge. This is shown in Figures 28e and 28f. Nevertheless, the clusters related to linear texture downsampling seem to merge faster than the ones related to K-Means selection.

(a) KMEANSC 5 (full)
(b) LINSPACE 5 (full)
(c) KMEANSC 5 (10s)
(d) LINSPACE 5 (10s)
(e) KMEANSC 20 (full)
(f) LINSPACE 20 (full)
Figure 28: t-SNE embeddings depicting feature spaces derived from texture selection over the ISMIR dataset (HANDCRAFTED features)

These results indicate that texture variety is more relevant than the number of textures used to build classification models. The lack of textural variety can be a result of either the texture selection method, such as a small number of textures in LINSPACE, or textural homogeneity within tracks, such as in datasets consisting of short clips. Both of these causes lead to less variety, which has a harmful effect on system generalization.

The next section presents concluding remarks.

6 Conclusion

In this work, we presented a study of the effect of texture selection in a variety of machine learning setups. We presented a novel texture selection method based on K-Means clustering, aiming to select diverse typical textures from each track. We evaluated four feature sets at different levels of abstraction, as well as a univariate feature selection strategy. Four publicly available datasets were used in the experiments. Our results show that texture collections improve classification results on the datasets comprising full-length tracks when compared to a full-track statistics (FTS) vector representation.

The results showed that our K-Means texture selection is able to achieve significant improvements over the FTS baseline using only 5 textures per track. The baseline strategy of selecting linearly-spaced textures required at least 20 textures to achieve similar results. K-Means texture selection has been shown to reduce the number of textures needed to train models while maintaining the same performance as using all textures from every track.

We also showed that texture variety in the training set is key to improving classification performance. Through a qualitative analysis based on PCA and t-SNE projections, we showed that KMEANSC extracts diverse textures from each track, while LINSPACE is more likely to select textures similar to the more common sounds within the track. As a result, LINSPACE needs more textures to capture texture variety and achieve a significant performance improvement.

The results discussed in this article can be used for building automatic music summaries. They can also be used as a basis for further exploration of texture selection methods for audio and music classification.


The authors would like to thank the University of Campinas and the Federal University of Technology – Paraná for supporting this research through the DINTER agreement. Thanks to Prof. Dr. Rogerio Aparecido Gonçalves for lending GPUs to support large-scale SVM training.

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Declarations of interest: none.


  • [1] G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5):293–302, Jul. 2002.
  • [2] Jean-Julien Aucouturier, Boris Defreville, and François Pachet.

    The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music.

    The Journal of the Acoustical Society of America, 122(2):881–891, 2007.
  • [3] G. Marques, T. Langlois, F. Gouyon, M. Lopes, and M. Sordo. Short-term feature space and Music Genre Classification. Journal of New Music Research, 40:127–137, 2011.
  • [4] C. M. Yeh, L. Su, and Y. Yang. Dual-layer bag-of-frames model for music genre classification. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 246–250, May 2013.
  • [5] Zhouyu Fu, Guojun Lu, Kai Ming Ting, and Dengsheng Zhang. Music classification via the bag-of-features approach. Pattern Recognition Letters, 32(14):1768 – 1777, 2011.
  • [6] Mikael Henaff, Kevin Jarrett, Koray Kavukcuoglu, and Yann LeCun. Unsupervised learning of sparse features for scalable audio classification. In 12th Proceedings of the International Conference on Music Information Retrieval, 2011.
  • [7] Philippe Hamel, Simon Lemieux, Yoshua Bengio, and Douglas Eck. Temporal pooling and multiscale learning for automatic annotation and ranking of music audio. In 12th Proceedings of the International Conference on Music Information Retrieval, 2011.
  • [8] Jan Wülfing and Martin A Riedmiller. Unsupervised learning of local features for music classification. In 13th Proceedings of the International Conference on Music Information Retrieval, 2012.
  • [9] Il-Young Jeong and Kyogu Lee. Learning temporal features using a deep neural network and its application to music genre classification. In 17th Proceedings of the International Conference on Music Information Retrieval, 2016.
  • [10] Yandre M.G. Costa, Luiz S. Oliveira, and Carlos N. Silla. An evaluation of convolutional neural networks for music classification using spectrograms. Applied Soft Computing, 52:28 – 38, 2017.
  • [11] B. K. Baniya, D. Ghimire, and J. Lee. Automatic music genre classification using timbral texture and rhythmic content features. In 2015 17th International Conference on Advanced Communication Technology (ICACT), pages 434–443, July 2015.
  • [12] K. Choi, G. Fazekas, M. Sandler, and K. Cho. Transfer learning for music classification and regression tasks. In Proceedings of the 18th ISMIR Conference, 2017.
  • [13] Jean-Julien Aucouturier and François Pachet. Improving timbre similarity: How high’s the sky? Journal of Negative Results in Speech and Audio Sciences, May 2004.
  • [14] M. Lopes, F. Gouyon, A. L. Koerich, and L. E. S. Oliveira. Selection of Training Instances for Music Genre Classification. In 20th International Conference on Pattern Recognition, pages 4569–4572, Aug 2010.
  • [15] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46(1):389–422, Jan 2002.
  • [16] J. Macqueen. Some Methods for Classification and Analysis of Multivariate Observations. In 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
  • [17] J. Lee and J. Nam. Multi-level and multi-scale feature aggregation using pretrained convolutional neural networks for music auto-tagging. IEEE Signal Processing Letters, 24(8):1208–1212, Aug 2017.
  • [18] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 6964–6968. IEEE, 2014.
  • [19] Mark A Davenport, Marco F Duarte, Michael B Wakin, Jason N Laska, Dharmpal Takhar, Kevin F Kelly, and Richard G Baraniuk. The smashed filter for compressive classification and target recognition. In Computational Imaging V, volume 6498, page 64980H. International Society for Optics and Photonics, 2007.
  • [20] R. G. Baraniuk, V. Cevher, and M. B. Wakin. Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective. Proceedings of the IEEE, 98(6):959–971, June 2010.
  • [21] William B Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, (26):189–206, 1984.
  • [22] Y. Panagakis, C. Kotropoulos, and G. R. Arce. Music genre classification via sparse representations of auditory temporal modulations. In 2009 17th European Signal Processing Conference, pages 1–5, Aug 2009.
  • [23] Kaichun K. Chang, Jyh-Shing Roger Jang, and Costas S. Iliopoulos. Music genre classification via compressive sampliaucouturier02ng. In Proceedings of the 11th ISMIR Conference, 2010.
  • [24] Mehdi Banitalebi-Dehkordi and Amin Banitalebi-Dehkordi. Music genre classification using spectral analysis and sparse representation of the signals. Journal of Signal Processing Systems, 74(2):273–280, Feb 2014.
  • [25] S. Dubnov. Generalization of spectral flatness measure for non-gaussian linear processes. IEEE Signal Processing Letters, 11(8):698–701, Aug 2004.
  • [26] M. Hunt, M. Lennig, and P. Mermelstein. Experiments in syllable-based recognition of continuous speech. In ICASSP ’80. IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 880–883, April 1980.
  • [27] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. The MIT Press, 2016.
  • [28] The International Society for Music Information Retrieval (ISMIR). ISMIR2004 Audio Description Contest-Genre/Artist ID Classification and Artist Similarity. http://ismir2004.ismir.net-/genrecontest, 2004.
  • [29] Carlos Silla, Alessandro L Koerich, and Celso A A Kaestner. The Latin Music Database. In Proceedings of the 9th International Conference on Music Information Retrieval, Philadelphia, PA, USA, 2008.
  • [30] Helge Homburg, Ingo Mierswa, Bülent Möller, Katharina Morik, and Michael Wurst. A benchmark dataset for audio classification and clustering. In 6th Proceedings of the International Conference on Music Information Retrieval, 2005.
  • [31] Bob L Sturm. The gtzan dataset: Its contents, its faults, their effects on evaluation, and its future use. arXiv preprint arXiv:1306.1461, 2013.
  • [32] Elias Pampalk, Arthur Flexer, and Gerhard Widmer. Improvements of audio-based music similarity and genre classificaton. In Proceedings of the 6th International Conference on Music Information Retrieval, 2005.
  • [33] Arthur Flexer. A closer look on artist filters for musical genre classification. In Proceedings of the 8th International Conference on Music Information Retrieval, pages 341–344, 01 2007.
  • [34] Bob L Sturm. A survey of evaluation in music genre recognition. In International Workshop on Adaptive Multimedia Retrieval, pages 29–66. Springer, 2012.
  • [35] Loris Nanni, Yandre M.G. Costa, Alessandra Lumini, Moo Young Kim, and Seung Ryul Baek. Combining visual and acoustic features for music genre classification. Expert Systems with Applications, 45:108–117, 2016.
  • [36] Zeyi Wen, Jiashuai Shi, Qinbin Li, Bingsheng He, and Jian Chen. ThunderSVM: A fast SVM library on GPUs and CPUs. Journal of Machine Learning Research, 19:1–5, 2018.
  • [37] Vladimir Vapnik. Statistical learning theory. Wiley, 1998.
  • [38] Tim Pohle, Dominik Schnitzer, Markus Schedl, Peter Knees, and Gerhard Widmer. On Rhythm and General Music Similarity. In 10th International Society for Music Information Retrieval Conference, pages 525–530, 2009.
  • [39] Shin Cheol Lim, Jong Seol Lee, Sei Jin Jang, Soek Pil Lee, and Moo Young Kim. Music-genre classification system based on spectro-temporal features and feature selection. IEEE Transactions on Consumer Electronics, 58(4):1262–1268, 2012.
  • [40] Klaus Seyerlehner. Content-Based Music Recommender Systems: Beyond simple Frame-Level Audio Similarity. PhD thesis, Johannes Kepler Universität Linz, Linz, Dec 2010.

Appendix A KNN and SVM Results for All Datasets

Figure 33: Average F1-score across 10 RANDOM Folds for the HOMBURG Dataset. Panels: (a) KNN; (c) SVM.
Figure 38: Average F1-score across 10 RANDOM Folds for the GTZAN Dataset. Panels: (a) KNN; (c) SVM.
Figure 43: Average F1-score across 3 ARTF Folds for the GTZAN Dataset. Panels: (a) KNN; (c) SVM.
Figure 52: Average F1-score across 3 Folds for the LMD Dataset. Panels: (a) KNN; (c) SVM; (e) KNN (10s); (f) KNN+ANOVA (10s); (g) SVM (10s); (h) SVM+ANOVA (10s).
Figure 61: F1-scores for the ISMIR Dataset. Panels: (a) KNN; (c) SVM; (e) KNN (10s); (f) KNN+ANOVA (10s); (g) SVM (10s); (h) SVM+ANOVA (10s).