Feature learning and deep learning have drawn great attention in recent years as a way of transforming input data into more effective representations using learning algorithms. Such interest has grown in the area of music information retrieval (MIR) as well, particularly in music audio classification tasks such as auto-tagging. In this paper, we present a two-stage learning model to effectively predict multiple labels from music audio. The first stage learns to project local spectral patterns of an audio track onto a high-dimensional sparse space in an unsupervised manner and summarizes the audio track as a bag-of-features. The second stage successively performs the unsupervised learning on the bag-of-features in a layer-by-layer manner to initialize a deep neural network and finally fine-tunes it with the tag labels. Through the experiment, we rigorously examine training choices and tuning parameters, and show that the model achieves high performance on Magnatagatune, a popularly used dataset in music auto-tagging.READ FULL TEXT VIEW PDF
In the recent past music has become ubiquitous as digital data. The scale of music collections that are readily accessible via online music services surpassed thirty million tracks111http://press.spotify.com/us/information/, accessed in Jan 23, 2015. The type of music content has been also diversified as social media services allow people to easily share their own original music, cover songs or other media sources. These significant changes in the music industry have prompted new strategies for delivering music content, for example, searching a large volume of songs with different query methods (e.g., text, humming or audio example) or recommending a playlist based on user preferences. A successful approach to these needs is using meta data, for example, finding similar songs based on analysis by music experts or collaborative filtering based on user data. However, the analysis by experts is costly and limited, given the large scale of available music tracks. User data are intrinsically biased by the popularity of songs or artists. As a way of making up for these limitations of meta data, the audio content itself have been exploited, i.e., by training a system to predict high-level information from the music audio files. This content-based approach has been actively explored in the area of music information retrieval (MIR). They are usually formed as an audio classification task that predicts a single label given categories (e.g. genre or emotion) or multiple labels in various aspects of music. The latter is often referred to as music annotation and retrieval, or simply called music auto-tagging.
These audio classification tasks are generally implemented through two steps; feature extraction and supervised learning. While the supervised learning step is usually handled by commonly used classifiers such as Gaussian mixture model (GMM) and support vector machines (SVM), the feature extraction step has been extensively studied based on domain knowledge. For example, Tzanetakis and Cook in their seminal work on music genre classification presented comprehensive signal processing techniques to extract audio features that represents timbral texture, rhythmic content and pitch content of music
. Specifically, they include low-level spectral summaries (e.g. centroid and roll-off), zero-crossings and mel-frequency cepstral coefficients (MFCC), a wavelet transform-based beat histogram and pitch/chroma histogram. McKinney and Breebaart suggested perceptual audio features based on psychoacoustic models, including estimates of roughness, loudness and sharpness, and auditory filterbank temporal envelopes. Similarly, a number of audio features have been proposed with different choices of time-frequency representations, psychoacoustic models and other signal processing techniques. Some of distinct audio features introduced in music classification include octave-based spectral contrast , Daubechies wavelet coefficient histogram , and auditory temporal modulation .
A common aspect of these audio features is that they are hand-engineered. In other words, individual computation steps to extract the features from audio signals are manually designed based on signal processing and/or acoustic knowledge. Although this hand-engineering approach has been successful to some degree, it has limitations in that, by nature, it may require numerous trial-and-error in the process of fine-tuning the computation steps. For this reason, many of previous work rather combine existing audio features, for example, by concatenating MFCC and other spectral features [6, 7, 8]
. However, they are usually heuristically chosen so that the combination can be redundant or still insufficient to explain music. Feature selection is a solution to finding an optimal combination but this is another challenge.
Recently there have been increasing interest in finding feature representations using data-driven learning algorithms, as an alternative to the hand-engineering approach. Inspired by research in computational neuroscience [10, 11]
, the machine learning community has developed a variety of learning algorithms that discover underlying structures of image or audio, and utilized them to represent features. This approach made it possible to overcome the limitations of the hand-engineering approach by learning manifold patterns automatically from data. This learning-based approach is broadly referred to asfeature learning or representation learning 
. In particular, hierarchical representation learning based on deep neural network (DNN) or convolutional neural network (CNN), calleddeep learning, achieved a remarkable series of successes in challenging machine recognition tasks, such as speech recognition  and image classification . The overview and recent work are reviewed in [12, 15]. The learning-based approach has gained great interest in the MIR community as well. Leveraging advances in the machine learning community, MIR researchers have investigated better ways of representing audio features and furthermore envisioned the approach as a general framework to build hierarchical music feature representations . In particular, the efforts have been made most actively for music genre classification or music auto-tagging. Using either unsupervised feature learning or deep learning, they have shown improved performance in the tasks.
In this paper, we present a two-stage learning model as an extension of our previous work . The first stage learns local features from multiple frames of audio spectra using sparse restricted Boltzmann machine (RBM) as before. However, we add an onset detection module to select temporally-aligned frames in training data. This is intended to decrease the variation in input space against random sampling. We show that this helps improving performance given the same condition. The second stage continues the bottom-up unsupervised learning by applying RBMs (but without sparsity) to the bag-of-features in a layer-by-layer manner. We use the RBM parameters to initialize a DNN and finally fine-tunes the network with the labels. We show that this pretraining improves the performance as observed in image classification or speech recognition tasks.
, we introduce datasets, evaluation metrics and experiment settings. In SectionV, we investigate the the evaluation results and compare them to those of state-of-the-arts algorithms in music auto-tagging. Lastly, we conclude by providing a summary of the work in Section VI.
In this section, we review previous work that exploited feature learning and deep learning for music classification and music auto-tagging. They can be divided into two groups, depending on whether the learning algorithm is unsupervised or supervised.
One group investigated unsupervised feature learning based on sparse representations, for example, using K-means[18, 17, 19, 20], sparse coding [21, 22, 23, 17, 24] and restricted Boltzmann machine (RBM) [25, 17]. The majority of them focused on capturing local structures of music data over one or multiple audio frames to learn high-dimensional single-layer features. They summarized the locally learned features as a bag-of-features (also called a bag-of-frames, for example, in ) and fed them into a separate classifier. The advantage of this single-layer feature learning is that it is quite simple to learn a large-size of feature bases and they generally provide good performance . In addition, it is easy to handle the variable length of audio tracks as they usually represent song-level features with summary statistics of the locally learned features (i.e. temporal pooling). However, this single-layer approach is limited to learning local features only. Some works worked on dual or multiple layers to capture segment-level features [28, 29]. Although they showed slight improvement by combining the local and segment-level features, learning hierarchical structures of music in an unsupervised way is highly challenging.
The second group used supervised learning that directly maps audio and labels via multi-layered neural networks. One approach was mapping single frames of spectrogram [30, 31, 32] or summarized spectrogram 
to labels via DNNs, where some of them pretrain the networks with deep belief networks[30, 31, 33]. They used the hidden-unit activations of DNNs as local audio features. While this frame-level audio-to-label mapping is somewhat counter-intuitive, the supervised approach makes the learned features more discriminative for the given task, being directly comparable to hand-engineered features such as MFCC. The other approach in this group used CNNs where the convolution setting can take longer audio frames and the networks directly predict labels [34, 35, 36, 37]
. CNNs has become the de-facto standard in image classification since the break-through in ImageNet challenge. As such, the CNN-based approach has shown great performance in music auto-tagging [35, 37]. However, in order to achieve high performance with CNNs, the model needs to be trained with a large dataset along with a huge number of parameters. Otherwise, CNNs is not necessarily better than the bag-of-features approach [19, 37].
Our approach is based on the bag-of-features in a single-layer unsupervised learning but extend it to a deep structure for song-level supervised learning. The idea behind this deep bag-of-features model is, while taking the simplicity and flexibility of the bag-of-features approach in unsupervised single-layer feature learning, improving the discriminative power using deep neural networks. Similar models were suggested using a different combination of algorithms, for example, K-means and multi-layer perceptrons (MLP) in[19, 20]. However, our proposed model performs unsupervised learning through all layers consistently using RBMs.
Figure 1 illustrates the overall data processing pipeline of our proposed model. In this section, we describe the individual processing blocks in details.
Musical signals are characterized well by note onsets and ensuing spectral patterns. We perform several steps of front-end processing to help learning algorithms effectively capture the features.
Musical signals are highly dynamic in amplitude. Being inspired by the dynamic-range compression mechanism in human ears, we control the amplitude as a first step. We adopt time-frequency automatic gain control which adjusts the levels of sub-band signals separately . Since we already showed the effectiveness in music auto-tagging , we use the automatic gain control as a default setting here.
We use mel-frequency spectrogram as a primary input representation. The mel-frequency spectrogram is computed by mapping 513 linear frequency bins from FFT to 128 mel-frequency bins. This mapping reduces input dimensionality sufficiently so as to take multiple frames as input data while preserving distinct patterns of spectrogram well.
The mel-frequency spectrogram is additionally compressed with a log-scale function, , where is the mel-frequency spectrogram and controls the degree of compression .
The local feature learning stage takes multiple frames as input data so that learning algorithms can capture spectro-temporal patterns, for example, sustaining or chirping harmonic overtones, or transient changes over time. We already showed that using multiple frames for feature learning improves the performance in music auto-tagging . We further develop this data selection scheme by considering where to take multiple frames on the time axis. In the previous work, we sampled multiple frames at random positions on the mel-frequency spectrogram without considering the characteristics of musical sounds. Therefore, given a single note, it could sample audio frames such that the note onset is located at arbitrary positions within the sampled frames or only sustain part of a note is taken. This may increase unnecessary variations or lose the chance of capturing important temporal dependency in the view of learning algorithm. In order to address this problem, we suggest sampling multiple frames based on the guidance of note onset. That is, we compute an onset detection function as a separate path and take a sample of multiple frames at the positions that the onset detection function has high values for a short segment. As illustrated in Figure 2, local spectral structures of musical sounds tend to be more distinctive when the onset strength is high. Sampled audio frames this way are likely to be aligned to each other with regard to notes, which may encourage learning algorithms to learn features more effectively. We term this sampling scheme onset-based sampling and will evaluate it in our experiment. The onset detection function is computed on a separate path by mapping the spectrogram on to 40 sub-bands and summing the half-wave rectified spectral flux over the sub-bands.
This stage first learns feature bases using the sampled data and learning algorithms. Then, it extracts the feature activations in a convolutional manner for each audio track and summarizes them as a bag-of-features using max-pooling and averaging.
PCA whitening is often used as a preprocessing step to remove pair-wise correlations (i.e. second-order dependence) or reduce the dimensionality before applying algorithms that capture higher-order dependencies . The PCA whitening matrix is computed by applying PCA to the sampled data and normalizing the output in the PCA space. Note that we locate PCA whitening as part of local feature learning in Figure 1 because the whitening matrix is actually “learned” by the sampled data.
Sparse RBM is the core algorithm that performs local feature learning in the bag-of-features model. In our previous work, we compared K-means, sparse coding and sparse RBM in terms of performance of music auto-tagging . Although there was not much difference, sparse RBM worked slightly better than others and the feed-forward computation for the hidden units in the RBM allows fast prediction in the testing phase. Thus, we focus only the sparse RBM here and more formally review the algorithm in the following paragraphs.
Sparse RBM is a variation of RBM which is a bipartite undirected graphical model that consists of visible nodes v and hidden nodes h . The visible nodes correspond to input vectors in a training set and the hidden nodes correspond to represented features. The basic form of RBM has binary units for both visible and hidden nodes, termed binary-binary
RBM. The joint probability ofv and h is defined by an energy function :
where b and c are bias terms, and W is a weight matrix. The normalization factor is called the partition function, which is obtained by summing all possible configurations of v and h. For real-valued data such as spectrogram, Gaussian units are frequently used for the visible nodes. Then, the energy function in Equation 1 is modified to:
where the additional quadratic term,
is associated with covariance between input units assuming that the Gaussian units have unit variances. This form is calledGaussian-binary RBM .
The RBM has symmetric connections between the two layers but no connections within the hidden nodes or visible nodes. This conditional independence makes it easy to compute the conditional probability distributions, when nodes in either layer are observed:
where is the logistic function and
is the Gaussian distribution. These can be directly derived from Equation1 and 2.
The model parameters of RBM are estimated by taking derivative of the log-likelihood with regard to each parameter and then updating them using gradient descent. The update rules for weight matrix and bias terms are obtained from Equation 1:
where is the learning rate and the angle brackets denote expectation with respect to the distributions from the training data and the model. While can be easily obtained, exact computation of is intractable. In practice, the learning rules in Equation 6 converges well only with a single iteration of block Gibbs sampling when it starts by setting the states of the visible units to the training data. This approximation is called the contrastive-divergence .
This parameter estimation is solely based on maximum likelihood and so it is prone to overfitting to the training data. As a way of improving generalization to new data , the maximum likelihood estimation is penalized with additional terms called weight-decay. The typical choice is norm, which is half of the sum of the squared weights. Taking the derivative, the weight update rule in Equation 6 is modified to:
where is called weight-cost and controls the strength of the weight-decay.
Sparse RBM is trained with an additional constraint on the update rules, which we call sparsity. We impose the sparsity on hidden units of a Gaussian-binary RBM based on the technique in . The idea is to add a penalty term that minimizes a deviation of the mean activation of hidden units from a target sparsity level. Instead of directly applying the gradient descent to that, they exploit the contrastive-divergence update rule and so simply added it to the update rule of bias term
. This controls the hidden-unit activations as a shift term of the sigmoid function in Equation4. As a result, the bias update rule in Equation 8 is modified to:
where is the training set, determines the target sparsity of the hidden-unit activations and controls the strength.
Once we train a sparse RBM from the sampled data, we fix the learned parameters and extract the hidden-unit activations in a convolutional manner for an audio track. Following our previous work, we summarize the local features via max-pooling and averaging. Max-pooling has proved to be an effective choice to summarize local features [34, 19]. It works as a form of temporal masking because it discards small activations around high peaks. We further summarize the max-pooled feature activations with averaging. This produces a bag-of-features that represents a histogram of dominant local feature activations.
This stage performs supervised learning to predict tags from the bag-of-features. Using a deep neural network (DNN), we build a deep bag-of-features
representation that maps the complex relations between the summarized acoustic features and semantic labels. We configure the DNN to have up to three hidden layers and rectified linear units (ReLUs) for the nonlinear function as shown in Figure3. The ReLUs have proved to be highly effective in the DNN training when used with the dropout regularization [45, 46, 32], and also much faster than other nonlinear functions such as sigmoid in computation. We first pretrain the DNN by a stack of RBMs and then fine-tune it using tag labels. The output layer works as multiple independent binary classifiers. Each output unit corresponds to a tag label and predicts whether the audio track is labeled with it or not.
Pretraining is an unsupervised approach to better initialize the DNN . Although recent advances have shown that pretraining is not necessary when the number of labeled training samples is sufficient [14, 47], we conduct the pretraining to verify the necessity in our experiment setting.
We perform the pretraining by greedy layer-wise learning of RBMs with ReLUs to make learned parameters compatiable with the nonlinearity in the DNN. The ReLUs in the RBMs can be viewed as the sum of an infinite number of binary units that share weights and have shifted versions of the same bias . This can be approximated to a single unit with the nonlinearity. Furthermore, Gibbs sampling for the ReLUs during the training can be performed by taking samples from where is Gaussian noise with zero mean and variance . We use the ReLU for both visible and hidden nodes of the stacked RBMs. However, for the bottom RBM that takes the bag-of-features as input data, we use binary units for visible nodes and ReLU for hidden notes to make them compatible with the scale of the bag-of-features.
After initializing the DNN with the weight and bias learned from the RBMs, we fine-tune them with tag labels using the error back-propagation. We predict the output by adding the output layer (i.e. weight and bias) to the last hidden layer and taking the sigmoid function to define the error as cross-entropy between the prediction and ground truth for bag-of-features and tag :
We update a set of parameters using AdaDelta. The method requires no manual tuning for the learning rate and is robust to noisy gradient information and variations in model architecture . In addition, we use dropout, a powerful technique that improves the generalization error of large neural networks by setting zeros to hidden and input layers randomly . We find AdaDelta and dropout essential to achieve good performance.
In the section, we introduce the dataset and evaluation metrics used in our experiments. Also, we describe experiment settings for the proposed model.
We use the Magnatagatune dataset, which contains 29-second MP3 files with annotations collected from an online game . The dataset is the MIREX 2009 version used in [34, 35]. It is split into 14660, 1629 and 6499 clips, respectively for training, validation and test, following the prior work. The clips are annotated with a set of 160 tags.
We first convert the MP3 files to the WAV format and resample them to 22.05kHz. We then compute their spectrogram with a 46ms Hann window and 50% overlap, on which the time-frequency automatic gain control using the technique in  is applied. This equalizes the spectrogram using spectral envelopes computed over 10 sub-bands. We convert the equalized spectrogram to mel-frequency spectrogram with 128 bins and finally compress the magnitude by fixing the strength to 10.
The first step in this stage is to train PCA (for whitening) and sparse RBM. Each training sample is a spectral block comprised of multiple consecutive frames from the mel-frequency spectrogram. We gather training data (total 200,000 samples) by taking one spectral block every second at a random position or using the onset detection function. The number of frames in the spectral block varies from 2, 4, 6, 8 to 10 and we evaluate them separately. We obtain the PCA whitening matrix retaining 90% of the variance to reduce the dimensionality and then train the sparse RBM with a learning rate of 0.03, a hidden-layer size of 1024 and different values of target sparsity from 0.007, 0.01, 0.02 to 0.03. Once we learn the PCA whitening matrix and RBM weight, we extract hidden-unit activations from an audio track in a convolutional manner and summarize them into a bag-of-features with max-pooling over segments with 0.25, 0.5, 1, 2 and 4 seconds.
Since this stage creates a large number of possibilities in obtaining a bag-of-features, we reduce the number of adjustable parameters before proceeding with song-level supervised learning. Among others, we fix the number of frames in the spectral block and data sampling scheme, which are related to collecting the sample data. We find a reasonable setting for them using a simple linear classifier that minimizes the same cross-entropy in Equation 11
(i.e. logistic regression).
We first pretrain the DNN with RBMs and then fine-tune the networks. We fix the hidden-layer size to 512 and adjust the number of hidden layers from 1 to 3 to verify the effect of larger networks. In training ReLU-ReLU RBMs, we set the learning rate to a small value (0.003) in order to avoid unstable dynamics in the weight updates . We also adjust the weight-cost in training RBMs from 0.001, 0.01 to 0.1, separately for each hidden layer. We fine-tune the pretrained networks, using Deepmat, a Matlab library for deep learning222https://github.com/kyunghyuncho/deepmat. This library includes an implementation of AdaDelta and dropout, and supports GPU processing. In order to validate the proposed model, we compare it to DNNs with random initialization and also the same model but with ReLU units for the visible layer of the bottom RBM333Our experiment code is available at https://github.com/juhannam/deepbof
In this section, we examine training choices and tuning parameters in the experiments, and finally compare them to state-of-the-art results.
We compare random sampling with onset-based sampling in the context of finding an optimal number of frames in local feature learning. In order to prevent the experiment from being too exhausting, we chose logistic regression as a classifier instead of the DNN. Figure 4 shows the evaluation results. In random sampling, the AUC increases up to 6 frames and then slowly decays. A similar trend is shown in onset-based sampling. However, the AUC saturates in a higher level, indicating that onset-based sampling is more effective for the local feature learning. In the following experiments, we fix the number of frames to 8 as it provides the highest AUC in terms of median.
Figure 5 shows the evaluation results for different numbers of hidden layers when the DNN is randomly initialized or pretrained with ReLU RBMs. When the networks has a single hidden layer, there is no significant difference in AUC level. As the number of hidden layers increases in the DNN, however, pretrained networks apparently outperform randomly initialized networks. This result is interesting, when recalling recent observations that pretraining is not necessary when the number of labeled training samples is sufficient. Thus, the result may indicate that the size of labeled data is not large enough in our experiment. However, we need to note that the auto-tagging task is formed as a multiple binary classification problem, which is different from choosing one label exclusively, and furthermore the levels of abstraction in the tag labels are not homogenous (e.g. including mood and instruments). In addition, there is some recent work that pretraining is still useful .
Figure 6 shows the evaluation results for different target sparsity and max-pooling sizes in the bag-of-features when we use a pretrained DNN with three hidden layers. The best results are achieved when target sparsity is 0.02 and max-pooling size is 1 or 2 second. Compared to our previous work , the optimal target sparsity has not changed whereas the optimal max-pooling size is significantly reduced. Considering we used 30 second segments in the Maganatagatune dataset against the full audio tracks in the CAL500 datasets (typically 3-4 minute long), the optimal max-pooling size seems to be proportional to the length of audio tracks.
We adjust weight-cost in training the RBM with three different values. Since this exponentially increases the number of networks to train as we stack up RBMs, a brute-force search for an optimal setting of weight-costs becomes very time-consuming. For example, when we have three hidden layers, we should fine-tune 27 different instances of pretrained networks. From our experiments, however, we observed that the good results tend to be obtained when the bottom layer has a small weight-cost and upper layers have progressively greater weight-costs. In order to validate the observation, we plot the statistics of AUC for a given weight-cost at each layer in Figure 7. For example, the left-most boxplot is computed from all combinations of weight-costs when the weight-cost in the first-layer RBM (L1) is fixed to 0.001 (this includes 9 combinations of weight-cost for three hidden layers. We count them for all different target sparsity and max-pooling size). For the first layer, the AUC goes up when the weight-cost is smaller. However, the trend becomes weaker through the second layer (L2) and goes opposite for the third layer (L3); the best AUC in median is obtained when the weight-cost is 0.1 for the third layer, even though the difference it slight. This result implies that it is better to encourage “maximum likelihood” for the first layer by having a small weight-cost and regulate it for upper layers by having progressively greater weight-costs. This is plausible when considering the level of abstraction in the DNN that goes from acoustic feature summaries to semantic words.
Based on this observation, we suggest a special condition for the weight-cost setting to reduce the number of pretraining instances. That is, we set the weight-cost to a small value (=0.001) for the first layer and an equal or increasing value for upper layers. Figure 8 compares the special condition denoted as “WC_Inc” to the best result and fixed settings for all layers. “WC_Inc” achieves the best result in three out of four and it always outperforms the three fixed setting. This shows that, with the special condition for the weight-cost setting, we can save significant amount of training time while achieving high performance.
We lastly compare our proposed model to previous state-of-the-art algorithms in music auto-tagging. Since we use the MIREX 2009 version of Magnatagatune dataset for which Hamel et. al. achieved the best performance [34, 35], we place their evaluation results only in Table I. They also used deep neural networks with a special preprocessing of mel-frequency spectrogram. However, our deep bag-of-features model outperforms them for all evaluation metrics.
We presented a deep bag-of-feature model for music auto-tagging. The model learns a large dictionary of local feature bases on multiple frames selected by onset-based sampling and summarizes an audio track as a bag of learned audio features via max-pooling and averaging. Furthermore, it pre-trains and fine-tunes the DNN to predict the tags. The deep bag-of-feature model can be seen as a special case of deep convolutional neural networks as it has a convolution and pooling layer, where the local features are extracted and summarized, and has three fully connected layers. As future work, we will move on more general CNN models used in computer vision and train them with large-scale datasets.
|Multi PMSCs ||0.870||0.949||0.481||0.339||0.263||0.216||0.184|
This work was supported by Korea Advanced Institute of Science and Technology (Project No. G04140049).
Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Non-negative multilinear principal component analysis of auditory temporal modulations for music genre classification,”IEEE Transaction on Audio, Speech and Language Processing, 2010.
A. van den Oord, S. Dieleman, and B. Schrauwen, “Transfer learning by supervised pre-training for audio-based music classification,” inProceedings of the 15th International Conference on Music Information Retrieval (ISMIR), 2014.