Automating feature extraction is an active research field that aims to learn enhanced representations directly from raw data rather than handcrafting them. Neural network based architectures have been used in this regard for both image and sound recognition. These architectures are usually adopted for sound recognition after they gain wide acceptance in other application domains such as image recognition. For example, stacked Restricted Boltzmann Machines (RBM) forming a Deep Belief Net (DBN) were initially introduced to showcase the capability of stacked generative layers as a dimensionality reduction technique applied to images of handwritten digits; later, Hamel et al. used a DBN to extract features for a music genre classification task. Similarly, Convolutional Neural Networks (CNN) were initially introduced in the work of LeCun et al. for images, and later attempts followed to use them for sound [28, 11, 30].
Despite the success of these architectures for images, they are not designed to exploit the time-frequency representation of sound efficiently. For example, DBNs ignore the inter-frame relation by treating a spectrogram's frames in isolation from their neighbors, and CNNs depend on weight sharing, which does not preserve the spatial locality of the learned features.
The ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN) are designed to preserve the spatial locality of the learned features: there is a dedicated link for every feature in a feature vector, in contrast to the weight sharing of the CNN. The CLNN preserves the temporal relation between the frames by considering a window rather than the isolated frame used in the RBM, and the mask in the MCLNN enforces a systematic sparseness over the network's links. The mask design follows a band-like pattern, which allows the network to be frequency shift-invariant, mimicking a filterbank. Additionally, the mask explores several feature combinations concurrently, analogous to handcrafting the optimum combination of features through a mix-and-match operation, while preserving the spatial locality of the features.
2 Related Models
The Conditional Restricted Boltzmann Machine (CRBM) by Taylor et al. extended the RBM to the temporal dimension, allowing an RBM to learn over a temporal window of frames rather than being trained on static bags of frames. To fulfill this aim, the CRBM added conditional links to capture the influence of the previous frames on the current one. Fig. 2 shows a CRBM layer, where the normal RBM is represented by the bidirectional connections between the visible vector and the hidden nodes. Directed conditional links connect the previous visible vectors to the hidden layer, and additional links capture the autoregressive relation from the previous visible vectors to the current one. Layers of a CRBM can be stacked over each other similar to a DBN, and Taylor et al. trained a CRBM to model human motion over a multichannel signal of human joint activity. Mohamed et al. extended the CRBM with the Interpolating Conditional Restricted Boltzmann Machine (ICRBM), which showed enhanced performance on phoneme recognition by including the influence of the future frames in addition to the past ones. The work of Battenberg et al. was another attempt to use the CRBM for sound, where they used it to analyze drum patterns.
Similar modifications were introduced to the CNN to fit the time-frequency representation. The CNN architecture, shown in Fig. 2, is based on two primary operations: convolution and pooling. The convolution operation scans the 2-dimensional representation with a small weight matrix (or filter), where a form of weighted sum is generated from the element-wise multiplication between the filter and the region of the image being scanned. The output of each step of the filter is a scalar value positioned in a new representation of the image known as the feature map. The convolutional layer generates several feature maps, one per filter. Mean or max pooling follows the convolution to reduce the resolution of the feature maps. These two operations are consecutively repeated to form a deep CNN architecture, where the output of the final layer is flattened to a single feature vector and fed to a fully connected neural network for the final classification. The CNN depends on weight sharing, which scales well to large images without requiring a dedicated weight between each pixel and the network's hidden layer. However, weight sharing does not preserve the spatial locality of the learned features, which is acceptable for images but not for time-frequency representations, since the frequency at which a feature is detected is itself a property that distinguishes sounds. The work of Abdel-Hamid et al. approached this problem by redesigning the convolutional filters to operate over frequency bands. Another attempt proposed using separate filters to convolve each of the time and frequency dimensions, combined in the same model.
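The two CNN operations described above can be illustrated with a minimal numpy sketch; the function names and toy sizes are ours, not from any particular framework:

```python
import numpy as np

def conv2d_valid(image, filt):
    """Valid 2-D convolution: slide the filter over the image and take an
    element-wise product-sum at each position, yielding a feature map."""
    ih, iw = image.shape
    fh, fw = filt.shape
    out = np.empty((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling that reduces each dimension by `size`."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

spec = np.random.rand(16, 16)                  # toy time-frequency patch
fmap = conv2d_valid(spec, np.random.rand(3, 3))  # (14, 14) feature map
pooled = max_pool(fmap)                          # (7, 7) after pooling
```

A real CNN learns the filter weights by backpropagation; the sketch only shows the forward scanning and pooling arithmetic.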
The Masked ConditionaL Neural Network (MCLNN) was introduced together with an analysis of the influence of the data split on model accuracy. In this work, we further evaluate the MCLNN performance on the music genre classification task.
3 Conditional Neural Networks
The ConditionaL Neural Network (CLNN) is a discriminative model that extends the generative Conditional Restricted Boltzmann Machine (CRBM) discussed earlier. The CLNN adopts the conditional previous-visible-to-hidden links proposed in the CRBM, and it further extends the connections to the future frames as presented in the ICRBM.
The CLNN is formed of a vector-shaped hidden layer, similar to a conventional multi-layer perceptron, having e dimensions. The input layer accepts a number of frames in a window of size d, where the window's middle frame is conditioned on the past and future frames. The width of the window follows (1)

d = 2n + 1    (1)

where the 1 refers to the window's central frame and n is the number of neighboring frames on either side of the middle one (the 2 accounts for the past and future directions). There are dense connections between each vector in the input window and the hidden layer. Accordingly, there are d weight matrices forming a tensor of dimensions [feature vector length l, hidden layer width e, window depth d]. Each vector of length l in the input window of size d has a corresponding dedicated weight matrix in the weight tensor. The new vectors generated from the vector-matrix multiplication between each feature vector and its corresponding weight matrix are summed together feature-wise before applying a nonlinear transformation. The activation of a hidden node is given in (2)
y_{j,t} = f( b_j + Σ_{u=-n}^{n} Σ_{i=1}^{l} x_{i,t+u} · W_{i,j,u} )    (2)

where y_{j,t} is the activation at node j of the hidden layer for the window's middle frame at index t of the segment. The segment, discussed later in detail, is a chunk of frames of a minimum size equal to the window. f is the transfer function and b_j is the bias at node j. x_{i,t+u} is the i-th feature of the feature vector x_{t+u}. u refers to the index within the window and t refers to the window's middle frame (having u = 0 in the window), which is at the same time the index of the middle frame in the input segment. W_{i,j,u} is the weight between the i-th feature of the vector at position u in the window and the j-th neuron in the hidden layer; u is both the index of a frame in the window and the index of its corresponding weight matrix in the weight tensor. The hidden layer activation can be reformulated in vector form in (3).
y_t = f( b + Σ_{u=-n}^{n} x_{t+u} · W^{u} )    (3)

where y_t, the hidden layer activation vector for the window's middle frame conditioned on the n neighboring frames in either direction, is given by the transfer function f, the bias vector b, and the multiplication between the feature vector x_{t+u} at index t + u and its corresponding weight matrix W^{u} at the same index. The number of matrices in the weight tensor is equal to d, matching the number of frames in the window, where each frame is processed by its dedicated matrix. The corresponding conditional distribution is p(y_t | x_{t-n}, ..., x_{t+n}), where f is a logistic function such as a Sigmoid, or a Softmax at the output layer.
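As a rough illustration of the conditional activation, a single CLNN step can be sketched in numpy as follows; the function name and toy dimensions are ours, and tanh stands in for the unspecified transfer function:

```python
import numpy as np

def clnn_step(segment, t, W, b, n):
    """One CLNN activation: the hidden vector for the frame at index t,
    conditioned on its n past and n future neighbors.
    segment: (frames, l), W: (2n + 1, l, e), b: (e,)."""
    z = b.copy()
    for u in range(-n, n + 1):
        # each frame in the window has its own dedicated weight matrix
        z += segment[t + u] @ W[u + n]
    return np.tanh(z)  # a squashing transfer function as a stand-in

l, e, n = 8, 4, 2
rng = np.random.default_rng(0)
seg = rng.standard_normal((2 * n + 1, l))   # minimal segment: d frames
W = rng.standard_normal((2 * n + 1, l, e))
y = clnn_step(seg, n, W, np.zeros(e), n)    # hidden vector of length e
```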
Fig. 3 shows two CLNN layers of order n followed by a global pooling layer that aggregates the features over k extra frames before feeding them to a fully connected network for classification. Each CLNN layer consumes 2n frames, generating fewer frames than it receives. Accordingly, a CLNN is trained over segments of a size following (4)

q = 2 n m + k    (4)

where q is the segment size, the order n is the number of frames in a single direction (the 2 accounts for the past and future frames), m is the number of layers, and k is the number of extra frames to be pooled across beyond the CLNN layers. For example, at m = 3 layers of order n with k extra frames, a segment of size 6n + k frames is presented at the input of the first CLNN layer. The second CLNN layer will receive 4n + k vectors at its input and consequently will generate 2n + k vectors as an output. Similarly, the third layer will generate k vectors, which undergo flattening or pooling to a single vector before the fully-connected layers.
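The segment-size arithmetic can be checked with a short sketch; the example values n = 4, m = 3, k = 10 are purely illustrative, not taken from the experiments:

```python
def segment_size(n, m, k):
    """Frames required at the input of an m-layer CLNN of order n,
    with k extra frames left over for pooling: q = 2*n*m + k."""
    return 2 * n * m + k

def frames_per_layer(n, m, k):
    """Frame count entering each layer; every CLNN layer trims 2n frames."""
    q = segment_size(n, m, k)
    return [q - 2 * n * i for i in range(m + 1)]

# illustrative values: order n=4, m=3 layers, k=10 extra frames
print(frames_per_layer(4, 3, 10))  # [34, 26, 18, 10]
```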
4 Masked Conditional Neural Networks
Spectrograms represent the energy at different frequency bins as the signal progresses through time. Despite the usefulness of such representations for signal analysis, they are susceptible to frequency shifts, which can produce different spectral representations for very similar sounds. A frequency shift involves a smearing of the energy of a frequency bin across nearby bins due to uncontrolled factors affecting the signal propagation. Filterbanks tackle the frequency shifts in raw spectrograms. A filterbank is a group of filters used to subdivide the spectrogram into frequency bands, allowing the new representation to be frequency shift-invariant. Filterbanks are the principal operating component of Mel-scaled transformations such as the MFCC. The Masked ConditionaL Neural Network (MCLNN) embeds a filterbank-like behaviour within the network by enforcing a systematic sparseness over the network's links that follows a band-like pattern.
The mask design is controlled by two tunable hyper-parameters: the Bandwidth and the Overlap. Fig. 4.a. shows a masking pattern with a Bandwidth of 5 and an Overlap of 3. The Bandwidth values refer to the successive 1’s in a column, and the Overlap refers to the superposition of the patterns between one column and another. Fig. 4.b. depicts the active connections following the mask in Fig. 4.a. Each neuron in the hidden layer of Fig. 4.b. has a focused spatial region of the feature vector to observe. Fig. 4.c. shows a mask with a negative overlap depicting the non-overlapping distance between two columns. The linear indexing of the binary values of a mask is formulated in (5)
lx = a + (g − 1)(l + (bw − ov))    (5)

where the linear index lx is given by the bandwidth bw, the overlap ov, and the feature vector length l; a takes the values in [1, bw] and g is in the interval [1, ⌈(l × e) / (l + (bw − ov))⌉], with e the hidden layer width. The mask plays another role of exploring a range of feature combinations, analogous to handcrafting the optimum feature combinations. This operation is applied in the MCLNN for several feature combinations concurrently, as shown in Fig. 4.c., where the 2nd set of three columns holds a shifted version of the 1st three columns, and similarly for the 3rd set. On closer analysis, each hidden node (mapped to a column in the mask) observes a different input. For example, the input to the 1st node is the first three features of the feature vector, the 4th node's input is the first two features, and the 7th node's input is the first feature. The masking is applied through an element-wise multiplication following (6).
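A sketch of the band-like mask construction from the Bandwidth and Overlap hyper-parameters; the column-major linear indexing used here is our reading of the pattern in Fig. 4, and the function name is illustrative:

```python
import numpy as np

def make_mask(l, e, bw, ov):
    """Binary mask of shape (l, e): a band of bw ones per column, each
    column's band shifted bw - ov rows below the previous one (ov may be
    negative for a gap). Ones sit at the 1-based linear indices
    lx = a + (g - 1) * (l + (bw - ov)) of the column-major flattened mask."""
    flat = np.zeros(l * e)
    step = l + (bw - ov)
    g = 1
    while (g - 1) * step < l * e:
        for a in range(1, bw + 1):
            lx = a + (g - 1) * step
            if lx <= l * e:
                flat[lx - 1] = 1.0
        g += 1
    return flat.reshape((l, e), order="F")  # column-major back to (l, e)

M = make_mask(l=12, e=9, bw=5, ov=3)  # band of 5 ones, shifted by 2 per column
```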
Ŵ_u = W_u ∘ M    (6)

where W_u is the original weight matrix at index u, M is the masking pattern, and Ŵ_u is the masked weight matrix that replaces the original one in (3).
Fig. 5 shows a single MCLNN step, where a window of frames of size d is processed with a matching count of weight matrices. Each frame in the window has a corresponding matrix to be processed with. The vector-matrix multiplications generate d new vectors, which are summed feature-wise before applying the nonlinearity of the transfer function. The output of a single step over the window is a single resultant frame. The highlighted cells in each matrix depict the active links enforced through the mask.
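The masked step then amounts to masking each weight matrix element-wise before the conditional summation; a minimal sketch with a random stand-in mask (names and sizes are ours):

```python
import numpy as np

def mclnn_step(window, W, mask, b):
    """One MCLNN step: mask every weight matrix element-wise, then sum the
    per-frame projections of the window feature-wise and squash.
    window: (d, l), W: (d, l, e), mask: (l, e), b: (e,)."""
    z = b + sum(frame @ (Wu * mask) for frame, Wu in zip(window, W))
    return np.tanh(z)  # stand-in transfer function

d, l, e = 5, 12, 9
rng = np.random.default_rng(1)
out = mclnn_step(
    rng.standard_normal((d, l)),
    rng.standard_normal((d, l, e)),
    (rng.random((l, e)) > 0.5).astype(float),  # random stand-in binary mask
    np.zeros(e),
)  # a single resultant frame of length e
```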
The Ballroom dataset is composed of 698 music clips of 30 seconds each, unevenly partitioned across 8 music genres: Cha Cha (CC), Jive (Ji), Quickstep (Qs), Rumba (Ru), Samba (Sa), Tango (Ta), Viennese Waltz (VW) and Slow Waltz (SW).
The Homburg dataset contains 1886 music clips of 10 seconds each, distributed across 9 classes: Alternative (Al), Blues (Bl), Electronic (El), FolkCountry (FC), FunkSoulRnb (FS), Jazz (Ja), Pop (Po), RapHiphop (RH) and Rock (Ro).
All files for both datasets were transformed to a logarithmic mel-scaled spectrogram of 256 bins using an FFT of 2048 and a hop size of 1024. Segments were extracted following (4), and the z-score parameters of the training data were used to standardize the testing and validation sets. Experiments were carried out using 10-fold cross-validation, with the mean accuracy across the folds reported. The hyper-parameters used for the MCLNN are listed in Table 1.
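The standardization and segment extraction can be sketched as follows (numpy only; the mel-spectrogram computation itself is omitted, and the hop of one frame between segments is an assumption of the sketch):

```python
import numpy as np

def standardize(train, *others):
    """Z-score using the training statistics only, applied to all splits."""
    mu, sigma = train.mean(axis=0), train.std(axis=0) + 1e-8
    return [(s - mu) / sigma for s in (train, *others)]

def extract_segments(spec, q):
    """Slice a (frames, mel bins) spectrogram into overlapping training
    segments of q frames each, hopping one frame at a time."""
    return np.stack([spec[i:i + q] for i in range(len(spec) - q + 1)])

spec = np.random.rand(100, 256)      # stand-in log mel spectrogram
segs = extract_segments(spec, q=34)  # (67, 34, 256)
```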
The two MCLNN layers are followed by a global single-dimension pooling layer to pool feature-wise over the extra frames. The global pooling emulates the aggregation over a musical texture window, which was studied by Bergstra et al. Different values of the extra frames k were used for the Ballroom and the Homburg, respectively. Two densely connected layers of 50 and 10 nodes followed the global pooling layer, before the final Softmax. The model was trained using ADAM to minimize the categorical cross-entropy between the predicted vector and the target label. Dropout was used as a regularizer. The final decision of the clip's category is made using probability voting across the frames of the clip.
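Probability voting over a clip can be sketched as summing the per-segment class probabilities and taking the argmax; this is a common reading of probability voting, and the exact aggregation used is ours:

```python
import numpy as np

def probability_vote(frame_probs):
    """Clip-level decision: sum the per-segment class probabilities over
    the whole clip and pick the class with the largest total."""
    return int(np.argmax(frame_probs.sum(axis=0)))

probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.5, 0.2, 0.3]])  # 3 segments x 3 classes
label = probability_vote(probs)      # class 0 wins: 1.3 vs 1.0 vs 0.7
```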
Table 2. Accuracies reported on the Ballroom dataset.

| Classifier and Features | Accuracy (%) |
|---|---|
| SVM + 28 features, Tempo | 96.13 |
| KNN + Modulation Scale Spec. | 93.12 |
| Manhattan Dist. + Block-Level feat. | 92.44 |
| MCLNN + Mel-Spec. (this work) | 90.40 |
| SVM + Rhyth., Hist., Stat., Onset, etc. | 90.40 |
| KNN + 15 MFCC-like desc., Tempo | 90.10 |
| KNN + Rhythm and Timbre | 89.20 |
| SVM + 28 features without Tempo | 88.00 |
| CNN + Mel-Scaled Spectrogram | 87.68 |
| SVM + Rhyth., Hist., Statist. | 84.20 |
| KNN + Tempo | 82.30 |
Table 3. Accuracies reported on the Homburg dataset.

| Classifier and Features | Accuracy (%) |
|---|---|
| JSLRR + Cortical Representations | 63.46 |
| MCLNN + Mel-Spec. (this work) | 61.45 |
| KNN + Rhythm and Timbre | 57.00 |
| SVM + Novelty Functions | 51.10 |
As listed in Table 2 and Table 3, the MCLNN achieved accuracies of 90.4% and 61.45% on the Ballroom and the Homburg, respectively, which surpasses several neural network based architectures in addition to hand-crafted attempts on both datasets. The MCLNN achieved these accuracies without a special design to exploit musical perceptual properties, in contrast to other attempts. Peeters achieved 96.13% on the Ballroom using the Tempo annotations released with the dataset; reapplying his proposed handcrafted features without the Tempo data, the accuracy dropped to 88.0%, which shows the influence of the tempo annotations. In a similar type of analysis, Gouyon et al. used the Tempo annotations as a baseline to benchmark their proposed handcrafted features, where the Tempo annotations alone achieved 82.3% and their proposed features combined with the Tempo achieved 90.1%. The work of Marchand et al. achieved 93.12% using multiple processing stages, including onset energy calculation, autocorrelation, modulation scale spectra and dimensionality reduction, to exploit the rhythmic patterns in a music clip. Seyerlehner et al. achieved 92.44% using several features extracted from blocks of the spectrogram. A neural network based attempt in the work of Pons et al. achieved 87.68% using a shallow CNN architecture with pre-trained filters convolving the time and spectral dimensions separately in the same model. Handcrafted features for the Homburg dataset have been explored as well. The work of Panagakis et al. achieved 63.46% using auditory cortical representations in combination with their introduced classifier; their work also reports the accuracy achieved on the Ballroom dataset using the same features (cortical representations) and classifier used for the Homburg. Another attempt on the Homburg dataset used auditory cortical representations, MFCC and Chroma as features. A further neural network based attempt on the Homburg dataset, in the work of Schluter et al., applied the mcRBM, a variant of the RBM, to a mel-spectrogram.
The confusion matrices for the Ballroom and the Homburg datasets show high confusion of the Rumba and the Waltz genres with the Slow Waltz, which overlaps with previously reported findings. For the Homburg dataset, less confusion is noticed for genre categories with more samples available.
6 Conclusions and Future work
In this work, we have explored the applicability of the ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN) to the music genre classification task. The CLNN preserves the inter-frame relation of a temporal signal and the spatial locality of the features. The MCLNN extends the CLNN by enforcing a systematic sparseness over the network's links following a band-like pattern, which mimics a filterbank. The filterbank-like pattern induces the network to learn in frequency bands. The mask also automates the exploration of several feature combinations concurrently, which is usually a manual process of handcrafting the optimum feature combinations. The MCLNN has achieved competitive accuracies on the Ballroom and the Homburg music datasets compared to several handcrafted attempts, in addition to state-of-the-art Convolutional Neural Networks. The MCLNN has achieved these accuracies without depending on any musical perceptual properties used in several hand-crafted attempts, which allows the MCLNN to generalize to other types of multi-dimensional temporal signals. In future work, we will consider deeper MCLNN architectures with further optimization of the masking patterns, in addition to using different orders across the layers. We will also explore applying the MCLNN to multi-dimensional temporal representations other than spectrograms.
This work is funded by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 608014 (CAPACITIE).
-  Abdel-Hamid, O., Mohamed, A.R., Jiang, H., Deng, L., Penn, G., Yu, D.: Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech and Language Processing 22(10), 1533–1545 (2014)
-  Aryafar, K., Shokoufandeh, A.: Music genre classification using explicit semantic analysis. In: International ACM workshop on Music Information Retrieval With User-Centered and Multimodal Strategies (MIRUM) (2011)
-  Battenberg, E., Wessel, D.: Analyzing drum patterns using conditional deep belief networks. In: International Society for Music Information Retrieval, ISMIR (2012)
-  Bergstra, J., Casagrande, N., Erhan, D., Eck, D., Kégl, B.: Aggregate features and adaboost for music classification. Machine Learning 65(2-3), 473–484 (2006)
-  Fahlman, S.E., Hinton, G.E., Sejnowski, T.J.: Massively parallel architectures for AI: NETL, Thistle, and Boltzmann machines. In: National Conference on Artificial Intelligence, AAAI (1983)
-  Gouyon, F., Klapuri, A., Dixon, S., Alonso, M., Tzanetakis, G., Uhle, C., Cano, P.: An experimental comparison of audio tempo induction algorithms. IEEE Transactions on Audio, Speech and Language Processing 14(5), 1832–1844 (2006)
-  Gouyon, F., Dixon, S., Pampalk, E., Widmer, G.: Evaluating rhythmic descriptors for musical genre classification. In: International AES conference (2004)
-  Hamel, P., Eck, D.: Learning features from music audio with deep belief networks. In: International Society for Music Information Retrieval Conference, ISMIR (2010)
-  Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–7 (2006)
-  Homburg, H., Mierswa, I., Moller, B., Morik, K., Wurst, M.: A benchmark dataset for audio classification and clustering. In: International Symposium on Music Information Retrieval (2005)
-  Kereliuk, C., Sturm, B.L., Larsen, J.: Deep learning and music adversaries. IEEE Transactions on Multimedia 17(11), 2059–2071 (2015)
-  Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: International Conference for Learning Representations, ICLR (2015)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Neural Information Processing Systems, NIPS (2012)
-  LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
-  Lidy, T., Rauber, A.: Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In: International Conference on Music Information Retrieval, ISMIR (2005)
-  Lidy, T., Rauber, A., Pertusa, A., Inesta, J.M.: Improving genre classification by combination of audio and symbolic descriptors using a transcription system. In: International Conference on Music Information Retrieval (2007)
-  Lin, M., Chen, Q., Yan, S.: Network in network. In: International Conference on Learning Representations, ICLR (2014)
-  Lykartsis, A., Lerch, A.: Beat histogram features for rhythm-based musical genre classification using multiple novelty functions. In: Conference on Digital Audio Effects (DAFx-15) (2015)
-  Marchand, U., Peeters, G.: The modulation scale spectrum and its application to rhythm-content description. In: International Conference on Digital Audio Effects (DAFx) (2014)
-  Medhat, F., Chesmore, D., Robinson, J.: Masked conditional neural networks for audio classification. In: International Conference on Artificial Neural Networks (ICANN) (2017)
-  Moerchen, F., Mierswa, I., Ultsch, A.: Understandable models of music collections based on exhaustive feature generation with temporal statistics. In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2006)
-  Mohamed, A.R., Hinton, G.: Phone recognition using restricted boltzmann machines. In: IEEE International Conference on Acoustics Speech and Signal Processing, ICASSP (2010)
-  Osendorfer, C., Schluter, J., Schmidhuber, J., Smagt, P.v.d.: Unsupervised learning of low-level audio features for music similarity estimation. In: Workshop on Speech and Visual Information Processing in conjunction with the International Conference on Machine Learning (ICML) (2011)
-  Panagakis, Y., Kotropoulos, C.: Music classification by low-rank semantic mappings. EURASIP Journal on Audio Speech and Music Processing 2013, 1–15 (2013)
-  Panagakis, Y., Kotropoulos, C.L., Arce, G.R.: Music genre classification via joint sparse low-rank representation of audio features. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(12), 1905–1917 (2014)
-  Peeters, G.: Spectral and temporal periodicity representations of rhythm for the automatic classification of music audio signal. IEEE Transactions on Audio, Speech, and Language Processing 19(5), 1242–1252 (2011)
-  Pohle, T., Schnitzer, D., Schedl, M., Knees, P., Widmer, G.: On rhythm and general music similarity. In: International Society for Music Information Retrieval, ISMIR (2009)
-  Pons, J., Lidy, T., Serra, X.: Experimenting with musically motivated convolutional neural networks. In: International Workshop on Content-based Multimedia Indexing, CBMI (2016)
-  Ranzato, M., Hinton, G.E.: Modeling pixel means and covariances using factorized third-order boltzmann machines. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2551–2558 (2010)
-  Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters 24(3) (2017)
-  Schluter, J., Osendorfer, C.: Music similarity estimation with the mean-covariance restricted boltzmann machine. In: International Conference on Machine Learning and Applications, ICMLA. pp. 118–123 (2011)
-  Seyerlehner, K., Schedl, M., Pohle, T., Knees, P.: Using block-level features for genre classification, tag classification and music similarity estimation. In: Music Information Retrieval eXchange, MIREX (2010)
-  Seyerlehner, K., Widmer, G.: Fusing block-level features for music similarity estimation. In: International Conference on Digital Audio Effects (DAFx-10) (2010)
-  Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, JMLR 15, 1929–1958 (2014)
-  Taylor, G.W., Hinton, G.E., Roweis, S.: Modeling human motion using binary latent variables. In: Advances in Neural Information Processing Systems, NIPS. pp. 1345–1352 (2006)
-  Tzanetakis, G., Cook, P.: Musical genre classification of audio signals. IEEE Transactions On Speech And Audio Processing 10(5) (2002)
-  Vapnik, V., Lerner, A.: Pattern recognition using generalized portrait method. Automation and Remote Control 24, 774–780 (1963)