Masked Conditional Neural Networks for Audio Classification

03/06/2018 · Fady Medhat et al. · University of York

We present the ConditionaL Neural Network (CLNN) and the Masked ConditionaL Neural Network (MCLNN), designed for temporal signal recognition. The CLNN takes into consideration the temporal nature of the sound signal, and the MCLNN extends the CLNN through a binary mask that preserves the spatial locality of the features and allows automated exploration of feature combinations, analogous to hand-crafting the most relevant features for the recognition task. The MCLNN has achieved competitive recognition accuracies on the GTZAN and the ISMIR2004 music datasets, surpassing several state-of-the-art neural network based architectures and hand-crafted methods applied to both datasets.




1 Introduction

The success of deep neural network architectures in image recognition [1] prompted their application to audio recognition [2][3]. One of the main drivers for this adoption is the need to eliminate the effort invested in hand-crafting the features required for classification. Several neural network based architectures have been proposed, but they are usually adapted to sound from other domains such as image recognition, which may not exploit sound-related properties. The Restricted Boltzmann Machine (RBM) [4] treats sound as static frames, ignoring the inter-frame relation, and the weight sharing in the vanilla Convolutional Neural Network (CNN) [5] does not preserve the spatial locality of the learned features; limited weight sharing was proposed in [2] in an attempt to tackle this problem for sound recognition.

Figure 1: Conditional Restricted Boltzmann Machine

The Conditional Restricted Boltzmann Machine (CRBM) [6] in Fig. 1 extends the RBM [7] to the temporal dimension. This is achieved by including conditional links from the previous visible frames to the hidden nodes, together with autoregressive links from the previous visible frames to the current visible nodes, as depicted in Fig. 1. The Interpolating CRBM (ICRBM) [8] achieved a higher accuracy compared to the CRBM for speech phoneme recognition by extending the CRBM to consider both the previous and future frames.

The CRBM behavior (and similarly this work) overlaps with that of a Recurrent Neural Network (RNN) such as the Long Short-Term Memory (LSTM) [9], an architecture designed for sequence labelling. The output of an RNN at a certain temporal instance depends on the current input and the hidden state of the network's internal memory from the previous input. Compared to an LSTM, a CRBM does not require an internal state, since the influence of the previous temporal input states is considered concurrently with the current input. Additionally, increasing the order (the number of past frames conditioned on) does not incur the vanishing or exploding gradients of Back-Propagation Through Time (BPTT) in recurrent neural networks, which the LSTM was introduced to solve, since back-propagation in a CRBM depends only on the number of layers, as in normal feed-forward neural networks.

Inspired by the human visual system, the Convolutional Neural Network (CNN) depends on two main operations, namely convolution and pooling. In the convolution operation, the input (usually a 2-dimensional representation) is scanned (convolved) by a small-sized weight matrix, referred to as a filter. Several small-sized filters scan the input to generate a number of feature maps equal to the number of filters. A pooling operation generates lower-resolution feature maps through either a mean or a max pooling operation. The CNN depends on weight sharing, which allows applying it to images of large sizes without having a dedicated weight for each pixel, since similar patterns may appear at different locations within an image. This is not optimally suitable for time-frequency representations, which prompted attempts to tailor the CNN filters for sound [10][11].
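As a concrete illustration of the two operations, the following is a minimal plain-NumPy sketch (not the paper's implementation) of a single valid convolution producing one feature map, followed by 2x2 max pooling:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a small filter over the input, producing one feature map."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Downsample a feature map by taking the max of non-overlapping blocks."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.random.rand(8, 8)        # e.g. a tiny time-frequency patch
kernel = np.random.rand(3, 3)       # one 3x3 filter (shared across positions)
fmap = conv2d_valid(image, kernel)  # 6x6 feature map
pooled = max_pool(fmap)             # 3x3 after 2x2 max pooling
```

Weight sharing appears here in that the same 3x3 kernel is reused at every position of the input, which is exactly the property the text argues is suboptimal for time-frequency representations.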


2 Conditional Neural Networks

Figure 2: Two CLNN layers with n = 1.

In this work, we introduce the ConditionaL Neural Network (CLNN). The CLNN adopts from the Conditional RBM the directed links between the previous visible nodes and the hidden nodes, and extends them to future frames as in the ICRBM. Additionally, the CLNN adopts a global pooling operation [12], which behaves as an aggregation operation found to enhance the classification accuracy in [13]. The CLNN allows the sequential relation across the temporal frames of a multi-dimensional signal to be considered collectively by processing a window of frames. The CLNN has a hidden layer in the form of a vector of e neurons, and it accepts an input of size d × (2n + 1), where d is the feature vector length and 2n + 1 is the number of frames in a window (n is the order, i.e. the number of frames in each temporal direction, and 1 is for the window's middle frame). Fig. 2 shows two CLNN layers, each having an order n = 1, where n is a tunable hyper-parameter that controls the window's width. Accordingly, each CLNN layer in the figure has a 3-dimensional weight tensor composed of one central weight matrix W_0^l and two off-center weight matrices W_{-1}^l and W_1^l, where the superscript l is the layer id. During the scanning of the signal across the temporal dimension, a frame in the window at index u is processed with its corresponding weight matrix W_u of the same index. The size of each W_u is d × e, i.e. the feature vector length by the hidden layer width. The number of weight matrices is 2n + 1 (the 1 is for the central frame), which matches the number of frames in the window. The output of a single CLNN step over a window of 2n + 1 frames is a single representative vector.

Several CLNN layers can be stacked on top of each other to form a deep architecture, as shown in Fig. 2. The figure also depicts a number of extra frames remaining after the processing applied through the two CLNN layers. These extra frames allow incorporating an aggregation operation within the network by pooling across the temporal dimension, or they can be flattened to form a single vector before feeding them to a fully connected network. The CLNN is trained over segments following (1)


q = (2n)·m + k    (1)

where q is the segment size, n is the order, m is the number of layers and k is the number of extra frames. The input at each CLNN layer has 2n fewer frames than the previous layer. For example, for n = 4, m = 3 and k = 5, the input is of size 29 frames. The output of the first layer is 21 frames. Similarly, the output of the second and third layers is 13 and 5 frames, respectively. The remaining 5 frames of the third layer are the extra frames to be pooled or flattened. The activation at a hidden node of a CLNN can be formulated as in (2)


ŷ_{j,t} = f( b_j + Σ_{u=-n..n} Σ_{i=1..d} x_{i,t+u} · W_{i,j,u} )    (2)

where ŷ_{j,t} is the activation at node j of a hidden layer for frame t in a segment of size q. This frame is also the window's middle frame at u = 0. The output ŷ_{j,t} is given by the value of the activation function f applied on the summation of the bias b_j of node j and the multiplication of x_{i,t+u} and W_{i,j,u}. The input x_{i,t+u} is the i-th feature in a single feature vector of length d at index t + u within a window, and W_{i,j,u} is the weight between the i-th feature of an input vector and the j-th hidden node. The index u (in x and W) ranges over the temporal window of the interval [-n, n] of frames to be considered within a segment. Reformulating (2) in vector form gives (3).


ŷ_t = f( b + Σ_{u=-n..n} x_{t+u} · W_u )    (3)

where ŷ_t, the activation vector observed at the hidden layer for the central frame t conditioned on the input vectors in the interval [t - n, t + n], is given by the activation function f applied on the summation of the bias vector b and the sum of the multiplications between the feature vector x_{t+u} at index t + u (u = 0 is for the window's middle frame at position t of the segment) and the corresponding weight matrix W_u at the same index, where u takes values in the range of the considered window from -n up to n. The conditional distribution can be captured using a logistic function as in p(y_t | x_{t-n}, ..., x_{t+n}) = σ(·), where σ is the hidden layer sigmoid function or the output layer softmax.
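The frame bookkeeping of (1) and the windowed step of (3) can be sketched in NumPy. This is an illustrative sketch, not the paper's implementation; the sizes d and e are hypothetical, and the logistic function stands in for the activation f:

```python
import numpy as np

# Eq. (1): q = (2n)m + k -- frames per segment so that m layers,
# each trimming n frames from both ends, leave k extra frames.
n, m, k = 4, 3, 5
q = 2 * n * m + k                     # 29 frames, matching the worked example
assert [q - 2 * n * i for i in range(m + 1)] == [29, 21, 13, 5]

# Eq. (3): one CLNN step over a window of 2n+1 frames.
rng = np.random.default_rng(0)
d, e = 256, 50                        # feature length, hidden width (hypothetical)
W = rng.standard_normal((2 * n + 1, d, e)) * 0.01  # one d x e matrix per offset u
b = np.zeros(e)

def clnn_step(window, W, b):
    """y_t = f(b + sum_u x_{t+u} . W_u); f is the logistic function here."""
    z = b + np.einsum('ud,ude->e', window, W)
    return 1.0 / (1.0 + np.exp(-z))

window = rng.standard_normal((2 * n + 1, d))  # frames t-n .. t+n
y = clnn_step(window, W, b)                   # one representative vector of e values
```

Sliding this step across the temporal dimension of a segment, one position at a time, yields the shrinking frame counts asserted above.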

3 Masked Conditional Neural Networks

Figure 3: Masking patterns. a) a mask with a bandwidth of 5 and an overlap of 3, b) the active links following the masking pattern in a, c) a mask with a bandwidth of 3 and a negative overlap.

The Mel-scaled analysis applied in MFCC and Mel-scaled spectrograms, both used extensively as intermediate signal representations by sound recognition systems, exploits a filterbank (a group of signal processing filters). Considering a sound signal represented in a spectrogram, the energy of a certain frequency bin may smear across nearby frequency bins. Aggregating the energy across neighbouring frequency bins is a possible representation to overcome such frequency shifts, which is what filterbanks tackle. More general mixtures across the bins could be hand-crafted to select the most prominent features for the signal under consideration.

The Masked ConditionaL Neural Network (MCLNN), which we introduce in this work, embeds a filterbank-like behaviour and allows the exploration of a range of feature combinations concurrently, instead of manually hand-crafting the optimum mixture of features. Fig. 3 depicts the implementation of the filterbank-like behaviour through a binary mask enforced over the network's links, which activates different regions of a feature vector while deactivating others following a band-like pattern. The mask is designed based on two tunable hyper-parameters: the bandwidth and the overlap. Fig. 3.a shows a binary mask having a bandwidth of 5 (the five consecutive ones in a column) and an overlap of 3 (the overlapping ones between two successive columns). A hidden node will act as an expert in a localized region of the feature vector without considering the rest of it. This is depicted in Fig. 3.b, which shows the active connections for each hidden node over a local region of the input feature vector, matching the mask pattern in Fig. 3.a. The overlap can be assigned negative values, as shown in Fig. 3.c. The figure shows a mask with a bandwidth of 3 and a negative overlap, depicted by the non-overlapping distance between the 1's of two successive columns. Additionally, the figure shows an additional role introduced by the mask, through the presence of shifted versions of the binary pattern across the first set of three columns compared to the second and third sets. This role involves the automatic exploration of a range of feature combinations concurrently. The columns in the figure map to hidden nodes. Therefore, for a single feature vector, one hidden node will consider the first 3 features of the feature vector, a following node will consider a different combination involving the first 2 features, and another node will consider yet a different combination using the first feature only.
This behaviour embeds a mix-and-match operation within the network, allowing the hidden nodes to learn different properties through the different combinations of features, meanwhile preserving spatial locality. The position of the binary values is specified through a linear index following (4)


lx = a + g·(d + bw - ov)    (4)

where the linear index lx of an active position is given by the bandwidth bw, the overlap ov and the feature vector length d, with the d × e mask matrix flattened column-wise. The term a takes the values in [0, bw - 1] and g is in the interval [0, ⌈(d·e)/(d + bw - ov)⌉]. The binary masking is enforced through an element-wise multiplication following (5).


W̄_u = W_u ∘ M    (5)

where W_u is the original weight matrix at index u, M is the masking pattern and ∘ is the element-wise multiplication. W̄_u is the new masked weight matrix replacing the weight matrix in (3).
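A sketch of the mask construction and the element-wise masking, assuming the linear index runs over a column-wise flattening of a d × e mask (small hypothetical sizes; bandwidth 5 and overlap 3 as in Fig. 3.a):

```python
import numpy as np

def build_mask(d, e, bw, ov):
    """Set ones at linear indices lx = a + g(d + bw - ov), a in [0, bw),
    over a column-wise flattening of the d x e mask matrix."""
    flat = np.zeros(d * e, dtype=np.uint8)
    step = d + bw - ov
    for g in range((d * e) // step + 1):
        for a in range(bw):
            lx = a + g * step
            if lx < d * e:
                flat[lx] = 1
    return flat.reshape(d, e, order='F')   # column-major matches the index

M = build_mask(d=10, e=6, bw=5, ov=3)
assert M[:, 0].sum() == 5                  # bandwidth: 5 consecutive ones per column
assert int((M[:, 0] & M[:, 1]).sum()) == 3 # overlap of 3 between successive columns

W_u = np.random.rand(10, 6)
W_masked = W_u * M                         # element-wise masking of a weight matrix
```

Because the step d + bw - ov is not a multiple of d, later bands spill into new columns at shifted row offsets, which produces the shifted pattern versions (and thus the concurrent feature-combination exploration) described for Fig. 3.c.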

4 Experiments

We performed the MCLNN evaluation using the GTZAN [14] and the ISMIR2004 datasets, both widely used in the literature for benchmarking several MIR tasks including genre classification. The GTZAN consists of 1000 music files categorized across 10 music genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae and rock). The ISMIR2004 dataset comprises training and testing splits of 729 files each. The splits have 6 unbalanced categories of music genres (classical, electronic, jazz-blues, metal-punk, rock-pop and world) of full-length recordings. All files were resampled at 22050 Hz and chunks of 30 seconds were extracted. A logarithmic Mel-scaled spectrogram transformation with 256 frequency bins was applied using an FFT window of 2048 samples (≈100 msec) and an overlap of 50%. The feature-wise z-score parameters of the training set were applied to both the validation and test sets. Segments of q frames following (1) were extracted.
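The feature-wise standardization step can be sketched as follows; the arrays here are random stand-ins for the log Mel-spectrogram frames (rows = frames, columns = 256 Mel bins) produced by the transformation described above:

```python
import numpy as np

# Stand-ins for log Mel-scaled spectrogram frames of the three partitions.
rng = np.random.default_rng(1)
X_train = rng.standard_normal((1000, 256))
X_val = rng.standard_normal((200, 256))
X_test = rng.standard_normal((300, 256))

# Feature-wise z-score: statistics come from the training set only,
# and the same parameters are applied to the validation and test sets.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0) + 1e-8   # guard against zero-variance bins
X_train_z = (X_train - mu) / sigma
X_val_z = (X_val - mu) / sigma
X_test_z = (X_test - mu) / sigma
```

Reusing the training-set statistics on the other partitions avoids leaking test-set information into the preprocessing.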

Table 1: Accuracies on the GTZAN

Classifier and Features                        Acc. %
CS + Multiple feat. sets [15]                  92.7
SRC + LPNTF + Cortical features [16]           92.4
RBF-SVM + Scattering Trans. [17]               91.4
MCLNN + MelSpec. (this work)                   85.1
RBF-SVM + Spec. DBN [4]                        84.3
MCLNN + MelSpec. (this work)                   84.1
Linear SVM + PSD on Octaves [18]               83.4
Random Forest + Spec. DBN [19]                 83.0
AdaBoost + Several features [13]               83.0
RBF-SVM + Spectral Covar. [20]                 81.0
Linear SVM + PSD on frames [18]                79.4
SVM + DWCH [21]                                78.5

Table 2: Accuracies on the ISMIR2004

Classifier and Features                        Acc. %
SRC + NTF + Cortical features [16]             94.4
KNN + Rhythm & timbre [22]                     90.0
SVM + Block-Level features [23]                88.3
MCLNN + MelSpec. (this work)                   86.0
MCLNN + MelSpec. (this work)                   84.8
MCLNN + MelSpec. (this work)                   84.8
GMM + NMF [24]                                 83.5
MCLNN + MelSpec. (this work)                   83.1
SVM + Symbolic features [25]                   81.4
NN + Spectral Similarity FP [26]               81.0
SVM + High-Order SVD [27]                      81.0
SVM + Rhythm and SSD [28]                      79.7

Evaluation schemes used by the listed works: 5-fold cross-validation; 50% training, 20% validation and 30% testing; leave-one-out cross-validation; 10-fold cross-validation; 50% training, 25% validation and 25% testing; 10 repetitions of 10-fold cross-validation; and the ISMIR2004 original split of 729 training and 729 testing files.










Table 3: MCLNN parameters

Layer   Hidden Nodes   Order n   Mask Bandwidth   Mask Overlap
1       220            4         40               -10
2       200            4         10               3

Table 4: Accuracies on the GTZAN random and fault-filtered splits

Classifier           Random Acc. %   Filtered Acc. %
MCLNN (this work)    84.4            65.8
DNN [30]             81.2            42.0

The network was trained to minimize the categorical cross-entropy between the segment's predicted label and the target one. The final decision of a clip's genre is based on majority voting across the frames of the clip. The experiments for both datasets were carried out using 10-fold cross-validation repeated 10 times. An additional experiment used the original ISMIR2004 split (729 training, 729 testing), also repeated 10 times. We adopted a two-layered MCLNN, as listed in Table 3, followed by a single-dimensional global mean pooling [12] layer to pool across the k extra frames, and finally a 50-node fully connected layer before the softmax output layer. Parametric Rectified Linear Units (PReLU) [29] were used for all the model's neurons. We applied the same model to both datasets to gauge the generalization of the MCLNN to datasets of different distributions. Table 1 and Table 2 list the accuracy achieved by the MCLNN along with other methods widely cited in the literature for the genre classification task on the GTZAN and the ISMIR2004 datasets. The MCLNN surpasses several state-of-the-art methods that depend on hand-crafted features or neural networks, achieving an accuracy of 85.1% and 86% over a 10-fold cross-validation for the GTZAN and ISMIR2004, respectively. We repeated the 10-fold cross-validation 10 times to validate the stability of the MCLNN's accuracy, where the MCLNN achieved 84.1% and 84.8% over the 100 training runs for the GTZAN and the ISMIR2004, respectively.
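The majority-voting decision rule over a clip's segment-level predictions can be sketched as follows (the probability values are illustrative):

```python
import numpy as np

def clip_genre(segment_probs):
    """Majority vote: each segment votes for its argmax class;
    the clip label is the most frequent vote."""
    votes = np.argmax(segment_probs, axis=1)
    return int(np.bincount(votes).argmax())

# e.g. 5 segments of one clip, softmax scores over 3 genres
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.5, 0.4, 0.1],
                  [0.2, 0.3, 0.5]])
label = clip_genre(probs)  # class 0 wins with 3 of 5 votes
```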

To further evaluate the MCLNN performance, we adopted the publicly available splits released by Kereliuk et al. [30]. They released two versions of splits for the GTZAN files: a randomly stratified split (50% training, 25% validation and 25% testing) and a fault-filtered version, in which they cleared out the mistakes in the GTZAN reported by Sturm [31], e.g. repetitions, distortion, etc. As listed in Table 4, the MCLNN achieved 84.4% and 65.8% on the random and fault-filtered splits, respectively, compared to the 81.2% and 42% achieved by Kereliuk et al. in their attempt to reproduce the work by Hamel et al. [4]. The experiments show that the MCLNN performs better than several neural network based architectures and is comparable to some other works dependent on hand-crafted features. The MCLNN achieved these accuracies without the rhythmic and perceptual properties [32] used by the methods that reported higher accuracies. Finally, regarding the size of the training data: the works in [4, 13, 18, 20, 30] used an FFT window of 50 msec, whereas the MCLNN achieved the mentioned accuracies using a 100 msec window, which halves the number of feature vectors used in classification and consequently the training complexity, allowing the MCLNN to scale to larger datasets.

5 Conclusions and Future work

We have introduced the ConditionaL Neural Network (CLNN) and its extension, the Masked ConditionaL Neural Network (MCLNN). The CLNN is designed to exploit the properties of multi-dimensional temporal signals by considering the sequential relationship across the temporal frames. The mask in the MCLNN enforces a systematic sparseness that follows a frequency band-like pattern. Additionally, it plays the role of automating the exploration of a range of feature combinations concurrently, analogous to the exhaustive manual search for hand-crafted feature combinations. We have applied the MCLNN to the problem of genre classification. Through an extensive set of experiments without any special rhythmic or timbral analysis, the MCLNN has sustained accuracies that surpass neural-network-based and several hand-crafted feature extraction methods referenced previously on both the GTZAN and the ISMIR2004 datasets, achieving 85.1% and 86%, respectively. Meanwhile, the MCLNN preserves a generality that allows it to be adapted to any temporal signal. Future work will involve optimizing the mask patterns and considering different combinations of the order n across the layers. We will also consider applying the MCLNN to other multi-channel temporal signals.


This work is funded by the European Union’s Seventh Framework Programme for research, technological development and demonstration under grant agreement no. 608014 (CAPACITIE).


  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems, NIPS, 2012.
  • [2] O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
  • [3] J. Schlüter, Unsupervised Audio Feature Extraction for Music Similarity Estimation. Thesis, 2011.
  • [4] P. Hamel and D. Eck, “Learning features from music audio with deep belief networks,” in International Society for Music Information Retrieval Conference, ISMIR, 2010.
  • [5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [6] G. W. Taylor, G. E. Hinton, and S. Roweis, “Modeling human motion using binary latent variables,” in Advances in Neural Information Processing Systems, NIPS, pp. 1345–1352, 2006.
  • [7] P. Smolensky, Information Processing in Dynamical Systems: Foundations of Harmony Theory, pp. 194–281. 1986.
  • [8] A.-R. Mohamed and G. Hinton, “Phone recognition using restricted boltzmann machines,” in IEEE International Conference on Acoustics Speech and Signal Processing, ICASSP, 2010.
  • [9] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput, vol. 9, no. 8, pp. 1735–80, 1997.
  • [10] J. Pons, T. Lidy, and X. Serra, “Experimenting with musically motivated convolutional neural networks,” in International Workshop on Content-based Multimedia Indexing, CBMI, 2016.
  • [11] K. J. Piczak, “Environmental sound classification with convolutional neural networks,” in IEEE International Workshop on Machine Learning for Signal Processing, MLSP, 2015.
  • [12] M. Lin, Q. Chen, and S. Yan, “Network in network,” in International Conference on Learning Representations, ICLR, 2014.
  • [13] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, “Aggregate features and adaboost for music classification,” Machine Learning, vol. 65, no. 2-3, pp. 473–484, 2006.
  • [14] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions On Speech And Audio Processing, vol. 10, no. 5, 2002.
  • [15] K. K. Chang, J.-S. R. Jang, and C. S. Iliopoulos, “Music genre classification via compressive sampling,” in International Society for Music Information Retrieval, ISMIR, 2010.
  • [16] Y. Panagakis, C. Kotropoulos, and G. R. Arce, “Music genre classification using locality preserving non-negative tensor factorization and sparse representations,” in International Society for Music Information Retrieval Conference, ISMIR, 2009.
  • [17] J. Anden and S. Mallat, “Deep scattering spectrum,” IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4114–4128, 2014.
  • [18] M. Henaff, K. Jarrett, K. Kavukcuoglu, and Y. LeCun, “Unsupervised learning of sparse features for scalable audio classification,” in International Society for Music Information Retrieval, ISMIR, 2011.
  • [19] S. Sigtia and S. Dixon, “Improved music feature learning with deep neural networks,” in International Conference on Acoustics, Speech, and Signal Processing, ICASSP, 2014.
  • [20] J. Bergstra, M. Mandel, and D. Eck, “Scalable genre and tag prediction with spectral covariance,” in International Society for Music Information Retrieval, ISMIR, 2010.
  • [21] T. Li, M. Ogihara, and Q. Li, “A comparative study on content-based music genre classification,” in ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR, 2003.
  • [22] T. Pohle, D. Schnitzer, M. Schedl, P. Knees, and G. Widmer, “On rhythm and general music similarity,” in International Society for Music Information Retrieval, ISMIR, 2009.
  • [23] K. Seyerlehner, M. Schedl, T. Pohle, and P. Knees, “Using block-level features for genre classification, tag classification and music similarity estimation,” in Music Information Retrieval eXchange, MIREX, 2010.
  • [24] A. Holzapfel and Y. Stylianou, “Musical genre classification using nonnegative matrix factorization-based features,” IEEE Transactions on Audio Speech and Language Processing, vol. 16, no. 2, pp. 424–434, 2008.
  • [25] T. Lidy, A. Rauber, A. Pertusa, and J. M. Inesta, “Improving genre classification by combination of audio and symbolic descriptors using a transcription system,” in International Conference on Music Information Retrieval, 2007.
  • [26] E. Pampalk, A. Flexer, and G. Widmer, “Improvements of audio-based music similarity and genre classification,” in International Conference on Music Information Retrieval, ISMIR, 2005.
  • [27] I. Panagakis, E. Benetos, and C. Kotropoulos, “Music genre classification: A multilinear approach,” in International Society for Music Information Retrieval, ISMIR, 2008.
  • [28] T. Lidy and A. Rauber, “Evaluation of feature extractors and psycho-acoustic transformations for music genre classification,” in International Conference on Music Information Retrieval, ISMIR, 2005.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in IEEE International Conference on Computer Vision, ICCV, 2015.
  • [30] C. Kereliuk, B. L. Sturm, and J. Larsen, “Deep learning and music adversaries,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2059–2071, 2015.
  • [31] B. L. Sturm, “The state of the art ten years after a state of the art: Future research in music information retrieval,” Journal of New Music Research, vol. 43, no. 2, pp. 147–172, 2014.
  • [32] J. P. Bello, Machine Listening of Music, pp. 159–184. New York, NY: Springer New York, 2014.