Predominant Musical Instrument Classification based on Spectral Features

by   Ankit Khairkar, et al.

This work examines one of the cornerstone problems of Musical Instrument Recognition, namely instrument classification. The IRMAS (Instrument Recognition in Musical Audio Signals) data set is used. The data includes music from various decades of the last century and therefore varies widely in audio quality. We present a concise summary of past work in this domain. Among the supervised learning algorithms we implemented for this classification task, the SVM classifier outperformed the other state-of-the-art models with an accuracy of 79%, although it had difficulty distinguishing between flute and organ. We also implemented unsupervised techniques, of which Hierarchical Clustering performed well. We have included most of the code (Jupyter notebook) for easy reproducibility.






1 Introduction

Music is to the soul what words are to the mind. With the advent of massive online streaming content, there is a need for on-demand music search over catalogues that are manageable and easy to store. Audio tagging has likewise become a challenge worth exploring. Music Information Retrieval (MIR) is the interdisciplinary research field focused on retrieving information from music. Beyond the commercial implications, the development of robust MIR systems will contribute to a myriad of applications, including recommender systems, genre identification and catalogue creation, thus making the entire catalogue manageable and easily accessible.

1.1 Related Works

Essid et al. [3] studied the classification of five different woodwind instruments. Mel-frequency cepstral coefficient (MFCC) features were extracted from the training tracks, as they were found helpful for classification based on tremolo, vibrato and sound attack. PCA was performed on the MFCC features for dimension reduction before feeding the transformed features to Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) classifiers. GMMs with 16 and 32 Gaussian components were used, with the latter giving better classification accuracy. SVMs were trained with linear and polynomial kernels, of which the former was found to be more effective.

Heittola et al. [5] proposed a way to identify the predominant musical instrument in a polyphonic audio file. The training data consists of 19 different musical instruments. For pre-processing the audio file, multiple decomposition techniques are discussed, such as independent component analysis (ICA) and non-negative matrix factorization (NMF). The latter provided a better separation of signals from the mixture of different sound sources in the audio sample. Mel-frequency cepstral coefficient features are then extracted from the reconstructed signals and fed to a Gaussian Mixture Model for classification.

Han et al. [4] identified instruments by extracting features from the mel-spectrogram using the convolutional layers of a CNN. The mel-spectrogram is an image containing information about the playing style, the frequency content of the sound excerpt and various spectral characteristics of the music. The input to the CNN is the magnitude of the mel-frequency spectrogram, compressed with a natural logarithm. Various sampling techniques and transformations are applied to extract most of the information from the sound excerpt; these are described in their paper. The proposed CNN architecture comprises convolutional layers, which extract features from the spectrogram automatically, and max pooling, which is used for dimensionality reduction, followed by classification. After experimenting with various activation functions, ‘ReLU’ (alpha = 0.33) gave the best classification result, with an overall F score of 0.602 on the IRMAS training data.

Hershey et al. [6] studied musical instrument recognition in videos. Their work primarily compared different neural network architectures on the basis of accuracy. ResNet-50 was observed to yield better results than the Fully Connected, VGG and AlexNet architectures. The YouTube-100M data set was created for this study.

Toghiani-Rizi and Windmark [11] collected different music samples, transformed them into the frequency domain and trained an ANN model on them. Experiments were carried out under different conditions: the complete music sample, only the attack, every characteristic of the music sample except the attack, the first 100 Hz of the frequency spectrum, and the subsequent 900 Hz of the same spectrum. The advantage of choosing the frequency domain over the time domain is that the music sample, which is otherwise continuous, is discretized directly. This eases the pre-processing effort.

Murthy and Koolagudi [9] surveyed and critically reviewed methods for extracting music-related information from an audio sample. Emphasis was given to real, publicly available data sets that have gained popularity in the field of Music Information Retrieval. The areas covered by the study include music similarity and indexing; genre, artist and raga identification; and music emotion classification. This research area finds applications particularly in personalized music cataloguing and recommendation.

1.2 Our Contribution

To solve any classification problem on audio or video files, the most important decision is how to extract features from a given file. While working with our audio data set, we found that even for the same note, the spectrogram differs depending on the instrument that produced the note. As an illustration, we recorded the same note with four different instruments and generated the corresponding spectrograms, shown in Figure 1. It is evident that we can use this property of the spectrogram to predict the instrument used to play a particular sound excerpt.

According to Eronen and Klapuri [2], timbre, perceptually, is the ‘colour’ of a sound. Experiments have sought to construct a low-dimensional space to accommodate similarity ratings. Efforts are then made to interpret these ratings acoustically or perceptually. The two principal dimensions here are spectral centroid and rise time. Spectral centroid corresponds to the perceived brightness of a sound. Rise time measures the time difference between the start of the sound and the moment of highest amplitude.

Deng et al. [1] have shown instruments usually have some unique properties that can be described by their harmonic spectra and their temporal and spectral envelopes. They have shown only the first few coefficients are enough for accurate classification.

For spectral analysis, MFCC is our preferred choice. According to Wikipedia [12], "the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel-scale of frequency". The mel is a unit of perceived pitch, so a mel value relates to pitch in much the same way as a frequency in hertz does. The basic flow of calculating the MFC coefficients is outlined below:

Speech Signal → Fast Fourier Transform → Mel Scale Filtering → Discrete Cosine Transform


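The frequency-to-mel transform is commonly given as:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

where $f$ is the frequency in hertz and $m$ is the corresponding value in mels.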
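As a minimal sketch of the frequency-to-mel mapping, the snippet below assumes the widely used constants 2595 and 700; the function names `hz_to_mel` and `mel_to_hz` are illustrative and are not taken from the study's code.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in hertz to mels (assumed constants: 2595 and 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse mapping, from mels back to hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


if __name__ == "__main__":
    # Equal steps in hertz shrink on the mel scale as frequency grows.
    for f in (100.0, 440.0, 1000.0, 8000.0):
        print(f, round(hz_to_mel(f), 1))
```

With these constants, 1000 Hz maps to roughly 1000 mels, which is the usual anchor point of the scale.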

1.3 Motivation of our work

Solving any classification problem involving musical instruments requires us to extract features from a given audio sample. For our data set, we found that the same note played on two different instruments produces different spectrograms. As an illustration, we recorded the same note with four different instruments and generated the spectrograms shown in Figure 1. From the figures, it is evident that we can use this property of the audio to recognize the instrument from which an audio sample originates.

Figure 1: Same note (audio) played by various instruments and their spectrograms: (a) EDM, (b) Guitar (acoustic), (c) Keyboard, (d) Organ

MFCCs are obtained by transforming the frequency (hertz) scale to the mel scale. Typically, MFCC coefficients are numbered from the 0th to the 20th order, and the first 13 coefficients are sufficient for our classification task. The lower-order cepstral coefficients are the primary representatives of the instrument. Although the higher-order coefficients give finer spectral detail, choosing a greater number of cepstral coefficients leads to models of increased complexity. This would in turn require more training data for the estimation of model parameters.

1.4 Data set

The IRMAS (Instrument Recognition in Musical Audio Signals) data set has been used. The data is polyphonic, so that a robust classifier can be built. It consists of .wav files, each 3 seconds long, covering about eleven instruments. We have chosen six of these instruments for recognition. Our data has 3,846 music samples totalling a little over three hours, which is sufficient for both training and testing. In addition, the data spans multiple genres, including country folk, classical, pop-rock and latin soul; including these genres could lead to better training. The data has been downloaded from [URL]. The number of audio samples per instrument class is reproduced in Table 1.

Instrument   Number of Samples   Total Clip Length (sec)
Flute        451                 1,353
Piano        721                 2,163
Trumpet      577                 1,731
Guitar       637                 1,911
Voice        778                 2,334
Organ        682                 2,046
Total        3,846               11,538 (3 hr 12 min)

Table 1: Instrument Samples and Clip Length

2 Methodology & Our Approach

Code associated with our study is available for public use at

2.1 Feature Extraction

Deng et al. [1] have shown that, to classify musical instruments more accurately, it is essential to extract more complex features in addition to MFCCs. Hence, we considered other features such as zero-crossing rate, spectral centroid, spectral bandwidth and spectral roll-off during feature extraction with Librosa. Zero-crossing rate is the rate at which the signal changes sign. Spectral centroid indicates where the center of mass of the spectrum is located and is associated with the perceived brightness of a sample. Spectral bandwidth measures how widely the spectrum is spread around the spectral centroid, computed as a magnitude-weighted deviation of the frequencies from the centroid. Spectral roll-off is the frequency below which a specified proportion of the overall spectral energy lies.
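The four descriptors above can be illustrated with a small NumPy sketch. It is not the Librosa implementation used in the study: it works on the whole signal rather than on short frames, accumulates spectral magnitude for the roll-off, and assumes a roll-off proportion of 85%. The function name `spectral_features` and the 440 Hz test tone are hypothetical.

```python
import numpy as np


def spectral_features(y, sr, rolloff_pct=0.85):
    """Whole-signal spectral descriptors (illustrative, no framing)."""
    y = np.asarray(y, dtype=float)

    # Zero-crossing rate: fraction of neighbouring sample pairs with opposite sign.
    negative = np.signbit(y)
    zcr = float(np.mean(negative[1:] != negative[:-1]))

    # Magnitude spectrum and the frequency (Hz) of each bin.
    mag = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    weights = mag / mag.sum()

    # Centroid: magnitude-weighted mean frequency.
    centroid = float(np.sum(freqs * weights))
    # Bandwidth: magnitude-weighted standard deviation around the centroid.
    bandwidth = float(np.sqrt(np.sum(weights * (freqs - centroid) ** 2)))

    # Roll-off: first frequency at which the cumulative magnitude reaches
    # rolloff_pct of the total (0.85 is an assumed value).
    cumulative = np.cumsum(mag)
    idx = int(np.searchsorted(cumulative, rolloff_pct * cumulative[-1]))
    rolloff = float(freqs[idx])

    return {"zcr": zcr, "centroid": centroid,
            "bandwidth": bandwidth, "rolloff": rolloff}


# Hypothetical test signal: one second of a 440 Hz sine sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t + 0.3)
features = spectral_features(tone, sr)
```

For a pure tone, the centroid and roll-off both sit at the tone's frequency and the bandwidth is close to zero; instruments with richer harmonics spread these values out, which is what makes them useful as features.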

To extract features from the audio files, we researched the standard libraries. We had two options: Librosa [7] and Essentia [8]. We implemented both. We used Intel’s Python-based Jupyter platform and scikit-learn [10]. Scikit-learn is an easy-to-use, open-source machine learning library that supports most supervised classification techniques.

Librosa is a Python package for music and audio data analysis. Some of its important components are ‘load’, ‘display’ and ‘feature’. ‘load’ reads an audio file as a floating-point time series. ‘display’ provides visualizations such as waveforms and spectrograms using matplotlib. ‘feature’ is used for extraction and manipulation of MFCCs and other spectral features. MFCCs are obtained by transforming from the frequency (hertz) scale to the mel scale.

Essentia, on the other hand, is an open-source C++ library with Python bindings for audio-based music information retrieval. This library computes the spectral energy associated with mel bands and the MFCCs of an audio sample. Windowing is also implemented in Essentia: the frequency content of a relatively long signal is analyzed by cutting it into short segments of a few milliseconds each. We used the default Hann window [8], a smoothing window typically characterized by good frequency resolution and reduced spectral leakage. The audio spectrum is analyzed by extracting MFCCs with the default values of hopSize (the hop length between frames) and frame size. The default parameters in Essentia are a sampling rate of 44.1 kHz, a hopSize of 512 and a frame size of 1024. The features extracted from the many segments of a sample signal are aggregated by taking their mean. They are then used as the features for each sample, labeled with its instrument class.

Librosa is one of the first Python libraries introduced for extracting features from audio data. It is also widely used and has a more established community than Essentia. In our work, Librosa gave better accuracy in out-of-sample validation, so we preferred it.

2.2 Classifier Training

We extracted the first 13 MFCC features using Librosa/Essentia. For each audio clip, we obtained a 259 × 13 feature matrix. We took the mean of each column to condense it into a 1 × 13 feature vector, which we used along with the five other features mentioned above. We labeled each vector with its instrument class using scikit-learn’s ‘LabelEncoder’.
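The condensing step can be sketched as follows. This is an illustration under stated assumptions, not the study's notebook: the helper names `aggregate_frames` and `clip_to_vector` are hypothetical, the file path is supplied by the caller, and Librosa's default hop length of 512 is relied on, which for a 3-second clip at 44.1 kHz yields about 259 frames. Note that `librosa.feature.mfcc` returns the matrix as coefficients by frames (13 × 259), so the mean is taken over the frame axis.

```python
import numpy as np


def aggregate_frames(frame_features):
    """Average a (n_features, n_frames) matrix over frames -> (n_features,)."""
    return np.asarray(frame_features, dtype=float).mean(axis=1)


def clip_to_vector(path, sr=44100, n_mfcc=13):
    """Load one clip and return its mean MFCC vector (hypothetical helper)."""
    import librosa  # imported lazily so aggregate_frames works without it

    y, sr = librosa.load(path, sr=sr)
    # Shape (n_mfcc, n_frames); roughly (13, 259) for a 3 s clip at 44.1 kHz.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return aggregate_frames(mfcc)


# Stand-in matrix with the shape described in the text (random values only).
fake_mfcc = np.random.default_rng(0).normal(size=(13, 259))
mfcc_vector = aggregate_frames(fake_mfcc)
```

Averaging over frames discards temporal information such as attack and decay, which is a trade-off of this compact representation.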

Supervised Classification Techniques

We used an 80-20 shuffled split for the training and testing sets, along with cross-validation to avoid over-fitting. We then applied different supervised classification techniques to identify the predominant musical instrument in each audio file. We started with logistic regression and a decision tree classifier. Classification trees are prone to over-fitting, and the decision tree accordingly did not perform well on the test data. We also applied bagging and boosting techniques to the MFCC and spectral features. We tried Random Forest to control over-fitting; with some parameter tuning, it gave better classification. We also tried XGBoost on the same set of features, and after fine-tuning the parameters, gradient boosting classified the instruments with an accuracy of 0.7.

We also fitted a Support Vector Machine (SVM) classifier to the extracted features. It outperformed the traditional classification techniques. We used the radial basis function (RBF) kernel for this non-linear classification and fine-tuned the penalty parameter ‘C’ and the kernel coefficient ‘gamma’, which improved the overall accuracy on the test data. With 10-fold cross-validation, we obtained an accuracy score of 79.41%. We also designed a simple neural network for this classification, with three layers of 30, 15 and 6 neurons respectively. We applied the ‘relu’ activation function to the first two layers and ‘sigmoid’ to the last layer. This neural network also performed better than most of the traditional techniques. In terms of accuracy, bagging and boosting models such as Random Forest and XGBoost performed better than traditional models such as classification trees and logistic regression. Finally, SVM turned out to be more accurate than the other classifiers.
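A minimal scikit-learn sketch of this training setup is given below. Everything specific in it is an assumption made for illustration: the features are synthetic stand-ins for the 18-dimensional vectors (13 MFCC means plus five other features), the values of `C` and `gamma` are placeholders rather than the tuned values, and the `StandardScaler` step is a common addition that the text does not mention.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
instruments = ["flute", "piano", "trumpet", "guitar", "voice", "organ"]

# Synthetic stand-in data: 60 samples per class around a random class center.
centers = rng.normal(scale=3.0, size=(len(instruments), 18))
X = np.vstack([c + rng.normal(size=(60, 18)) for c in centers])
labels = np.repeat(instruments, 60)

# Encode instrument names as integer class labels.
y = LabelEncoder().fit_transform(labels)

# 80-20 shuffled split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0
)

# RBF-kernel SVM; C and gamma here are placeholders, not the tuned values.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))

# 10-fold cross-validation on the training portion.
cv_scores = cross_val_score(model, X_train, y_train, cv=10)

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(cv_scores.mean(), test_accuracy)
```

In practice `C` and `gamma` would be searched, for example with `GridSearchCV`, rather than fixed by hand.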

2.3 Evaluation Criteria

The following evaluation metrics were used to judge the performance of the models:

  • Precision is the ratio $tp / (tp + fp)$, where $tp$ is the number of true positives and $fp$ the number of false positives. Precision intuitively describes the ability of the classifier not to label a negative sample as positive. Precision for the various models implemented is shown as a box plot in Fig. 3(a).

  • Recall is the ratio $tp / (tp + fn)$, where $tp$ is the number of true positives and $fn$ the number of false negatives. Recall is intuitively the ability of the classifier to identify all the positive samples. Fig. 3(b) visualizes Recall for the various supervised classifiers implemented.

  • F1 score can be interpreted as the harmonic mean of Precision and Recall.

  • Confusion Matrix evaluates the performance of a supervised classifier using a cross-tabulation of actual and predicted classes. The comparison for the various models is shown in Fig. 2.

Instrument   Logistic Regression   Decision Tree       LGBM
             P     R     F1        P     R     F1      P     R     F1
Flute        0.58  0.39  0.47      0.43  0.44  0.43    0.66  0.59  0.62
Piano        0.55  0.59  0.57      0.53  0.54  0.53    0.69  0.73  0.71
Trumpet      0.44  0.53  0.48      0.50  0.46  0.48    0.59  0.67  0.63
Guitar       0.63  0.57  0.60      0.60  0.57  0.58    0.73  0.68  0.71
Voice        0.58  0.48  0.52      0.52  0.50  0.51    0.72  0.54  0.62
Organ        0.51  0.61  0.56      0.50  0.55  0.52    0.63  0.74  0.68

Instrument   XGBoost               Random Forest       SVM
             P     R     F1        P     R     F1      P     R     F1
Flute        0.66  0.59  0.62      0.72  0.48  0.58    0.63  0.63  0.63
Piano        0.72  0.71  0.71      0.72  0.75  0.74    0.79  0.84  0.81
Trumpet      0.58  0.69  0.63      0.61  0.72  0.66    0.78  0.77  0.78
Guitar       0.71  0.72  0.71      0.73  0.72  0.72    0.77  0.76  0.77
Voice        0.75  0.53  0.62      0.74  0.54  0.62    0.78  0.67  0.72
Organ        0.65  0.74  0.69      0.63  0.80  0.70    0.79  0.85  0.82

Table 2: Precision (P), Recall (R) and F1 Score for various Supervised Models
Figure 2: Confusion matrices for the supervised algorithms: (a) Logistic Regression, (b) Decision Tree, (c) Light GBM, (d) XGBoost, (e) Random Forest, (f) SVM

Figure 3: Evaluation metrics for the supervised algorithms: (a) Precision, (b) Recall, (c) F1 Score, (d) Accuracy

Figure 4: Instrument-wise classification
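The metric definitions in this section can be checked with a few lines of plain Python. The helper names `precision_recall` and `f1_from_pr` are illustrative; the precision and recall inputs in the worked example are the SVM values for Piano, Voice and Organ as reported in Table 2.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# SVM rows of Table 2: (precision, recall) per instrument.
svm_rows = {"Piano": (0.79, 0.84), "Voice": (0.78, 0.67), "Organ": (0.79, 0.85)}
svm_f1 = {name: round(f1_from_pr(p, r), 2) for name, (p, r) in svm_rows.items()}
print(svm_f1)
```

The recomputed values agree with the F1 column of Table 2 for these three rows (0.81, 0.72 and 0.82).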

3 Results

From Table 2, it is clear that SVM performs best overall. However, we observed that the classifier struggles to distinguish between flute and organ. We also tried to classify the instruments using a 3-layer neural network and obtained an accuracy of 64%, lower than that of SVM. We also tried unsupervised algorithms such as ‘K-means’ and ‘Hierarchical Clustering’. With the K-means algorithm, the clusters formed did not correspond to instruments. Hierarchical clustering produced meaningful results when we cut the dendrogram at the 30th level, as shown in Fig. 5.

Figure 5: Hierarchical Cluster
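For readers who wish to experiment with dendrogram cuts, a minimal SciPy sketch follows. It rests on several assumptions that the text does not settle: the data is synthetic, Ward linkage is chosen arbitrarily, and "the 30th level" is interpreted here as a distance threshold of 30 on the dendrogram, which may not match the cut actually used.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)

# Synthetic stand-in features: six well-separated groups of 40 samples each.
centers = rng.normal(scale=20.0, size=(6, 18))
X = np.vstack([c + rng.normal(size=(40, 18)) for c in centers])

# Agglomerative clustering; Ward linkage is an assumed choice.
Z = linkage(X, method="ward")

# Cut the dendrogram at an assumed distance threshold of 30.
cluster_ids = fcluster(Z, t=30.0, criterion="distance")
n_clusters = len(np.unique(cluster_ids))
print(n_clusters)
```

Passing `criterion="maxclust"` instead would fix the number of clusters directly, which is an alternative way to compare clusters against the six instrument classes.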

4 Future Scope

There is scope to use the same approach on a different data set. One can explore the idea of classifying Indian instruments. More libraries for extraction of MFCC features can be explored, as we implemented only two libraries viz. Librosa and Essentia. One can look at designing neural networks of greater complexity to identify a given musical instrument. More features in addition to the MFCCs can be studied and extracted using signal processing techniques to improve the accuracy of instrument classification.


Acknowledgements

We would like to thank Prof. Debapriyo Majumdar for allowing us to pursue such an interesting topic for our course project. We would also like to thank Intel DevCloud for providing the virtual machine that we used in our experiments.


References

[1] Jeremiah D. Deng, Christian Simmermacher, and Stephen Cranefield. A study on feature analysis for musical instrument classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(2):429–438, 2008.
[2] A. Eronen and A. Klapuri. Musical instrument recognition using cepstral coefficients and temporal features. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’00), volume 2, pages II753–II756, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0-7803-6293-4. doi: 10.1109/ICASSP.2000.859069.
[3] Slim Essid, Gaël Richard, and Bertrand David. Musical instrument recognition on solo performances. In 2004 12th European Signal Processing Conference, pages 1289–1292. IEEE, 2004.
[4] Yoonchang Han, Jaehun Kim, and Kyogu Lee. Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 25(1):208–221, 2017.
[5] Toni Heittola, Anssi Klapuri, and Tuomas Virtanen. Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In ISMIR, pages 327–332, 2009.
[6] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.
[7] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8, 2015.
[8] MTG UPF. Essentia: open-source library and tools for audio and music analysis, description and synthesis, 2019. [Online; accessed 23-November-2019].
[9] Y. V. Murthy and Shashidhar G. Koolagudi. Content-based music information retrieval (CB-MIR) and its applications toward the music industry: A review. ACM Computing Surveys (CSUR), 51(3):45, 2018.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[11] Babak Toghiani-Rizi and Marcus Windmark. Musical instrument recognition using their distinctive characteristics in artificial neural networks. arXiv preprint arXiv:1705.04971, 2017.
[12] Wikipedia contributors. Mel-frequency cepstrum. Wikipedia, the free encyclopedia, 2019. [Online; accessed 23-November-2019].