1 Introduction
Music is to the soul what words are to the mind. With the advent of massive online streaming content, there is a need for on-demand music search that is easy to manage and store, and audio tagging has emerged as a challenging problem in its own right. Music Information Retrieval (MIR) is the interdisciplinary research field focused on retrieving information from music. Beyond its commercial implications, the development of robust MIR systems will contribute to a myriad of applications, including recommender systems, genre identification and catalogue creation, making an entire music catalogue manageable and easily accessible.
1.1 Related Works
Essid et al. [3] studied the classification of five different woodwind instruments. Mel-frequency cepstral coefficient (MFCC) features were extracted from the training tracks, as they were found helpful for classification based on tremolo, vibrato and sound attack. PCA was performed on the MFCC features for dimension reduction before the transformed features were fed to Gaussian mixture model (GMM) and support vector machine (SVM) classifiers. GMMs with 16 and 32 Gaussian components were used, with the latter giving better classification accuracy. SVMs were trained with linear and polynomial kernels, of which the former was found to be more effective.
Heittola et al. [5] proposed a unique way to identify the predominant musical instrument in a polyphonic audio file. The training data consist of 19 different musical instruments. For pre-processing of the audio, multiple decomposition techniques are discussed, such as independent component analysis (ICA) and non-negative matrix factorization (NMF). The latter provided a better separation of signals from the mixture of different sound sources in the audio sample. Mel-frequency cepstral coefficient features are then extracted from the reconstructed signals and fed to a Gaussian mixture model for classification.
Han et al. [4]'s approach to instrument identification revolved around extracting features from the mel-spectrogram using the convolutional layers of a CNN. The mel-spectrogram is an image containing information about playing style, the frequency content of the sound excerpt and various spectral characteristics of the music. The input given to the CNN is the magnitude of the mel-frequency spectrogram, compressed with the natural logarithm. Various sampling techniques and transformations are performed to extract most of the information from the sound excerpt; these are detailed in their paper. The proposed CNN architecture uses convolutional layers to extract features from the spectrogram automatically, with max pooling for dimensionality reduction before classification. After experimenting with various activation functions, ReLU (alpha = 0.33) gave the best classification result, with an overall F-score of 0.602 on the IRMAS training data.
Hershey et al. [6] researched musical instrument recognition in videos. Their work primarily revolved around comparing different neural network architectures on the basis of accuracy. ResNet-50 was observed to yield the best result among fully connected, VGG and AlexNet architectures. The YouTube-100M data set was created for this study.
Toghiani-Rizi and Windmark [11] collected different music samples, transformed them into the frequency domain and trained an ANN model. Experiments were carried out under different conditions: the complete music sample, only the attack, every characteristic of the sample except the attack, the first 100 Hz of the frequency spectrum, and the subsequent 900 Hz of the same spectrum. The advantage of choosing the frequency domain over the time domain is that the music sample, which is otherwise continuous, can be discretized directly, easing the pre-processing effort.
Murthy and Koolagudi [9] ascertained and critically reviewed methods of extracting music-related information from a given audio sample. Emphasis was placed on real data sets that are publicly available and have gained popularity in the field of Music Information Retrieval. Areas covered by the study include music similarity and indexing; genre, artist and raga identification; and music emotion classification. The research area finds applications particularly in personalized music cataloguing and recommendation.
1.2 Our Contribution
To solve any classification problem on audio or video files, the most important step is choosing how to extract features from a given file. While dealing with our audio data set, we found that even for the same musical note, the spectrogram differs based on the instrument from which the note originates. As an illustration, we recorded the same note with four different instruments and generated the corresponding spectrograms, shown in Figure 1. It is evident that we can use this property of the spectrogram to predict the instrument used while playing a particular sound excerpt.
According to Eronen and Klapuri [2], timbre is, perceptually, the 'colour' of a sound. Experiments have sought to construct a low-dimensional space to accommodate similarity ratings, and efforts are then made to interpret these ratings acoustically or perceptually. The two principal dimensions here are spectral centroid and rise time. The spectral centroid corresponds to the perceived brightness of a sound, while rise time measures the time from the start of a note to its moment of highest amplitude.
Deng et al. [1] have shown that instruments usually have unique properties that can be described by their harmonic spectra and their temporal and spectral envelopes. They have also shown that only the first few coefficients are needed for accurate classification.
For spectral analysis, MFCC is a standard choice. According to Wikipedia [12], "the mel-frequency cepstrum (MFC) is a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency". The mel is a unit of perceived pitch, analogous to how hertz is a unit of frequency. The basic flow of calculating the MFC coefficients is: frame the signal into short windows, take the Fourier transform of each frame, map the resulting powers onto the mel scale with a filter bank, take the logarithm of the mel-band energies, and apply a discrete cosine transform; the MFCCs are the amplitudes of the resulting cepstrum.
The mathematical formula for the frequency-to-mel transform is given as:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

where $f$ is the frequency in hertz and $m$ is the corresponding value in mels.
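As a quick sanity check, a minimal Python sketch of this conversion is shown below; `librosa.hz_to_mel` with `htk=True` applies the same formula (Librosa's default is the alternative Slaney formulation).

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    """Hertz to mel via the 2595 * log10(1 + f/700) formula quoted above."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

print(hz_to_mel(1000.0))                    # roughly 1000 mel by construction
print(librosa.hz_to_mel(1000.0, htk=True))  # librosa's HTK-style variant uses the same formula
```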
1.3 Motivation of our work
Solving any classification problem involving musical instruments requires us to extract features from a given audio sample. For our data set, we found that the same note originating from two different instruments has different spectrograms. As an illustration, we recorded the same note with four different instruments and generated the spectrograms shown below. From the figures, it is evident that we can use this property of the audio to recognize the instrument from which the audio sample originates.
[Figure 1: Spectrograms of the same note played on four different instruments.]
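The following is a sketch of how such spectrograms can be generated with Librosa and matplotlib; the file names are hypothetical placeholders for the four recordings.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Placeholder file names for the four recordings of the same note.
files = ["flute_note.wav", "piano_note.wav", "trumpet_note.wav", "guitar_note.wav"]

fig, axes = plt.subplots(2, 2, figsize=(10, 6))
for ax, path in zip(axes.ravel(), files):
    signal, sr = librosa.load(path, duration=3.0)      # 3-second excerpt
    S = librosa.feature.melspectrogram(y=signal, sr=sr)
    S_db = librosa.power_to_db(S, ref=np.max)           # log-compressed magnitude
    librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(path)
plt.tight_layout()
plt.show()
```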
MFCCs are obtained by transforming the frequency (hertz) scale to the mel scale. Typically, up to about 20 MFCC coefficients are computed, and the first 13 are sufficient for our classification task. The lower-order cepstral coefficients are the primary representatives of the instrument. Although higher-order coefficients give more fine-grained spectral detail, choosing a larger number of cepstral coefficients leads to models of increased complexity, which in turn require more training data to estimate the model parameters.
1.4 Data set
The IRMAS (Instrument Recognition in Musical Audio Signals) data set has been used. The data are polyphonic, which helps in building a robust classifier. They consist of 3-second .wav files covering eleven instruments, of which we have chosen six for recognition. Our subset has 3,846 samples amounting to about three hours of audio, which is sufficient for both training and testing. In addition, the data span multiple genres, including country-folk, classical, pop-rock and Latin soul; the inclusion of multiple genres should lead to better training. The data were downloaded from https://www.upf.edu/web/mtg/irmas. The number of audio samples per instrument class is reproduced in Table 1, and a short sketch for enumerating the clips and their labels is given after the table.
Instrument | Number of Samples | Clip Length (in sec)
---|---|---
Flute | 451 | 1,353
Piano | 721 | 2,163
Trumpet | 577 | 1,731
Guitar | 637 | 1,911
Voice | 778 | 2,334
Organ | 682 | 2,046
Total | 3,846 | 11,538 (3 hr 12 min)

Table 1: Number of audio samples and total clip length per instrument class.
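For completeness, here is a minimal sketch of how the clips can be enumerated and labeled from a directory layout in which each instrument's .wav files sit in their own folder. The root and folder names below are placeholders, not the exact IRMAS directory names.

```python
from pathlib import Path

# Placeholder root and folder names; adjust to the layout of the downloaded data.
DATA_ROOT = Path("IRMAS-TrainingData")
INSTRUMENTS = ["flute", "piano", "trumpet", "guitar", "voice", "organ"]

samples = []                                   # (path, instrument label) pairs
for instrument in INSTRUMENTS:
    for wav in sorted((DATA_ROOT / instrument).glob("*.wav")):
        samples.append((wav, instrument))

print(f"collected {len(samples)} clips")
```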
2 Methodology & Our Approach
Code associated with our study is available for public use at https://github.com/vntkumar8/musical-instrument-classification.
2.1 Feature Extraction
Deng et al. [1] have shown that achieving more accurate classification of musical instruments requires extracting more elaborate features beyond MFCCs. Hence, during feature extraction with Librosa we also considered the zero-crossing rate, spectral centroid, spectral bandwidth and spectral roll-off. The zero-crossing rate indicates the rate at which the signal changes sign. The spectral centroid indicates where the centre of mass of the spectrum is located and correlates with the perceived brightness of a sample. The spectral bandwidth measures the spread of the spectrum around its centroid. The spectral roll-off is the frequency below which a specified proportion of the total spectral energy lies.
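A minimal sketch of extracting these descriptors with Librosa is shown below; the frame-wise values are averaged so that each clip contributes one number per feature, and the file name is a placeholder.

```python
import numpy as np
import librosa

def spectral_descriptors(path):
    """Frame-wise spectral descriptors, averaged over time, for one audio clip."""
    signal, sr = librosa.load(path)
    feats = {
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(signal),
        "spectral_centroid":  librosa.feature.spectral_centroid(y=signal, sr=sr),
        "spectral_bandwidth": librosa.feature.spectral_bandwidth(y=signal, sr=sr),
        "spectral_rolloff":   librosa.feature.spectral_rolloff(y=signal, sr=sr),
    }
    # Each call returns a (1, n_frames) array; average over frames for one value per clip.
    return {name: float(np.mean(values)) for name, values in feats.items()}

print(spectral_descriptors("flute_clip.wav"))   # placeholder file name
```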
To extract features from the audio files, we surveyed the standard libraries. We had two options, Librosa [7] and Essentia [8], and we implemented both. We used Intel's Python-based Jupyter platform and scikit-learn [10]. Scikit-learn is an easy-to-use, open-source machine learning library that supports most supervised classification techniques.
Librosa is a Python package for music and audio data analysis. Some important functions of Librosa include load, display and features: 'load' reads an audio file as a floating-point time series; 'display' provides visualizations such as waveforms and spectrograms using matplotlib; and 'features' is used for extraction and manipulation of MFCCs and other spectral features. MFCCs are obtained by transforming from the frequency (hertz) scale to the mel scale.
Essentia, on the other hand, is an open-source C++-based package, available in a Python environment, for audio-based music information retrieval. This library computes the spectral energy associated with the mel bands and their MFCCs for an audio sample. A windowing procedure is also implemented in Essentia: it analyzes the frequency content of an audio spectrum by creating short sound segments of a few milliseconds from a relatively longer signal. By default, we used the Hann window [8], a smoothing window typically characterized by good frequency resolution and reduced spectral leakage. The audio spectrum is analyzed by extracting MFCCs based on the default inputs of hopSize (hop length between frames) and frame size; the Essentia defaults are a sampling rate of 44.1 kHz, a hopSize of 512 samples and a frame size of 1024 samples. The features extracted from the manifold segments of a sample signal are aggregated by their mean and then used as the features of each sample, labeled with its instrument class.
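The following is a sketch of this frame-wise extraction in Essentia's standard mode, following its documented MFCC example; the file name is a placeholder, and the parameters mirror the defaults mentioned above.

```python
import numpy as np
import essentia.standard as es

# "clip.wav" is a placeholder file name.
audio = es.MonoLoader(filename="clip.wav", sampleRate=44100)()

window = es.Windowing(type="hann")       # Hann smoothing window
spectrum = es.Spectrum()
mfcc = es.MFCC(numberCoefficients=13)

coeffs = []
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    _, frame_mfcc = mfcc(spectrum(window(frame)))   # MFCC returns (mel-band energies, coefficients)
    coeffs.append(frame_mfcc)

clip_features = np.mean(np.array(coeffs), axis=0)   # aggregate the frames by their mean
```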
Librosa is one of the first Python libraries introduced for extracting features from audio data. It is also widely used and has a more established community than Essentia. In our work, Librosa provided better accuracy in out-of-sample validation, and hence we preferred it.
2.2 Classifier Training
We extracted the first 13 MFCC features using Librosa/Essentia. For each audio clip, we obtained a 259 × 13 matrix of frame-wise features. We took the mean of each column to obtain a condensed 1 × 13 feature vector, along with the five other features mentioned above. We labeled each vector with the instrument class using scikit-learn's LabelEncoder.
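A condensed sketch of this step with Librosa and scikit-learn is given below; the file paths and labels are placeholders, and the spectral features from the earlier sketch would be appended to each vector before training.

```python
import numpy as np
import librosa
from sklearn.preprocessing import LabelEncoder

def mfcc_vector(path):
    """Condense a clip's frame-wise MFCCs into one 13-dimensional vector by averaging over frames."""
    signal, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
    return mfcc.mean(axis=1)                                  # -> (13,)

# Placeholder file paths and labels.
paths = ["flute_clip.wav", "piano_clip.wav"]
labels = ["flute", "piano"]

X = np.vstack([mfcc_vector(p) for p in paths])
y = LabelEncoder().fit_transform(labels)                      # instrument names -> integer classes
```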
Supervised Classification Techniques
We used an 80-20 shuffled split for the training and testing sets, along with cross-validation to avoid over-fitting. We then applied different supervised classification techniques to identify the predominant musical instrument in an audio file. Initially, we started with logistic regression and a decision tree classifier; classification trees are prone to over-fitting, so the decision tree did not perform well on the test data. We also used bagging and boosting techniques on the MFCC and spectral features. Random Forest was tried to control the over-fitting and, with some parameter tuning, gave better classification. We also tried XGBoost on the same set of features; after fine-tuning its parameters, gradient boosting classified the instruments with an accuracy of 0.7.
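A sketch of this training setup is shown below; the hyper-parameter values are illustrative placeholders rather than the tuned values behind the reported accuracy, and `X`, `y` are the feature matrix and encoded labels built earlier.

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier   # requires the xgboost package

# 80-20 shuffled split on the feature matrix X and encoded labels y.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)

# Illustrative hyper-parameters; the tuned values are not reproduced here.
rf = RandomForestClassifier(n_estimators=300, max_depth=20, random_state=42)
rf.fit(X_tr, y_tr)
print("Random Forest test accuracy:", rf.score(X_te, y_te))

xgb = XGBClassifier(n_estimators=300, learning_rate=0.1, max_depth=6)
xgb.fit(X_tr, y_tr)
print("XGBoost test accuracy:", xgb.score(X_te, y_te))
```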
We also fitted a Support Vector Machine (SVM) classifier to the extracted features, and it outperformed the traditional classification techniques. We used the radial basis function (RBF) kernel for this non-linear classification and fine-tuned the penalty parameter 'C' and the kernel coefficient 'gamma', which improved the overall accuracy on the test data. The accuracy score of 79.41% was cross-validated with 10 folds. We also designed a simple neural network for this classification task, with three layers of 30, 15 and 6 neurons respectively; we applied the 'relu' activation on the first two layers and 'sigmoid' on the last layer. This neural network also performed better than most of the traditional techniques. In terms of accuracy, bagging and boosting models such as Random Forest and XGBoost performed better than traditional models such as classification trees and logistic regression, and SVM turned out to be the most accurate classifier overall.
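The SVM tuning can be sketched as follows; the grid values are illustrative placeholders, not the tuned values that produced the 79.41% figure.

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Illustrative grid over the penalty parameter C and kernel coefficient gamma.
param_grid = {"C": [1, 10, 100], "gamma": ["scale", 0.1, 0.01, 0.001]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_tr, y_tr)
print("best parameters:", search.best_params_)

# 10-fold cross-validated accuracy with the selected parameters.
print("10-fold CV accuracy:", cross_val_score(search.best_estimator_, X, y, cv=10).mean())
```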
2.3 Evaluation Criteria
The following evaluation metrics were used to judge the performance of the models:
- Precision is the ratio $t_p / (t_p + f_p)$, where $t_p$ is the number of true positives and $f_p$ the number of false positives. Precision intuitively describes the ability of the classifier not to label a negative sample as positive. Precision for the various models implemented is shown in the box-plots below.
- Recall is the ratio $t_p / (t_p + f_n)$, where $t_p$ is the number of true positives and $f_n$ the number of false negatives. Recall is intuitively the ability of the classifier to identify all the positive samples. The box-plots below also visualize recall for the various supervised classifiers implemented.
- F1 score can be interpreted as the harmonic mean of Precision and Recall, $F_1 = 2PR/(P + R)$.
- Confusion Matrix evaluates the performance of a supervised classifier using a cross-tabulation of actual and predicted classes. The comparison across the various models is shown in Fig. 2. A scikit-learn sketch for computing these metrics follows this list.
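These metrics can be computed directly with scikit-learn, as in the sketch below, using the test split and the tuned SVM from the previous sketches.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = search.best_estimator_.predict(X_te)
print(classification_report(y_te, y_pred))   # per-class precision, recall and F1
print(confusion_matrix(y_te, y_pred))        # rows: actual classes, columns: predicted classes
```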
Instrument | LR P | LR R | LR F1 | DT P | DT R | DT F1 | LGBM P | LGBM R | LGBM F1
---|---|---|---|---|---|---|---|---|---
Flute | 0.58 | 0.39 | 0.47 | 0.43 | 0.44 | 0.43 | 0.66 | 0.59 | 0.62
Piano | 0.55 | 0.59 | 0.57 | 0.53 | 0.54 | 0.53 | 0.69 | 0.73 | 0.71
Trumpet | 0.44 | 0.53 | 0.48 | 0.50 | 0.46 | 0.48 | 0.59 | 0.67 | 0.63
Guitar | 0.63 | 0.57 | 0.60 | 0.60 | 0.57 | 0.58 | 0.73 | 0.68 | 0.71
Voice | 0.58 | 0.48 | 0.52 | 0.52 | 0.50 | 0.51 | 0.72 | 0.54 | 0.62
Organ | 0.51 | 0.61 | 0.56 | 0.50 | 0.55 | 0.52 | 0.63 | 0.74 | 0.68

Instrument | XGB P | XGB R | XGB F1 | RF P | RF R | RF F1 | SVM P | SVM R | SVM F1
---|---|---|---|---|---|---|---|---|---
Flute | 0.66 | 0.59 | 0.62 | 0.72 | 0.48 | 0.58 | 0.63 | 0.63 | 0.63
Piano | 0.72 | 0.71 | 0.71 | 0.72 | 0.75 | 0.74 | 0.79 | 0.84 | 0.81
Trumpet | 0.58 | 0.69 | 0.63 | 0.61 | 0.72 | 0.66 | 0.78 | 0.77 | 0.78
Guitar | 0.71 | 0.72 | 0.71 | 0.73 | 0.72 | 0.72 | 0.77 | 0.76 | 0.77
Voice | 0.75 | 0.53 | 0.62 | 0.74 | 0.54 | 0.62 | 0.78 | 0.67 | 0.72
Organ | 0.65 | 0.74 | 0.69 | 0.63 | 0.80 | 0.70 | 0.79 | 0.85 | 0.82

Table 2: Precision (P), Recall (R) & F1 score per instrument for Logistic Regression (LR), Decision Tree (DT), LGBM, XGBoost (XGB), Random Forest (RF) and SVM.
[Figure: Box-plots of Precision and Recall across the supervised classifiers.]

[Figure 2: Confusion matrices for the supervised classifiers.]
3 Results
From Table 2, it is clear that SVM gives the highest accuracy. However, it has been observed that the classifier is unable to distinguish between flute and organ. We also tried to classify the instruments using a 3-layered neural network and obtained an accuracy of 64%, lower than that of SVM. In addition, we tried unsupervised algorithms such as K-means and hierarchical clustering. With K-means, the clusters did not form along instrument lines. Hierarchical clustering produced significant results when we cut the dendrogram at the 30th level, as shown in Fig. 5.
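A sketch of the hierarchical clustering step with SciPy is shown below; the Ward linkage and the choice of six flat clusters are assumptions, as the text does not fix them.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Ward linkage is an assumption; the linkage criterion is not specified in the text.
Z = linkage(X, method="ward")

# Truncated dendrogram, in the spirit of cutting the tree at a chosen level (cf. Fig. 5).
dendrogram(Z, truncate_mode="level", p=30)
plt.show()

# Flat cluster assignments, here asking for six clusters (one per instrument class).
cluster_labels = fcluster(Z, t=6, criterion="maxclust")
```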
[Figure 5: Dendrogram obtained from hierarchical clustering.]
4 Future Scope
There is scope to apply the same approach to other data sets; for instance, one could explore classifying Indian instruments. More libraries for MFCC feature extraction can be explored, as we implemented only two, viz. Librosa and Essentia. One could also design neural networks of greater complexity to identify a given musical instrument, and additional features beyond the MFCCs can be extracted using signal processing techniques to improve the accuracy of instrument classification.
Acknowledgments
We would like to thank Prof. Debapriyo Majumdar for allowing us to pursue such an interesting topic for our course project. We would also like to thank Intel DevCloud for providing the virtual machine used in our experiments.
References
- Deng et al. [2008] Jeremiah D Deng, Christian Simmermacher, and Stephen Cranefield. A study on feature analysis for musical instrument classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 38(2):429–438, 2008.
- Eronen and Klapuri [2000] A. Eronen and A. Klapuri. Musical instrument recognition using cepstral coefficients and temporal features. In Proceedings of the Acoustics, Speech, and Signal Processing, 2000. On IEEE International Conference - Volume 02, ICASSP ’00, pages II753–II756, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0-7803-6293-4. doi: 10.1109/ICASSP.2000.859069. URL http://dx.doi.org/10.1109/ICASSP.2000.859069.
- Essid et al. [2004] Slim Essid, Gaël Richard, and Bertrand David. Musical instrument recognition on solo performances. In 2004 12th European signal processing conference, pages 1289–1292. IEEE, 2004.
- Han et al. [2017] Yoonchang Han, Jaehun Kim, and Kyogu Lee. Deep convolutional neural networks for predominant instrument recognition in polyphonic music. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 25(1):208–221, 2017.
- Heittola et al. [2009] Toni Heittola, Anssi Klapuri, and Tuomas Virtanen. Musical instrument recognition in polyphonic audio using source-filter model for sound separation. In ISMIR, pages 327–332, 2009.
- Hershey et al. [2017] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135. IEEE, 2017.
- McFee et al. [2015] Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, volume 8, 2015.
- MTG upf [2019] MTG upf. Essentia open-source library and tools for audio and music analysis, description and synthesis, 2019. URL https://essentia.upf.edu/documentation/index.html. [Online; accessed 23-November-2019].
- Murthy and Koolagudi [2018] YV Murthy and Shashidhar G Koolagudi. Content-based music information retrieval (cb-mir) and its applications toward the music industry: A review. ACM Computing Surveys (CSUR), 51(3):45, 2018.
- Pedregosa et al. [2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Toghiani-Rizi and Windmark [2017] Babak Toghiani-Rizi and Marcus Windmark. Musical instrument recognition using their distinctive characteristics in artificial neural networks. arXiv preprint arXiv:1705.04971, 2017.
- Wikipedia contributors [2019] Wikipedia contributors. Mel-frequency cepstrum — Wikipedia, the free encyclopedia, 2019. URL https://en.wikipedia.org/w/index.php?title=Mel-frequency_cepstrum&oldid=917928298. [Online; accessed 23-November-2019].