Audio signal processing and music information retrieval (MIR) have evolved significantly with recent advances in deep learning. As a result, many existing software tools for audio analysis lack the functionality required by the latest state of the art and/or cannot be connected straightforwardly with external deep learning software, especially in the case of industrial deployment. For example, a typical pipeline for an audio tagging system may include computation and pre-processing of audio features (e.g., spectrograms) using audio analysis libraries (e.g., Essentia, Librosa, openSMILE or Madmom [1, 2, 3, 4]), followed by deep learning frameworks for model inference relying on those features (e.g., TensorFlow or PyTorch [5, 6]). While all software in the pipeline may provide APIs in different languages, such as C++ and Python, and can technically be interconnected, there is a lack of efficient cross-platform software libraries incorporating all the steps in a unified pipeline that makes its deployment and usage in applications as easy and efficient as possible. Some efforts have been devoted by TensorFlow (with tf.signal) and PyTorch (with torchaudio) to incorporate audio signal processing layers that can run on GPUs. Still, many deep learning practitioners rely on music/audio-specific pre-processing libraries, many of which are not optimized for efficiency.
Essentia (https://essentia.upf.edu) is an open-source library for audio and music analysis released under the AGPLv3 license, well known for its capability to serve as a basis for large-scale industrial applications as well as a rapid prototyping framework. Some of its key features are:
It is implemented in C++, with a strong focus on efficiency, which makes it the fastest open-source library with the largest number of features for audio analysis.
It supports a declarative approach to the implementation of signal processing pipelines with its “streaming mode,” which connects the algorithms for each computation step via ring buffers. This allows the user to streamline audio analysis, processing input files or audio streams in chunks (in particular, in real time), and also limits memory usage, which can be crucial for many applications.
It has a Python interface. Programming in an interpreted language while all the dataflow is ultimately controlled by optimized C++ code provides a balance between efficiency and flexibility.
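The ring-buffer idea behind the streaming mode can be illustrated with a minimal sketch in plain Python. The function name, frame and hop sizes below are illustrative; this is not Essentia's implementation:

```python
from collections import deque

def stream_frames(chunks, frame_size=1024, hop_size=512):
    """Yield fixed-size analysis frames from a chunked audio stream.

    A deque acts as a simplified ring buffer: incoming chunks of any
    length are appended, and a frame is emitted as soon as enough samples
    have accumulated, advancing by the hop size. Memory usage stays
    bounded by the buffer, regardless of the stream length.
    """
    buffer = deque()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_size:
            yield [buffer[i] for i in range(frame_size)]
            for _ in range(hop_size):
                buffer.popleft()
```

Because frames are produced as soon as enough samples arrive, the same loop serves both files read in chunks and real-time input streams.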
Given its focus on efficiency, flexibility of use, modularity and easy extensibility, we consider Essentia an attractive infrastructure on which to build efficient and modular deep learning pipelines for audio. A similar effort in the past led Essentia to integrate a collection of Support Vector Machine (SVM) classifiers based on engineered features and trained on in-house music collections (datasets) available at the Music Technology Group (MTG) (https://acousticbrainz.org/datasets/accuracy). These classifiers are publicly available and have been used extensively in research [8, 9, 10, 11, 12, 13] and in AcousticBrainz, an open database of music audio features with over 13.5 million analyzed tracks. These models achieved very competitive results according to a standard cross-fold validation, but when some of the classifiers were assessed on external data, they showed very poor performance, revealing low generalization capabilities. In addition, recent studies suggest that new approaches based on deep learning are able to outperform SVMs in audio tagging tasks [16, 17, 18, 19]. For these reasons, our goals are to implement a new set of algorithms in Essentia and to develop new classifier models based on deep learning and capable of better generalization, which can be used for both research and industrial applications.
Unfortunately, deep learning models require large amounts of training data to perform well [16, 18, 20] and, in most scenarios, it is unreasonable to assume that large training databases are available. Considering that many Essentia use cases might be limited by the size of the datasets at hand, we limit our experiments to such cases and train models on small in-house datasets previously used for training SVM classifiers. Several studies have revealed the potential of transfer learning techniques for small training data in the context of audio auto-tagging [20, 21]. For this reason, we investigate the generalization capabilities of this approach on our datasets.
In short, transfer learning takes advantage of the knowledge acquired on an external (source) task, where more training data is available, to improve performance on the target task, where data is scarce. Generally, this is done by fine-tuning the pre-trained model [20, 21] or by using it as a (fixed) feature extractor [17, 19]. In our work, we opt for the latter and compare such transfer learning models with (i) deep learning models trained from scratch and (ii) SVM classifiers based on engineered audio features.
The rest of the paper is structured as follows: we first introduce the algorithms we have developed to integrate TensorFlow in Essentia and present a number of state-of-the-art CNN models available out of the box in Section 2. In Section 3 we describe the process of training and evaluation for new classifiers based on our in-house datasets. We conclude in Section 4.
2 Bridging TensorFlow and Essentia
Our goal is to extend the Essentia framework to support deep learning models with fast inference times and a capability to run on CPUs or acceleration hardware such as GPUs. While we could have considered Python-based solutions similar to Madmom, we are interested in an integrated C++ solution to take advantage of its fast computational speed, which is crucial in many applications. The decision to use TensorFlow instead of other options such as PyTorch was motivated by the stability of its C API (https://www.tensorflow.org/install/lang_c), its active development keeping up with the state of the art, and the huge amount of existing research relying on it.
To this end, we have developed a set of algorithms that allow reading frozen models from Protobuf files, generating tensors from 1D or 2D audio representations, and running TensorFlow sessions. The algorithms were implemented with the following design criteria:
Efficiency. All dataflow between algorithms for audio feature extraction and model inference should be implemented in C++ without any conversion overhead to Python. We also decided to use TensorFlow frozen models, in which variables are converted to constants, allowing us to remove some training operations.
Flexibility. The deep learning field moves fast. Therefore, generic support for any TensorFlow architecture should be provided. This can be done by loading both the architecture and the weights from external files instead of hard-coding any particular architecture. Importantly, it is also possible to import models from other frameworks via intermediate formats such as ONNX.
Access to intermediate layers. Sometimes intermediate layers of a model are valuable as they can be used, for example, as features for other tasks. For this reason, it should be possible to extract the output tensors from any layer.
Real-time analysis. Being able to run computations in real time is one of the key features of Essentia that should be supported by its deep-learning algorithms. The latency and the overall real-time capability ultimately depend on the design of a model, its computational cost for inference, and/or receptive field.
The provided functionality covers only inference, not training, of TensorFlow models. Users are free to choose how to train their models as long as they ensure that the input features used for training are compatible with their implementation in Essentia used for inference. Ideally, users could also use Essentia features at the training stage in order to ensure the best compatibility. Many deep learning models proposed in research have been trained using features from different software, but they can also be reproduced in Essentia, as its algorithms are sufficiently configurable for most input audio features. For example, in the case of mel-spectrograms, Essentia can reproduce virtually any existing common mel implementation.
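As a sketch of what reproducing a mel implementation involves, the following NumPy code builds a triangular mel filterbank using the HTK-style mel scale. This is one of several common conventions, and the parameter defaults are illustrative rather than those of any specific model discussed here:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale; other conventions (e.g., Slaney) differ below 1 kHz
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_bands=96, n_fft=1024, sample_rate=16000):
    """Triangular mel filterbank with band edges equally spaced on the mel scale."""
    edges_mel = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_bands + 2)
    edges_hz = mel_to_hz(edges_mel)
    bins = np.floor((n_fft + 1) * edges_hz / sample_rate).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for i in range(n_bands):
        lo, center, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, center):        # rising slope of the triangle
            fb[i, k] = (k - lo) / (center - lo)
        for k in range(center, hi):        # falling slope of the triangle
            fb[i, k] = (hi - k) / (hi - center)
    return fb
```

Matching such details (mel formula, band edges, normalization, window type) between training and inference is exactly the compatibility requirement described above.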
Most TensorFlow models can then be made compatible with Essentia by freezing and serializing them into Protobuf files. This is a simple process that can be easily done using available Python scripts.
As an example of the efficiency of our framework, we compared inference times for MusiCNN using the original implementation in Python and our algorithms called from Essentia’s Python bindings. The original feature extraction, based on Librosa, took 6.51 seconds compared to 2.30 seconds for Essentia. Loading the model and predicting took 2.07 and 1.66 seconds, respectively. In total, considering the extra dataflow overhead, the difference is 8.60 versus 3.34 seconds, meaning that our framework is 2.5 times faster for the entire end-to-end process from loading audio to inference. These estimates were obtained by averaging 10 trials of the analysis of a 3:27 MP3 file on an i7-6700 CPU.
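The timing protocol above can be sketched as follows; the analysis callable is a stand-in for the actual extraction-plus-inference pipeline, not part of either implementation:

```python
import time

def average_runtime(analyze, trials=10):
    """Average wall-clock runtime of an analysis function over several trials,
    mirroring the averaging protocol used for the benchmark."""
    elapsed = []
    for _ in range(trials):
        start = time.perf_counter()
        analyze()  # e.g., load audio, extract features, run inference
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed)
```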
All new algorithms are available as a part of Essentia. We provide a tutorial with examples of how to install and use the framework, create TensorFlow frozen models and run those models to generate predictions, on the example of music auto-tagging (https://mtg.github.io/essentia-labs/). In addition, we have incorporated a number of state-of-the-art models from audio tagging research into Essentia, listed in Table 1, and made them publicly available on the official website (https://essentia.upf.edu/models/). We use some of these models in our experiments in Section 3.
3 Training CNN classifiers for Essentia
There are many annotated in-house music collections that have been used extensively in Essentia and in a number of related large-scale projects such as AcousticBrainz. These collections are summarized in Table 2. Even though their scale is not comparable to many recent datasets, they are interesting to work with because they represent a typical use case of a small amount of data available for a particular application. In addition, our intention is to improve the classifiers that have already been used in research, not to challenge the state of the art on any particular task.
In this section we take advantage of these datasets to train CNN classifiers in order to improve on the SVM-based models available in Essentia. Our model creation process is divided into two steps. First, we focus on the genre recognition task, for which we have additional validation datasets to select the best architecture and training strategy. Next, we use them to train classifiers for all our in-house music collections.
Table 2: In-house datasets used for training the classifiers (ft. = full tracks, exc. = excerpts).

| Dataset | Classes | Size |
|---|---|---|
| genre-dortmund | alternative, blues, electronic, folkcountry, funksoulrnb, jazz, pop, raphiphop, rock | 1820 exc. |
| genre-gtzan | blues, classic, country, disco, hip hop, jazz, metal, pop, reggae, rock | 1000 exc. |
| genre-rosamerica | classic, dance, hip hop, jazz, pop, rhythm and blues, rock, speech | 400 ft. |
| mood-acoustic | acoustic, not acoustic | 321 ft. |
| mood-electronic | electronic, not electronic | 332 ft./exc. |
| mood-aggressive | aggressive, not aggressive | 280 ft. |
| mood-relaxed | not relaxed, relaxed | 446 ft./exc. |
| mood-happy | happy, not happy | 302 exc. |
| mood-sad | not sad, sad | 230 ft./exc. |
| mood-party | not party, party | 349 exc. |
| danceability | danceable, not danceable | 306 ft. |
| voice/instrumental | voice, instrumental | 1000 exc. |
| gender | female, male | 3311 ft. |
| timbre | bright, dark | 3000 exc. |
| tonal/atonal | atonal, tonal | 345 exc. |
3.1 Architectures, training strategies and experimental setup
MusiCNN is a musically motivated CNN. It uses vertical and horizontal convolutional filters aiming to capture timbral and temporal patterns, respectively. The model contains 6 layers and 787,000 parameters.
VGG is an architecture from computer vision based on a deep stack of 3×3 convolutional filters, commonly used for audio [24, 18]. We consider two different implementations. VGG-I contains 5 layers with 128 filters each. Batch normalization and dropout are applied before each layer. The model has 605,000 trainable parameters. VGG-II follows configuration “E” from the original implementation for computer vision, with the difference that the number of output units is set to 3087. This model has 62 million parameters.
We compare transfer learning to the models trained from scratch:
Transfer learning models. A pre-trained model is loaded, and only a small neural network connected to its penultimate layer is trained. The models (MusiCNN, VGG-I and VGG-II) were previously trained on two audio tagging tasks:
MSD-train is the training set of the Million Song Dataset (MSD), annotated with Last.fm tags.
AudioSet contains 1.8 million audio clips from YouTube annotated with the AudioSet taxonomy, not specific to music.
MusiCNN and VGG-I are pre-trained on MSD-train, while VGG-II uses AudioSet. We considered two variants of transfer learning back-ends for these models in a preliminary experiment: (A) one fully connected output layer of n units and (B) two fully connected layers of 100 and n units, respectively, where n is the number of classes in the employed dataset. Variant A provided the best results for MusiCNN and VGG-I, while variant B gave the best results for VGG-II. We used these best configurations for each model in the rest of our study.
Models trained from scratch. The parameters of MusiCNN and VGG-I are randomly initialized and all the layers are trained.
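The two transfer learning back-end variants can be sketched with NumPy. The embedding dimensionality, initialization scale and the ReLU non-linearity on the hidden layer are assumptions made for the sketch, not details of the actual models:

```python
import numpy as np

def build_backend(emb_dim, n_classes, variant="A", hidden=100, seed=0):
    """Variant A: a single fully connected output layer of n_classes units.
    Variant B: a hidden layer of `hidden` units followed by the output layer."""
    rng = np.random.default_rng(seed)
    if variant == "A":
        return [rng.standard_normal((emb_dim, n_classes)) * 0.01]
    return [rng.standard_normal((emb_dim, hidden)) * 0.01,
            rng.standard_normal((hidden, n_classes)) * 0.01]

def forward(layers, embeddings):
    """Run the back-end on fixed embeddings taken from the pre-trained model's
    penultimate layer; only these back-end weights would be trained."""
    h = embeddings
    for i, w in enumerate(layers):
        h = h @ w
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)  # ReLU on the hidden layer (assumption)
    return h
```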
All our CNNs were trained on mel-spectrograms. For the models trained from scratch, we used the implementation in Essentia with 96 bands. In the case of transfer learning, we used 96 bands for MusiCNN and VGG-I, and 64 bands for VGG-II. We opted for the feature extractors used by the authors of the pre-trained models, but re-implemented those mel-spectrograms in Essentia for inference.
To estimate the accuracy of each model we conduct a stratified 5-fold cross-validation, where each training split is further divided into 80% train and 20% validation subsets. After this, to take advantage of as much data as possible, the final CNN models that we evaluate on external datasets are trained using 80% of the entire data (20% is kept for validation).
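The splitting scheme can be sketched as follows. This is a simplified single-label stratification (the real experiments additionally carve a validation subset out of each training split, and multi-label datasets would need the stratification approach cited later):

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Assign each track index to one of k folds, distributing the tracks of
    each class round-robin so that class proportions are preserved per fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for _, indices in sorted(by_class.items()):
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```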
The models are trained on mini-batches of 32 samples. Each sample is a randomly selected segment of 3 seconds from a different track of the training set. Adam is used as the optimizer. The number of epochs is 600 for the models trained from scratch. The transfer learning models are trained for 150 epochs, as those models require fewer iterations to converge. All the models are initialized with a learning rate of 0.001. If the loss obtained on the validation set has not decreased for the last 75 epochs, the learning rate is halved.
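The learning-rate policy can be sketched by replaying a validation-loss history; this is a simplified model of the schedule described above, not the training code itself:

```python
def lr_schedule(val_losses, initial_lr=0.001, patience=75, factor=0.5):
    """Replay a validation-loss history and return the final learning rate:
    the rate is multiplied by `factor` whenever the loss has not improved
    for `patience` consecutive epochs."""
    lr = initial_lr
    best = float("inf")
    epochs_since_best = 0
    for loss in val_losses:
        if loss < best:
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                lr *= factor
                epochs_since_best = 0  # restart the patience window
    return lr
```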
The baseline for our experiments comprises the SVM classifiers available in Essentia (we used the latest version, Essentia 2.1-beta5). They rely on a combination of low-, mid- and high-level music audio features describing timbre, rhythm and tonality. The best parameters for the SVMs are found by a grid search within the 5-fold cross-validation, and the final SVM models that we evaluate are trained on the entire data (https://essentia.upf.edu/documentation/FAQ.html).
We used standard TensorFlow routines in Python for training and then stored the models into Protobuf files to be used in Essentia.
Table 3: Balanced accuracies per genre dataset for the SVM baseline, the models trained from scratch (MusiCNN, VGG-I) and the transfer learning models (MusiCNN (MSD-train), VGG-I (MSD-train), VGG-II (AudioSet)), in the 5-fold cross-validation and in the cross-collection evaluations on MSD-test and MTG-Jamendo-test.
3.2 Evaluation on genre recognition tasks
Given the small size of our datasets, overfitting can be an issue, and the results of the 5-fold cross-validation can be unreliable. For this reason, we conduct a cross-collection evaluation that consists of evaluating the models on an independent source of music and annotations, following a previously proposed methodology. This allows us to identify the model architecture and training strategy with the best generalization capabilities.
Unfortunately, we lack such external datasets to evaluate all our classifiers, but we are able to do it for the task of genre classification, for which we have three datasets: genre-dortmund, genre-gtzan and genre-rosamerica. As external data sources we use two datasets, both containing tag annotations that include genres:
MSD-test is the test set of 28,000 tracks from the MSD dataset with Last.fm tags. Note that MSD has also been used for the pre-trained MusiCNN and VGG-I models, but they were trained on the train split, and there is no overlap.
MTG-Jamendo-test is the test split of the MTG-Jamendo dataset, annotated with tags that include genres.
Following the same methodology, we took advantage of the taxonomy used by the Lastgenre plugin for Beets (http://beets.io) to generate ground-truth genre labels from the tags in MSD-test and MTG-Jamendo-test. We only considered tracks with one or more tags matching an element in the taxonomy. Those tags were mapped to their parents in the hierarchy (e.g., “progressive rock” to “rock”), unless there was a direct match to one of the classes of our classifiers. The resulting genre annotations are multi-label, and to evaluate each group of classifiers (corresponding to one of our in-house datasets) we use the subset of tracks that have a ground-truth label matching one of its classes. That is, we only give the classifiers music from genres they can theoretically predict. A prediction is considered correct if it matches one of the labels of the track.
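The mapping and the multi-label correctness criterion can be sketched with a toy taxonomy. The entries below are illustrative and are not the actual Beets Lastgenre taxonomy or our class sets:

```python
# Hypothetical miniature taxonomy: child tag -> parent genre
TAXONOMY = {
    "progressive rock": "rock",
    "hard rock": "rock",
    "bebop": "jazz",
}
# Hypothetical classes of one group of classifiers
CLASSIFIER_CLASSES = {"rock", "jazz", "pop"}

def ground_truth(tags):
    """Map track tags to classifier classes: keep direct matches, otherwise
    map to the parent genre in the taxonomy; unmatched tags are dropped."""
    labels = set()
    for tag in tags:
        if tag in CLASSIFIER_CLASSES:
            labels.add(tag)
        elif tag in TAXONOMY:
            labels.add(TAXONOMY[tag])
    return labels & CLASSIFIER_CLASSES

def is_correct(prediction, tags):
    # Multi-label ground truth: a prediction counts as correct
    # if it matches any of the track's mapped labels.
    return prediction in ground_truth(tags)
```

Tracks whose mapped label set is empty (no tag matches any class) would simply be excluded from the evaluation of that classifier group.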
Table 3 contains the balanced accuracies obtained by each architecture and training strategy in both the 5-fold cross-validation and the cross-collection evaluations on MSD-test and MTG-Jamendo-test. These accuracies are computed by averaging the individual recall values obtained for each class. For the cross-validation results, we indicate the standard deviation of the balanced accuracies across folds. Our results show that the transfer learning models, in particular VGG-II with AudioSet, consistently outperform the SVMs and the CNNs trained from scratch. Interestingly, the AudioSet model is not specifically trained on music content, yet it still achieves the best results, potentially due to the size and complexity of its training data.
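The balanced accuracy used throughout, i.e., the average of per-class recalls, can be computed as:

```python
def balanced_accuracy(y_true, y_pred, classes):
    """Balanced accuracy: the mean of per-class recalls, so that frequent
    classes do not dominate the score."""
    recalls = []
    for c in classes:
        predictions = [p for t, p in zip(y_true, y_pred) if t == c]
        if predictions:  # skip classes absent from the ground truth
            recalls.append(sum(p == c for p in predictions) / len(predictions))
    return sum(recalls) / len(recalls)
```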
3.3 Training Essentia models
As genre classification is a complex problem, we expect that the conclusions of the previous section also hold for the rest of our classification tasks. Therefore, we use the winning architecture and training strategy from the previous experiment to generate new models for the remaining tasks, including mood classification and other high-level music description.
Again, we used our SVM classifiers as a baseline. For further quality assessment of our models, we decided to manually annotate a subset of MTG-Jamendo-test with the classes of our classifiers. This subset contains approximately 1,000 tracks selected by a stratified approach [27, 28] in order to maximize the variety of music according to the associated tags. The annotations were made using only the labels from the taxonomies of our in-house datasets. The final number of tracks used to evaluate each model varies (from 599 to 1,000), as we discarded the tracks that the annotators could not match to any taxonomy class. We use these annotations as a ground truth to compare the predictions of the transfer learning models to those of the SVMs in terms of balanced accuracies.
Table 4 presents the results for all tasks, including the 5-fold cross-validation on the original datasets used for training as well as the evaluation on our manually annotated subset of MTG-Jamendo-test. As we can see, VGG-II with AudioSet improves the mean accuracies over the SVM baseline in the 5-fold cross-validation, although the difference is not statistically significant in many cases. Meanwhile, the results on the manually annotated subset of MTG-Jamendo-test show that our CNN models perform better, except for the models for mood-party, danceability, gender and timbre.
It is important to note that we did not extensively optimize the hyper-parameters of the models, yet we still obtained decent improvements on a number of the datasets, which opens possibilities for future work. Overall, we observe better generalization of the CNN models in the cross-collection evaluation for many of the datasets.
Table 4: Balanced accuracies for the 5-fold cross-validation and the evaluation on a manually annotated subset of MTG-Jamendo-test. Statistically significant improvements over the SVMs according to an independent samples t-test are marked in bold.
4 Conclusions
We have presented our development effort to add support for generic TensorFlow models in Essentia, a C++ library for audio and music analysis with Python bindings, being the first effort of its kind to integrate arbitrary deep learning models into an MIR library. The new functionality for using such models is designed to be fast, easy and flexible, and it is especially attractive for applications requiring computational efficiency, such as large-scale analysis of millions of tracks, real-time processing, or inference on low-power devices.
We provide a number of CNN audio tagging models ported from Python implementations by other researchers, as well as our own classifier models trained on in-house datasets. For the latter, we applied transfer learning techniques that outperform the previous Essentia classifiers based on SVMs. All of these models are publicly available for researchers and practitioners, and we plan to add more models in the future.
-  Dmitry Bogdanov, Nicolas Wack, Emilia Gómez, Sankalp Gulati, Perfecto Herrera, O. Mayor, Gerard Roma, Justin Salamon, J. R. Zapata, and Xavier Serra, “Essentia: an audio analysis library for music information retrieval,” in International Society for Music Information Retrieval Conference (ISMIR’13), 2013, pp. 493–498.
-  Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto, “librosa: Audio and music signal analysis in Python,” in Python in Science Conference (SciPy’15), 2015.
-  Florian Eyben, Martin Wöllmer, and Björn Schuller, “openSMILE: The Munich versatile and fast open-source audio feature extractor,” in ACM International Conference on Multimedia (MM’10), 2010, pp. 1459–1462.
-  Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer, “Madmom: A new Python audio and music signal processing library,” in ACM International Conference on Multimedia (MM’16), 2016, pp. 1174–1178.
-  Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al., “TensorFlow: A system for large-scale machine learning,” in USENIX Symposium on Operating Systems Design and Implementation (OSDI’16), 2016, pp. 265–283.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” in NIPS Autodiff Workshop, 2017.
-  David Moffat, David Ronan, and Joshua D Reiss, “An evaluation of audio feature extraction toolboxes,” in International Conference on Digital Audio Effects (DAFx’15), 2015.
-  Nicolas Wack, Enric Guaus, Cyril Laurier, Ricard Marxer, Dmitry Bogdanov, Joan Serrà, and Perfecto Herrera, “Music Type Groupers (MTG): Generic Music Classification Algorithms,” in Music Information Retrieval Evaluation Exchange (MIREX’09), 2009.
-  N. Wack, C. Laurier, O. Meyers, R. Marxer, D. Bogdanov, J. Serra, E. Gomez, and P. Herrera, “Music classification using high-level models,” in Music Information Retrieval Evaluation Exchange (MIREX’10), 2010.
-  C. Laurier, Automatic Classification of Musical Mood by Content-Based Analysis, Ph.D. thesis, Universitat Pompeu Fabra, 2011.
-  D. Bogdanov, J. Serrà, N. Wack, P. Herrera, and X. Serra, “Unifying low-level and high-level music similarity measures,” IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 687–701, 2011.
-  Dmitry Bogdanov, Martín Haro, Ferdinand Fuhrmann, Anna Xambó, Emilia Gómez, and Perfecto Herrera, “Semantic audio content-based music recommendation and visualization based on user preference examples,” Information Processing & Management, vol. 49, no. 1, pp. 13–33, 2013.
-  Kai R. Fricke, David M. Greenberg, Peter J. Rentfrow, and Philipp Yorck Herzberg, “Computer-based music feature analysis mirrors human perception and can be used to measure individual music preference,” Journal of Research in Personality, vol. 75, pp. 94–102, 2018.
-  Alastair Porter, Dmitry Bogdanov, Robert Kaye, Roman Tsukanov, and Xavier Serra, “AcousticBrainz: A community platform for gathering music information obtained from audio,” in International Society for Music Information Retrieval Conference (ISMIR’15), 2015.
-  Dmitry Bogdanov, Alastair Porter, Herrera Boyer, Xavier Serra, et al., “Cross-collection evaluation for music classification tasks,” in International Society for Music Information Retrieval Conference (ISMIR’16), 2016.
-  Jordi Pons, Oriol Nieto, Matthew Prockup, Erik Schmidt, Andreas Ehmann, and Xavier Serra, “End-to-end learning for music audio tagging at scale,” in International Society for Music Information Retrieval Conference (ISMIR’18), 2018.
-  Jongpil Lee, Jiyoung Park, Keunhyoung Kim, and Juhan Nam, “SampleCNN: End-to-end deep convolutional neural networks using very small filters for music classification,” Applied Sciences, vol. 8, no. 1, pp. 150, 2018.
-  Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., “CNN architectures for large-scale audio classification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’17), 2017, pp. 131–135.
-  Jordi Pons and Xavier Serra, “musicnn: Pre-trained convolutional neural networks for music audio tagging,” 2019.
-  Jordi Pons, Joan Serrà, and Xavier Serra, “Training neural audio classifiers with few data,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’19), 2019, pp. 16–20.
-  Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho, “Transfer learning for music classification and regression tasks,” in International Society for Music Information Retrieval Conference (ISMIR’17), 2017.
-  Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere, “The million song dataset,” in International Society for Music Information Retrieval Conference (ISMIR’11), 2011.
-  Edith Law, Kris West, Michael I Mandel, Mert Bay, and J Stephen Downie, “Evaluation of algorithms using games: The case of music tagging,” in International Society for Music Information Retrieval Conference (ISMIR’09), 2009.
-  Keunwoo Choi, Gyorgy Fazekas, and Mark Sandler, “Automatic tagging using deep convolutional neural networks,” in International Society for Music Information Retrieval Conference (ISMIR’16), 2016.
-  Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
-  Dmitry Bogdanov, Minz Won, Philip Tovstogan, Alastair Porter, and Xavier Serra, “The MTG-Jamendo dataset for automatic music tagging,” in Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML’19), 2019.
-  Konstantinos Sechidis, Grigorios Tsoumakas, and Ioannis Vlahavas, “On the stratification of multi-label data,” Machine Learning and Knowledge Discovery in Databases, pp. 145–158, 2011.
-  P. Szymański and T. Kajdanowicz, “A scikit-based Python environment for performing multi-label classification,” ArXiv e-prints, Feb. 2017.