Python code to reproduce our article "Toward faultless content-based playlists generation for instrumentals"
This study deals with content-based musical playlist generation focused on Songs and Instrumentals. Automatic playlist generation relies on collaborative filtering and autotagging algorithms. Autotagging can solve the cold start issue and popularity bias that are critical in music recommender systems. However, autotagging remains to be improved and cannot yet generate satisfying music playlists. In this paper, we suggest improvements toward better autotagging-generated playlists compared to the state of the art. To assess our method, we focus on the Song and Instrumental tags. Song and Instrumental are two objective and opposite tags that are under-studied compared to genres or moods, which are subjective and multi-modal tags. In this paper, we consider an industrial real-world musical database that is unevenly distributed between Songs and Instrumentals and larger than the databases used in previous studies. We set up three incremental experiments to enhance automatic playlist generation. Our suggested approach generates an Instrumental playlist with up to three times fewer false positives than cutting-edge methods. Moreover, we provide a design of experiment framework to foster research on Songs and Instrumentals. We give insight into how to further improve the quality of generated playlists and how to extend our methods to other musical tags. Furthermore, we provide the source code to guarantee reproducible research.
Playlists are becoming the main way of consuming music (Song et al., 2012; Wikström, 2015; Choi et al., 2016; Nakano et al., 2016). This phenomenon is also confirmed on web streaming platforms, where playlists represent 40% of musical streams, as stated by De Gemini from Deezer (http://deezer.com, accessed on 27 September 2017) during the last MIDEM (http://musically.com/2016/06/05/music-curation-and-playlists-the-new-music-battleground-midem, accessed on 27 September 2017). Playlists also play a major role in other media like radios, personal devices such as laptops, smartphones (Thalmann et al., 2016), MP3 players (Nettamo et al., 2006), and connected speakers. Users can manually create their playlists, but a growing number of them listen to automatically generated playlists (Uitdenbogerd and Schyndel, 2002) created by music recommender systems (Yoshii et al., 2007; Schedl et al., 2015) that suggest tracks fitting the taste of each listener.
Such playlist generation implicitly requires selecting tracks with a common characteristic like genre or mood. This equates to annotating tracks with meaningful information called tags (Jäschke et al., 2007). A musical piece can gather one or multiple tags that can be comprehensible by common human listeners, such as "happy", or not, like "dynamic complexity" (Streich, 2006; Laurier and Herrera, 2007). A tag can also be related to the audio content, such as "rock" or "high tempo". Moreover, editorial writers can provide tags like "summer hit" or "70s classic". Turnbull et al. (2008) distinguish five methods to collect music tags. Three of them require humans, e.g. social tagging websites (Shardanand and Maes, 1995; Breese et al., 1998; Levy and Sandler, 2007; Shepitsen et al., 2008) such as Last.fm (https://www.last.fm/, accessed on 27 September 2017), music annotation games (Law et al., 2007; Turnbull et al., 2007; Mandel and Ellis, 2008), and online polls (Turnbull et al., 2008). The last two tagging methods are computer-based and include text mining web documents (Whitman and Ellis, 2004; Knees et al., 2007) and audio content analysis (Tzanetakis and Cook, 2002; Bertin-Mahieux et al., 2010; Prockup et al., 2015). Multiple drawbacks stand out when reviewing the different tagging methods. Indeed, human labelling is time-consuming (Kim and Whitman, 2002; Skowronek et al., 2006) and prone to mistakes (Sturm, 2013, 2015). Furthermore, human labelling and text mining web documents are limited by the ever-growing musical databases that increase by 4,000 new CDs per month (Pachet and Roy, 1999) in western countries. Hence, this amount of music cannot be labelled by humans and implies that some tracks cannot be recommended because they are not rated or tagged (Eck et al., 2007; Li et al., 2007; Schafer et al., 2007; Schlüter and Grill, 2015).
This lack of labelling is a vicious circle in which unpopular musical pieces remain poorly labelled, whereas popular ones are more likely to be annotated on multiple criteria (Eck et al., 2007) and therefore found in multiple playlists (http://www.billboard.com/biz/articles/news/digital-and-mobile/5944950/the-echo-nest-cto-brian-whitman-on-spotify-deal-man-vs, accessed on 27 September 2017). This phenomenon is known as the cold start issue or as the data sparsity problem (Song et al., 2012). Text mining web documents is tedious and error-prone, as it implies collecting and sorting redundant, contradictory, and semantic-based data from multiple sources. Audio content-based tagging is faster than human labelling and solves the major problems of cold starts, popularity bias, and human-gathered tags (Logan, 2002; Hoashi et al., 2003; Celma et al., 2005; Eck et al., 2007; Sordo et al., 2007; Turnbull et al., 2007; Mandel and Ellis, 2008; Tingle et al., 2010). A makeshift solution combines the multiple tag-generating methods (Bu et al., 2010) to produce robust tags and to process every track. However, audio content analysis alone still needs improvement for subjective and ambivalent tags such as genre (Hsu et al., 2016; Jeong and Lee, 2016; Lu et al., 2016; Oramas et al., 2016).
In light of all these issues, a new paradigm is needed to rethink the classification problem and focus on a well-defined question (http://ejhumphrey.com/?p=302, accessed on 27 September 2017) that needs solving (Sturm, 2016) to break the "glass ceiling" (Wiggins, 2009) in Music Information Retrieval (MIR). Indeed, setting up a problem with a precise definition will lead to better features and classification algorithms. Certainly, cutting-edge algorithms are not suited for faultless playlist generation since they are built to balance precision and recall. The presence of a few wrong tracks in a playlist diminishes the trust of the user in the perceived service quality of a recommender system (Chau et al., 2013) because users are more sensitive to negative than to positive messages (Yin et al., 2010). A faultless playlist based on a tag needs an algorithm that achieves perfect precision while maximizing recall. It is possible to partially reach this aim by maximizing the precision and optimizing the corresponding recall, which is a different issue than optimizing the f-score. A low recall is not a downside when considering the large amount of tracks available on audio streaming applications. For example, Deezer provides more than 40 million tracks in 2017 (https://www.deezer.com/features, accessed on 27 September 2017). Moreover, the maximum playlist size authorized on streaming platforms varies from 1,000 for Deezer (http://support.deezer.com/hc/en-gb/articles/201193652-Is-there-a-limit-to-the-amount-of-tracks-in-a-playlist-, accessed on 27 September 2017) to 10,000 for Spotify (https://community.spotify.com/t5/Desktop-Linux-Windows-Web-Player/Maximum-songs-on-playlists/td-p/108021, accessed on 27 September 2017), while YouTube (https://developers.google.com/youtube/2.0/developers-guide-protocol-playlists?csw=1, accessed on 27 September 2017) and Google Play Music have a limit of 5,000 tracks per playlist. However, the private playlists of Deezer users contain a mean of 27 tracks with a standard deviation of 70 tracks (personal communication from Manuel Moussallam, Deezer R&D team). Thus, it seems feasible to create tag-based playlists containing hundreds of tracks from large-scale musical databases.
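The feasibility argument above can be checked with back-of-envelope arithmetic using the figures from the text (Deezer catalogue size, Spotify playlist limit, mean private playlist size on Deezer); the 1% recall is a deliberately pessimistic assumption for a high-precision tagger.

```python
# Back-of-envelope check that a low recall still fills many playlists.
# All figures come from the text; the 1% recall is an assumed worst case.
CATALOGUE_SIZE = 40_000_000   # tracks available on Deezer in 2017
RECALL = 0.01                 # pessimistic recall of a high-precision tagger
MAX_PLAYLIST_SIZE = 10_000    # Spotify's per-playlist limit
MEAN_USER_PLAYLIST = 27       # mean size of private playlists on Deezer

retrieved = int(CATALOGUE_SIZE * RECALL)
print(retrieved)                        # 400,000 retrieved tracks
print(retrieved // MAX_PLAYLIST_SIZE)   # 40 maximum-size playlists
print(retrieved // MEAN_USER_PLAYLIST)  # thousands of user-sized playlists
```

Even at this pessimistic recall, the retrieved tracks fill 40 maximum-size playlists, which supports generating hundreds-of-tracks playlists from large-scale catalogues.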
In this article, we focus on improving audio content analysis to enhance playlist generation. To do so, we perform Songs and Instrumentals Classification (SIC) in a musical database. Songs and Instrumentals are well-defined, relatively objective, mutually exclusive, and always relevant (Gouyon et al., 2014). We define a Song as a musical piece containing one or multiple singing voices, either related to lyrics or onomatopoeias, and that may or may not contain instrumentation. An Instrumental is thus defined as a musical piece that does not contain any sound directly or indirectly coming from the human voice. An example of an indirect sound made by the human voice is the talk box effect audible in Rocky Mountain Way by Joe Walsh.
People listen to instrumental music mostly for leisure. However, we chose to focus on Instrumental detection in this study because Instrumentals are essential in therapy (Rosenblatt, 2015) and in learning enhancement methods (Suárez et al., 2016; Zhao and Kuhl, 2016). Nevertheless, audio content analysis is currently limited by the distinction of singing voices from instruments that mimic voices. Such distinction mistakes lead to plenty of Instrumentals being labelled as Songs. Aerophones and fretless stringed instruments, for example, are known to produce pitch modulations similar to those of the human voice (Rao et al., 2009; Panteli et al., 2017). This study focuses on improving Instrumental detection in musical databases because the current state-of-the-art algorithms are unable to generate a faultless playlist with the tag Instrumental (Ghosal et al., 2013; Bayle et al., 2016). Moreover, the precision and accuracy of SIC algorithms decline when faced with bigger musical databases (Bayle et al., 2016; Bogdanov et al., 2016). The ability of these classification algorithms to generate faultless playlists is consequently discussed here.
In this paper, we define solutions to generate better Instrumental and Song playlists. This is not a trivial task because Singing Voice Detection (SVD) algorithms cannot directly be used for SIC. Indeed, SVD aims at detecting the presence of singing voice at the frame scale for one track, but related algorithms produce too many false positives (Lehner et al., 2014), especially when faced with Instrumentals. Our work addresses this issue and the major contributions are:
The first review of SIC systems in the context of playlist generation.
The first formal design of experiment of the SIC task.
We show that the use of frame features outperforms the use of global track features in the case of SIC and thus diminishes the risk of an algorithm being a "Horse".
A knowledge-based and easily explainable SIC algorithm that can process large musical databases, whereas state-of-the-art algorithms cannot.
A new track tagging method based on frame predictions that outperforms the Markov model in terms of accuracy and f-score.
A demonstration that better playlists related to a tag can be generated when the autotagging algorithm focuses only on this tag.
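The frame-based tagging idea listed above can be sketched as a simple aggregation of per-frame predictions into one track tag. The majority-vote threshold below is an illustrative assumption, not the exact aggregation rule proposed later in the paper.

```python
def track_tag_from_frames(frame_preds, voiced_threshold=0.5):
    """Aggregate per-frame singing-voice predictions (1 = voiced frame,
    0 = unvoiced) into a single track tag. The majority-vote threshold
    is an illustrative choice, not the rule used in the paper."""
    if not frame_preds:
        raise ValueError("no frame predictions")
    voiced_ratio = sum(frame_preds) / len(frame_preds)
    return "Song" if voiced_ratio >= voiced_threshold else "Instrumental"

# A track whose frames are mostly unvoiced is tagged Instrumental,
# even if a few frames are mistaken for singing voice.
print(track_tag_from_frames([0, 0, 1, 0, 0, 0, 0, 0]))  # Instrumental
```

Aggregating at the track scale makes isolated frame-level false positives less damaging than taking any single frame decision at face value.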
As the major problem in MIR tasks is the lack of a large and cleanly labelled musical database (Yoshii et al., 2007; Casey et al., 2008), we detail in Section 2 the use of SATIN (Bayle et al., 2017), which is a persistent musical database. This section also details the solution we use to guarantee the reproducibility of our research code over SATIN. In Section 3 we describe the state-of-the-art methods in SIC and we detail their implementation in Section 4. We then evaluate their performances and limitations in three experiments from Section 5 to Section 7. Section 8 settles the formalism for the new paradigm as described by Sturm (2016) and compares our newly proposed method to the state-of-the-art methods. We finally discuss our results and perspectives in Section 9.
The musical database considered in this paper is twofold. The first part of the musical database comprises 186 musical tracks evenly distributed between Songs and Instrumentals. Tracks were chosen from previously existing musical databases. This first part of our musical database is further referred to as . All tracks are available for research purposes and are commonly used by the MIR community (Ramona et al., 2008; Bittner et al., 2014; Lehner et al., 2014; Liutkus et al., 2014; Schlüter and Grill, 2015; Schlüter, 2016). It includes tracks from the MedleyDB database (Bittner et al., 2014), the ccMixter database (Liutkus et al., 2014), and the Jamendo database (Ramona et al., 2008).
The Jamendo database (http://www.mathieuramona.com/wp/data/jamendo, accessed on 27 September 2017) has been proposed by Ramona et al. (2008) and contains 93 Songs and the corresponding annotations at the frame scale concerning the presence of a singing voice. These Songs have been retrieved from Jamendo Music (https://www.jamendo.com, accessed on 27 September 2017).
We chose tracks from the Jamendo database because the MIR community already provided ground truths concerning the presence of a singing voice at the frame scale (Ramona et al., 2008). These frame scale ground truths are indeed needed for the training process of the algorithm proposed in Section 8. There are only 93 Songs because producing the corresponding frame scale ground truths is a tedious task, which is, to some extent, ill-defined (Kim and Whitman, 2002). We chose tracks from the MedleyDB database because they are tagged as Instrumentals per se, whereas we chose tracks from the ccMixter database because they were meant to accompany a singing voice. Choosing such different tracks helps to reflect the diversity of Instrumentals.
The second part of the musical database comes from the SATIN database (Bayle et al., 2017) and will be referred to as . It is uneven and references 37,035 Songs and 4,456 Instrumentals, leading to a total of 41,491 tracks that are identified by their International Standard Recording Code (ISRC, http://isrc.ifpi.org/en, accessed on 27 September 2017) provided by the International Federation of the Phonographic Industry (IFPI, http://www.ifpi.org/, accessed on 27 September 2017). These standard identifiers allow a unique identification of the different releases of a track over the years and across the interpretations of different artists. The corresponding features of the tracks contained in SATIN have been extracted for Bayle et al. (2017) by Simbals (http://www.simbals.com, accessed on 27 September 2017) and Deezer and are stored in SOFT1. To allow reproducibility, we provide the list of ISRCs used for the following experiments along with our reproducible code on our GitHub account (https://github.com/ybayle/SMC2017, accessed on 27 September 2017). The point of sharing the ISRC of each track is to facilitate result comparison between future studies and our own.
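Since every track is identified by its ISRC, a small helper can validate and split these identifiers. The sketch below follows the public ISO 3901 layout (2-letter country code, 3-character registrant code, 2-digit year of reference, 5-digit designation code); the example code passed at the end is hypothetical, not a real release.

```python
import re

# Minimal ISRC parser following the ISO 3901 layout: a 2-letter country
# code, a 3-character registrant code, a 2-digit year of reference, and
# a 5-digit designation code. Hyphens are optional in the written form.
ISRC_RE = re.compile(r"^([A-Z]{2})-?([A-Z0-9]{3})-?([0-9]{2})-?([0-9]{5})$")

def parse_isrc(code):
    match = ISRC_RE.match(code.upper())
    if match is None:
        raise ValueError(f"invalid ISRC: {code!r}")
    country, registrant, year, designation = match.groups()
    return {"country": country, "registrant": registrant,
            "year": year, "designation": designation}

print(parse_isrc("FRZ039101231"))  # hypothetical code, not a real release
```

Normalizing ISRCs this way makes it straightforward to deduplicate releases and to match the shared identifier lists against other databases.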
As far as we know, only a few recent studies have been dedicated to SIC (Ghosal et al., 2013; Hespanhol, 2013; Zhang and Kuo, 2013; Gouyon et al., 2014; Bayle et al., 2016) compared to the extensive literature devoted to music genre recognition (Sturm, 2014), for example. The SIC task in a database must not be confused with the SVD task that tries to identify the presence of a singing voice at the frame scale for one track. In this section, we describe existing algorithms for SIC and we benchmark them in the next section.
Ghosal et al. (2013) posit that Songs differ from Instrumentals in the stable frequency peaks of the spectrogram visible in the MFCC. The authors then categorize an in-house, evenly distributed database of 540 tracks with a classifier based on Random Sample Consensus (RANSAC) (Fischler and Bolles, 1981). Their algorithm reaches an accuracy of 92.96% in a 2-fold cross-validation classification task. This algorithm will hereafter be denoted as GA.
The SVMBFF algorithm (Gouyon et al., 2014) relies on statistics of low-level features extracted at the frame scale. The seventeen low-level features extracted from each frame are normalized and consist of the zero crossing rate, the spectral centroid, the spectral roll-off and flux, and the first thirteen MFCC. A linear Support Vector Machine (SVM) classifier is trained to output probabilities from the mean and the standard deviation of these low-level features, from which tags are selected. The authors tested SVMBFF against three different musical databases comprising between 502 and 2,349 tracks. The f-score of SVMBFF ranges from 0.89 to 0.95 for Songs across the three musical databases. As for Instrumentals, the f-score is between 0.45 and 0.80. The authors did not comment on this substantial variation, and readers can foresee that the poor performance in Instrumental detection is not yet well understood.
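A minimal sketch of this kind of pipeline, assuming synthetic stand-ins for the seventeen per-frame features: each track is summarized by the mean and standard deviation of its frame features before a linear SVM with probability outputs is trained. The data and dimensions are illustrative, not the real database.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def track_statistics(frame_features):
    """Summarize a (n_frames, 17) matrix of per-frame low-level features
    (ZCR, centroid, roll-off, flux, 13 MFCC) by its mean and standard
    deviation, giving one 34-dimensional vector per track."""
    return np.concatenate([frame_features.mean(axis=0),
                           frame_features.std(axis=0)])

# Synthetic stand-ins for the per-frame features of 40 tracks.
songs = [rng.normal(1.0, 1.0, size=(200, 17)) for _ in range(20)]
instrumentals = [rng.normal(-1.0, 1.0, size=(200, 17)) for _ in range(20)]
X = np.array([track_statistics(f) for f in songs + instrumentals])
y = np.array([1] * 20 + [0] * 20)  # 1 = Song, 0 = Instrumental

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))
clf.fit(X, y)
print(clf.predict_proba(X[:1]))  # class probabilities for one track
```

The key design point is that all temporal information is collapsed into two order statistics per feature, which is exactly what makes such track scale representations cheap but coarse.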
VQMM first extracts MFCC with an analysis frame of 93 ms and an overlap of 50%. VQMM then codes a signal using vector quantization (VQ) in a learned codebook. Afterwards, it estimates conditional probabilities in first-order Markov models (MM). The originality of this approach lies in its statistical language modelling. The authors tested VQMM against three different musical databases comprising between 502 and 2,349 tracks. The f-score of VQMM ranges from 0.83 to 0.95 for Songs across the three musical databases. The f-score for Instrumentals is between 0.54 and 0.66. As for SVMBFF, the f-score for Instrumentals is lower than the f-score for Songs and illustrates the difficulty of correctly detecting Instrumentals, regardless of the musical database.
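The VQ-plus-Markov idea can be sketched as follows, with a KMeans codebook standing in for the learned VQ codebook and synthetic frame sequences in place of real MFCC. The codebook size and the data are illustrative assumptions, not the settings of the original VQMM.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
K = 8  # codebook size; an illustrative value, not the paper's setting

def fit_markov(sequences, k):
    """First-order transition probabilities with add-one smoothing."""
    counts = np.ones((k, k))
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def log_likelihood(seq, trans):
    return float(sum(np.log(trans[a, b]) for a, b in zip(seq[:-1], seq[1:])))

# Synthetic stand-ins for per-frame MFCC: one class drifts (random walk),
# the other stays stationary, so their symbol dynamics differ.
song_tracks = [rng.normal(0, 1, (300, 13)).cumsum(axis=0) for _ in range(10)]
inst_tracks = [rng.normal(0, 1, (300, 13)) for _ in range(10)]

codebook = KMeans(n_clusters=K, n_init=10, random_state=0)
codebook.fit(np.vstack(song_tracks + inst_tracks))

song_mm = fit_markov([codebook.predict(t) for t in song_tracks], K)
inst_mm = fit_markov([codebook.predict(t) for t in inst_tracks], K)

def classify(track):
    seq = codebook.predict(track)
    if log_likelihood(seq, song_mm) > log_likelihood(seq, inst_mm):
        return "Song"
    return "Instrumental"
```

Unlike a bag of frames, the per-class transition models keep the temporal order of the quantized frames, which is the statistical language modelling aspect mentioned above.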
Gouyon et al. (2014) used a variation of the sparse representation classification (SRC) (Panagakis et al., 2009; Wright et al., 2009; Sturm, 2012; Sturm and Noorzad, 2012) applied to auditory temporal modulation features (AM). Gouyon et al. (2014) tested SRCAM against three different musical databases comprising between 502 and 2,349 tracks. The f-score of SRCAM ranges from 0.90 to 0.95 for Songs across the three musical databases. The f-score for Instrumentals is between 0.57 and 0.80. As for SVMBFF and VQMM, the f-score for Instrumentals is lower than the f-score for Songs.
GA and SVMBFF use track scale features, whereas VQMM uses features at the frame scale. The three algorithms use thirteen MFCC, as these features are well known to capture the presence of a singing voice in tracks. GA, SVMBFF, and VQMM are all tested under K-fold cross-validation on the same musical database. In the next section, we compare the performances of these three algorithms on the musical database.
This section describes the implementation we used to benchmark existing algorithms for SIC. For all algorithms, the features proposed in SOFT1 were extracted and provided by Simbals and Deezer, thanks to the identifiers contained in SATIN. More technical details about the classification process can be found on our previously mentioned GitHub repository.
Ghosal et al. (2013) did not provide source code for reproducible research, so the YAAFE toolbox (http://yaafe.sourceforge.net, accessed on 27 September 2017) was used to extract the corresponding MFCC in this study. The RANSAC algorithm provided by the Python package scikit-learn (Pedregosa et al., 2011) is used for classification.
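A hedged sketch of such a RANSAC-based classification in scikit-learn follows: a robust linear regression is fitted to binary targets and its output is thresholded. The synthetic MFCC means and the 0.5 cut-off are illustrative assumptions, not the exact pipeline of Ghosal et al. (2013).

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(2)

# Synthetic per-track MFCC means (13 dimensions); in the benchmark these
# would be extracted with YAAFE from the actual audio.
X = np.vstack([rng.normal(1.0, 1.0, (30, 13)),
               rng.normal(-1.0, 1.0, (30, 13))])
y = np.array([1.0] * 30 + [0.0] * 30)  # 1 = Song, 0 = Instrumental

# RANSAC robustly fits a linear model while ignoring outliers; thresholding
# its output at 0.5 turns the regression into a binary classifier
# (the cut-off is an illustrative choice).
ransac = RANSACRegressor(random_state=0).fit(X, y)

def predict_tag(features):
    return ("Song" if ransac.predict(features.reshape(1, -1))[0] >= 0.5
            else "Instrumental")

predictions = [predict_tag(x) for x in X]
```

RANSAC's inlier selection is what gives the method its robustness to mislabelled or atypical tracks, at the cost of limited tunability, a point that matters later when operating points are discussed.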
SRCAM (Gouyon et al., 2014) is dismissed because its source code is in Matlab. Indeed, as tracks are stored on a remote industrial server, only algorithms whose programming language is supported by our industrial partner can be computed. It would be interesting to implement SRCAM in Python or in C to assess its performance on , but SRCAM displays results similar to those of SVMBFF on three different musical databases (Gouyon et al., 2014).
In MIR, the aim of a classification task is to generate an algorithm capable of labelling each track of a musical database with meaningful tags. Previous studies in SIC used musical databases containing between 502 and 2,349 unique tracks and performed a cross-validation with two to ten folds (Ghosal et al., 2013; Hespanhol, 2013; Zhang and Kuo, 2013; Gouyon et al., 2014; Bayle et al., 2016). This section introduces a similar experiment by benchmarking existing algorithms on a new musical database. Table 1 displays the accuracy and the f-score of GA, SVMBFF, and VQMM with a 5-fold cross-validation classification task on .
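The evaluation protocol of this experiment can be sketched with scikit-learn's cross_validate, reporting both accuracy and f-score over five folds. The synthetic features below stand in for the real database and classifiers; only the protocol matches the experiment.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic track-level features standing in for the real database.
X = np.vstack([rng.normal(0.8, 1.0, (100, 17)),
               rng.normal(-0.8, 1.0, (100, 17))])
y = np.array([1] * 100 + [0] * 100)  # 1 = Song, 0 = Instrumental

# 5-fold cross-validation with the two metrics reported in Table 1.
scores = cross_validate(SVC(kernel="linear"), X, y, cv=5,
                        scoring=("accuracy", "f1"))
print(scores["test_accuracy"].mean(), scores["test_f1"].mean())
```

With an integer `cv` and a classifier, scikit-learn stratifies the folds, so each fold keeps the class proportions of the whole set.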
The mean accuracy and f-score of the three algorithms do not differ significantly (one-way ANOVA). The high variance and the low accuracy and f-score of the three algorithms indicate that these algorithms are too dependent on the musical database and are not suitable for commercial applications.
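The significance test used here can be reproduced with scipy's one-way ANOVA on the per-fold scores. The per-fold accuracies below are illustrative numbers, not the values of Table 1.

```python
from scipy.stats import f_oneway

# Hypothetical per-fold accuracies for the three algorithms over the
# 5-fold cross-validation (illustrative numbers, not Table 1's values).
ga     = [0.63, 0.61, 0.66, 0.60, 0.65]
svmbff = [0.62, 0.64, 0.60, 0.63, 0.61]
vqmm   = [0.65, 0.62, 0.64, 0.61, 0.63]

stat, p = f_oneway(ga, svmbff, vqmm)
# A p-value above the usual 0.05 threshold means the mean scores do not
# differ significantly across the three algorithms.
print(stat, p)
```

When the between-algorithm variance is small relative to the between-fold variance, as in this sketch, the F statistic stays low and the null hypothesis of equal means cannot be rejected.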
K-fold cross-validation on the same musical database is regularly used as an approximation of the performance of a classifier on different musical databases. However, the size of the musical databases used in previous studies on SIC seems to be insufficient to assert the validity of any classification method (Livshin and Rodet, 2003; Guaus, 2009). Indeed, evaluating an algorithm on such small musical databases, even with the use of K-fold cross-validation, does not guarantee its generalization abilities because the included tracks are not necessarily representative of all existing musical pieces (Ng, 1997). K-fold cross-validation on small musical databases is indeed prone to biases (Herrera et al., 2003; Livshin and Rodet, 2003; Bogdanov et al., 2011); hence, additional cross-database experiments are recommended in other scientific fields (Chudáček et al., 2009; Bekios-Calfa et al., 2011; Llamedo et al., 2012; Erdoğmuş et al., 2014; Fernández et al., 2015). Yet, creating a novel and large training set with corresponding ground truths consumes plenty of time and resources. In fact, in the big data era, only a small proportion of all existing tracks is reliably tagged in the musical databases of listeners or of the industry, as can be seen on Last.fm or Pandora (https://www.pandora.com, accessed on 27 September 2017), for example. Thus, the numerous unlabelled tracks can only be classified with very few training data. The precision of the classification reached in these conditions is uncertain. The next section tackles this issue.
This section compares the accuracy and the f-score of GA, SVMBFF, and VQMM in a cross-database validation experiment. This experiment employs a test set that is 48 times bigger than the train set. This is a scale-up experiment compared to the number of tracks used in the previous experiment. The reason for the use of a bigger test set is twofold. Firstly, this setting mimics conditions in which there are more untagged than tagged data, which is common in the music industry. Secondly, existing classification algorithms for SIC cannot handle such an amount of musical data due to limitations of their machine learning implementations during the training process.
The test set of 8,912 tracks is evenly distributed between Songs and Instrumentals. As there are fewer Instrumentals than Songs, all of them are used, while eight successive random samples of Songs are taken without replacement. In Table 2, we compare the accuracy and f-score of GA, SVMBFF, and VQMM.
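The sampling scheme can be sketched in pure Python: shuffle the Songs once, then slice successive non-overlapping samples the size of the Instrumental set. The track identifiers are hypothetical stand-ins for the real ISRC lists.

```python
import random

random.seed(0)
# Hypothetical track identifiers standing in for the real ISRC lists.
songs = [f"song_{i}" for i in range(37_035)]
instrumentals = [f"inst_{i}" for i in range(4_456)]

# Shuffle once, then slice: the eight successive Song samples are drawn
# without replacement, and every Instrumental appears in each test set.
pool = songs[:]
random.shuffle(pool)
n = len(instrumentals)
song_samples = [pool[i * n:(i + 1) * n] for i in range(8)]
test_sets = [sample + instrumentals for sample in song_samples]

print(len(test_sets[0]))  # 8,912 tracks, evenly split between the classes
```

Slicing a single shuffled pool guarantees that no Song appears in two samples, which is what "without replacement" requires across the eight repetitions.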
The accuracy and f-score of VQMM are higher than those of GA and SVMBFF, which may come from the use of local features by VQMM, whereas GA and SVMBFF use track scale features. Indeed, the accuracy and the f-score of GA, SVMBFF, and VQMM differ significantly (post-hoc Dunn test). The accuracy of VQMM is respectively 0.086 (13.8%) and 0.143 (25.3%) higher than those of GA and SVMBFF. The f-score of VQMM is respectively 0.103 (17.1%) and 0.165 (30.4%) higher than those of GA and SVMBFF.
Compared to the results of the same-database validation in the first experiment, the three algorithms have a lower accuracy: -0.011 (-1.7%), -0.121 (-17.6%), and -0.047 (-6.2%) for GA, SVMBFF, and VQMM, respectively. The same trend is visible for the f-score, with -0.021 (-3.4%), -0.154 (-22.1%), and -0.046 (-6.1%) for GA, SVMBFF, and VQMM, respectively.
The lower accuracy and f-score of the three algorithms in this experiment support the conjecture that same-database validation is not a suitable experiment to assess the performances of an autotagging algorithm (Herrera et al., 2003; Livshin and Rodet, 2003; Guaus, 2009; Bogdanov et al., 2011). Moreover, the low accuracy and f-score of GA and SVMBFF on this unseen database reveal that those algorithms might be "Horses" and might have overfit the databases proposed by their respective authors. GA, SVMBFF, and VQMM are thus limited in accuracy and f-score when a bigger musical database is used, even if its size is far from reaching the 40 million tracks available via Deezer. It is highly probable that the accuracy and f-score of GA, SVMBFF, and VQMM will diminish further when faced with millions of tracks.
Furthermore, there is an uneven distribution of Songs and Instrumentals in personal and industrial musical databases. Indeed, the prevalence of tracks containing singing voice in the recorded music industry is indubitable. Instrumentals represent 11 to 19% of all tracks in musical databases (personal communication from Manuel Moussallam, Deezer R&D team). The next section investigates the possible differences in performance caused by this uneven distribution.
This section evaluates the impact of a disequilibrium between Songs and Instrumentals on the precision, the recall, and the f-score of GA, SVMBFF, and VQMM. It was not possible to compare the existing algorithms dedicated to SIC using a K-fold cross-validation because the implementations of VQMM and SVMBFF cannot train on such a great amount of musical features and crashed when we tried to do so. This section presents a cross-database experiment with the 186 tracks of the balanced train set and a test set composed of 37,035 Songs (89%) and 4,456 Instrumentals (11%). We compare in Table 3 the accuracy and the f-score of GA, SVMBFF, and VQMM. To understand what happens with the uneven distribution, we also report the results produced by a random classification algorithm, further denoted RCA, in which half of the musical database is randomly classified as Songs and the other half as Instrumentals.
VQMM, which uses frame scale features, has a higher accuracy and f-score than GA and SVMBFF, which use track scale features. GA and VQMM perform better than RCA in terms of accuracy and f-score, contrary to SVMBFF. The results of SVMBFF seem to depend on the context, i.e., on the musical database, as it displays a lower global accuracy and f-score than RCA. The poor performance of SVMBFF might be explained by the imbalance between Songs and Instrumentals. As there is an uneven distribution between Instrumentals and Songs in musical databases, we now analyse the precision, recall, and f-score for each class.
Table 4 displays the precision and the recall of Song detection for GA, SVMBFF, and VQMM against RCA and against the algorithm AllSong, which classifies every track as Song.
The precision of RCA and AllSong corresponds to the prevalence of the tag in the musical database. RCA has a 50% recall because half of the retrieved tracks are of interest, whereas AllSong has a recall of 100%. For GA, SVMBFF, and VQMM, there is an increase in precision of 0.02 (2.1%), 0.04 (4.8%), and 0.07 (7.5%), respectively, compared to RCA and AllSong.
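The relation between prevalence and baseline precision can be checked on a toy database with the same 89%/11% imbalance as the test set; the counts are illustrative, only the proportions match the experiment.

```python
def precision_recall(y_true, y_pred, positive="Song"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy database with the same 89% / 11% imbalance as the test set.
y_true = ["Song"] * 89 + ["Instrumental"] * 11
all_song = ["Song"] * 100  # the AllSong baseline

p, r = precision_recall(y_true, all_song)
print(p, r)  # precision equals the prevalence of Songs (0.89), recall is 1.0
```

Because AllSong retrieves the whole database, its precision is exactly the class prevalence and its recall is trivially 100%, which is the baseline the learned algorithms only marginally beat.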
Tagging every track of a musical database as Song leads to an f-score similar to those of the state-of-the-art algorithms because Songs form the majority of such a database. Indeed, AllSong achieves a recall of 100%, which significantly increases the f-score. The f-score is also increased by the high precision, which corresponds to the prevalence of Songs in our musical database. In sum, these results indicate that the best Song playlist can be obtained by classifying every track of an uneven musical database as Song and that there is no need for a specific or complex algorithm. We study in the next section the impact of such random classification on Instrumentals.
Table 5 displays the precision and the recall of Instrumental detection for GA, SVMBFF, and VQMM against RCA and against the algorithm AllInstrumental, which classifies every track as Instrumental.
As with AllSong, the precision of RCA and AllInstrumental corresponds to the prevalence of the Instrumental tag in the musical database. RCA has a 50% recall because half of the retrieved tracks are of interest, whereas AllInstrumental has a recall of 100%. The precision of GA, SVMBFF, and VQMM is 0.06 (57.3%), 0.02 (13.6%), and 0.19 (170.9%) higher than that of RCA, respectively. As for the previous experiments, the better performance of VQMM over GA and SVMBFF might be imputable to the use of features at the frame scale. Even if the use of frame scale features by VQMM provides better performances than GA and SVMBFF, the precision remains very low for Instrumentals, as VQMM only reaches 29.8%.
In light of those results, guaranteeing faultless Instrumental playlists seems to be impossible with current algorithms. Indeed, Instrumentals are not correctly detected in our musical database with state-of-the-art methods, which reach, at best, a precision of 29.8%. As for the detection of Songs, classifying every track as a Song in our musical database produces a high precision that is only slightly improved by GA, SVMBFF, or VQMM. A human listener might find the difference between a playlist generated by GA, SVMBFF, or VQMM and one generated by AllSong inconspicuous. However, producing an Instrumental playlist remains a challenge. The best Instrumental playlist feasible with GA, SVMBFF, or VQMM contains at least 35 false positives —i.e., Songs— out of every 50 tracks, according to our experiments. It is highly probable that listeners will notice it. Thus, the precision of existing methods is not sufficient to produce a faultless Instrumental playlist. One might think a solution could be to select a different operating point on the receiver operating characteristic (ROC) curve.
Figure 1 shows the ROC curve of the three algorithms and the area under the curve (AUC) for Songs.
The ROC curves of Figure 1 indicate that the only operating point with a 100% true positive rate for GA, SVMBFF, and VQMM corresponds to a 100% false positive rate. Moreover, by design, VQMM displays a maximum of three operating points (Figure 1). Thus, a faultless playlist cannot be guaranteed by tuning the operating point of GA, SVMBFF, or VQMM.
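The operating-point argument can be illustrated with scikit-learn's roc_curve on synthetic scores from a deliberately weak classifier (an assumption for illustration, not the scores of the three algorithms): when the class score distributions overlap, reaching a 100% true positive rate forces the false positive rate close to 100% as well.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(4)
# Synthetic scores from a weak classifier: Song scores are only barely
# higher than Instrumental scores, so the distributions overlap heavily.
y_true = np.array([1] * 200 + [0] * 200)
scores = np.concatenate([rng.normal(0.3, 1.0, 200),
                         rng.normal(0.0, 1.0, 200)])

fpr, tpr, _ = roc_curve(y_true, scores)
print(roc_auc_score(y_true, scores))

# First operating point that reaches 100% true positives:
idx = int(np.argmax(tpr >= 1.0))
print(fpr[idx])  # the false positive rate paid for a perfect recall
```

Moving along the ROC curve trades recall against false positives; it never creates a high-precision operating point that the score distributions do not support.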
To guarantee a faultless playlist, another idea would be to tune the algorithms by adjusting the class weights. The aim would be to guarantee 100% precision even if the recall plummets. Even if a recall of only 1% were reached on the 40 million tracks of Deezer, it would provide a sufficient amount of tracks to generate 40 playlists fulfilling the maximum size authorized on streaming platforms. Moreover, with such a recall for the Instrumental tag, listeners can still apply another tag filter, such as "Jazz", to generate an Instrumental Jazz playlist, for example.
GA can be tuned, but not extensively enough to guarantee 100% precision because it uses RANSAC. RANSAC is a regression algorithm robust to outliers, and its configuration can only produce slight changes in performance, owing to its trade-off between accuracy and inliers. VQMM can also be tuned, but the increase in performance is limited due to the generalization made by the Markov model. SVMBFF can be tuned because class weights can be provided to the SVM. However, after trying different class weightings, the precision of SVMBFF only varies slightly, as the features used are not discriminating enough.
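Class weighting in an SVM can be sketched as follows; the synthetic, imbalanced data and the weight values are illustrative assumptions. Penalizing errors on the Song class more makes the classifier predict Instrumental only when it is confident, which pushes Instrumental precision up while the recall drops.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Overlapping classes: Instrumentals (1) are the rare, hard-to-spot class.
X = np.vstack([rng.normal(0.0, 1.0, (450, 5)),
               rng.normal(1.2, 1.0, (50, 5))])
y = np.array([0] * 450 + [1] * 50)  # 0 = Song, 1 = Instrumental

def instrumental_precision_recall(song_weight):
    # A larger weight on class 0 makes misclassifying Songs costlier.
    clf = SVC(kernel="linear",
              class_weight={0: song_weight, 1: 1.0}).fit(X, y)
    pred = clf.predict(X)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    precision = tp / (tp + fp) if tp + fp else 1.0  # no positives predicted
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for w in (1.0, 5.0, 20.0):
    print(w, instrumental_precision_recall(w))
```

When the features overlap as much as they do here, no weighting reaches high precision with a useful recall, which mirrors the observation that the SVMBFF features are not discriminating enough.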
We could also have performed an N-fold cross-validation on , but SVMBFF and VQMM cannot manage such an amount of musical data in the training phase.
We thus propose using different features and algorithms to generate a better Instrumental playlist than the ones possible with state-of-the-art algorithms.
Experiments in previous sections indicate that GA, SVMBFF, and VQMM fail to generate a satisfactory Instrumental playlist from an uneven and bigger musical database. As previously mentioned, such a playlist requires the highest precision possible while optimizing the recall. GA, SVMBFF, and VQMM might be "Horses" (Sturm, 2014), as they may not be addressing the problem they claim to solve. Indeed, they are not dedicated to the detection of singing voice without lyrics, such as onomatopoeias or the indistinct chant present in Crowd Chant by Joe Satriani, for example. To avoid similar mistakes, a proper goal (Sturm, 2016) has to be clarified for SIC. Indeed, a use case, a formal design of experiments (DOE) framework, and feedback from the evaluation to the system design are needed.
Our use case is composed of four elements: the music universe (), the music recording universe (), the description universe (), and a success criterion. is composed of the polyphonic recording excerpts of the music in . Songs and Instrumentals are the two classes of . The success criterion is reached when an Instrumental playlist without false positives is generated from autotagging.
Six treatments are applied. Two are control treatments (Random Classification and the classification of every track as Instrumental), i.e. baselines. Three treatments are state-of-the-art methods (GA, VQMM, and SVMBFF), and the last treatment is the proposed methodology. The experimental units and the observational units are the entire collection of audio recordings. As no cross-validation is processed, there is a unique treatment structure. There are two response models, since our proposed algorithm is a two-stage process. The first response model is binary, because a track is either Instrumental or not. The second response model is composed of the aggregate statistics (precision and recall). The treatment parameter is the generalization process made by our proposed algorithm, since this is what distinguishes it from the state-of-the-art algorithms. The feedback consists of the number of Instrumentals in the final playlist. The experimental design of features and classifiers is detailed in the following section.

The materials in the DOE come from the database SATIN (Bayle et al., 2017). We describe below the music universe () —i.e. SATIN— and its biases. The biases in the databases used in previous studies might have caused GA, VQMM, and SRCAM to overfit. The biases in SATIN thus have to be considered when interpreting the results. SATIN is a set of 41,491 audio recordings semi-randomly sampled out of the 40 million available on streaming platforms. The tracks in SATIN were sampled so as to retrieve all the tracks with validated identifier links between Deezer, Simbals, and Musixmatch. SATIN is representative in terms of genres and Song/Instrumental ratio. SATIN is biased towards mainstream music, as the tracks come from Deezer and Simbals: the database does not include independent labels and artists that are available on SoundCloud, for example.
The tracks have been recorded in the last 30 years. Finally, SATIN is biased toward English artists because these represent more than one third of the database.
The three experiments of this study show that using every feature at the frame scale improves performance more than using features at the track scale. In SVD, using frame features leads to misclassified Instrumentals, a high false positive rate, and indecision about the presence of singing voice at the frame scale. However, for our task, combining the classified frames can enhance SIC and lead to better results at the track scale. To use frame classification to detect Instrumentals, we propose a two-step algorithm. The first step is similar to a regular SVD algorithm, as it provides the probability that each frame contains singing voice or not. In the second step, the algorithm uses these probabilities to classify each track as Song or Instrumental. Figure 2 details the underpinning mechanisms of the first step of Instrumental detection, which is a regular SVD method.
Our algorithm extracts the thirteen MFCCs and the corresponding deltas and double deltas from each 93 ms frame of the tracks contained in . These features are then aligned with a frame-level ground truth produced by human annotators on the Jamendo database (Ramona et al., 2008), which contains 93 Songs. Frame-precise alignments are possible because the annotations provided by Ramona et al. (2008) take the form of intervals in which there is a singing voice or not. As for the Instrumentals in , all extracted features are associated with the tag Instrumental. All these features and ground truths are then used to train a Random Forest classifier. Afterwards, the Random Forest classifier outputs a vector of probabilities indicating the likelihood of singing voice presence for each frame.
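A minimal sketch of this frame-level stage with scikit-learn follows. The 39-dimensional frame descriptors (13 MFCCs plus deltas and double deltas) are replaced here by random stand-ins with an artificial class separation; only the shape of the probability-vector output is the point:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Random stand-ins for 39-dimensional frame descriptors
# (13 MFCCs + deltas + double deltas); the separation is artificial.
n_frames = 2000
vocal = rng.normal(loc=1.0, size=(n_frames, 39))         # frames with voice
instrumental = rng.normal(loc=-1.0, size=(n_frames, 39))  # voiceless frames

X = np.vstack([vocal, instrumental])
y = np.array([1] * n_frames + [0] * n_frames)  # frame-level ground truth

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Probability of singing-voice presence for each frame of an unseen track.
new_track = rng.normal(loc=-1.0, size=(300, 39))
voice_prob = clf.predict_proba(new_track)[:, 1]
print(voice_prob.shape)  # one probability per frame
```

In the real pipeline the descriptors come from the audio itself and the labels from the Jamendo interval annotations; the output is the per-frame probability vector used in the second step.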
Now each track has a probability vector giving the likelihood of singing voice presence for each frame. Using such soft annotations instead of binary ones has been shown to improve the overall classification results (Foucard et al., 2012). In the second step, the algorithm computes three sets of features for each track, two of which are based on the previous probability vector. The three sets of features generalize frame characteristics to produce features at the track scale. The first set of features is a linear 10-bin histogram, ranging from 0 to 1 by steps of 0.1, that represents the distribution of each probability vector. Even if multiple frames are misclassified, the main trend of the histogram still indicates the majority class of the frames.
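The first set of track-level features can be sketched as follows; the probability values are synthetic and `histogram_features` is a hypothetical helper name, not taken from the released code:

```python
import numpy as np

def histogram_features(voice_prob):
    """Linear 10-bin histogram of frame probabilities, normalized to sum to 1."""
    hist, _ = np.histogram(voice_prob, bins=10, range=(0.0, 1.0))
    return hist / len(voice_prob)

# Synthetic probability vector for a mostly instrumental track:
# frame probabilities cluster near 0.
rng = np.random.default_rng(0)
probs = np.clip(rng.normal(0.1, 0.05, 500), 0.0, 1.0)
feat = histogram_features(probs)
print(feat)  # most of the mass falls in the first bins
```

Even when a few frames receive high voice probabilities, the bulk of the histogram mass stays in the low-probability bins, which is the trend the track-level classifier can exploit.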
details the construction of the second set of features —named n-gram— that uses the probability vector of singing voice presence.
These song n-grams are computed in two steps. First, the algorithm counts the numbers of consecutive frames predicted to contain singing voice. It then computes the corresponding normalized 30-bin histogram, where n-grams longer than 30 are merged into the last bin. Indeed, chances are that an Instrumental will possess fewer consecutive frames classified as containing singing voice than a Song. Consequently, an Instrumental can be distinguished from a Song by its low number of long runs of consecutive predicted song frames. By using this whole set of features against such an amount of musical data, we hope to keep "Horses" away (Sturm et al., 2014; Sturm, 2014). Indeed, we increase the probability that our algorithm addresses the correct problem of distinguishing Instrumentals from Songs, for two reasons. The first reason is the use of an amount of musical data large enough to reflect the diversity in music. For example, our supervised algorithm can leverage Instrumentals that contain violin to distinguish its amplitude modulation from that of the singing voice. This could not have been the case if the musical database consisted only of rock music, for example. The second reason comes from the features used, which have been proven to detect singing voice presence under multiple track modifications related to pitch, volume, and speed (Bayle et al., 2016). These kinds of musical data augmentation (Schlüter and Grill, 2015) are known to diminish the risk of overfitting (Krizhevsky et al., 2012) and to improve the figures of merit in imbalanced class problems (Chawla, 2009; Wong et al., 2016), thus diminishing the risk of our algorithm being a "Horse".
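A possible reading of this two-step computation in Python is sketched below; the binary frame predictions are invented for illustration and the function name is hypothetical:

```python
import numpy as np

def ngram_features(frame_is_song, n_bins=30):
    """Normalized histogram of lengths of runs of consecutive song-predicted
    frames; runs longer than n_bins are merged into the last bin."""
    # Step 1: collect the lengths of consecutive song-predicted runs.
    runs, length = [], 0
    for is_song in frame_is_song:
        if is_song:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    # Step 2: build the normalized 30-bin histogram of run lengths.
    hist = np.zeros(n_bins)
    for run in runs:
        hist[min(run, n_bins) - 1] += 1
    return hist / max(len(runs), 1)

# An Instrumental typically shows only short, spurious song runs.
instrumental_pred = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
hist = ngram_features(instrumental_pred)
print(hist[:3])  # all mass on runs of length 1 and 2
```

A Song would instead place mass on the longer bins, since its voiced sections yield long uninterrupted runs of song-predicted frames.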
Finally, the third and last set of features consists of the mean values for MFCC, deltas, and double deltas.
All these features are then used as training materials for an AdaBoost classifier, as described in the following section.
It is necessary to choose a machine learning algorithm that can focus on Instrumentals, because these are not well detected and are a minority in musical databases. We thus choose boosting algorithms, because they alter the weights of training examples to focus on the most intricate tracks. Boosting is preferred over bagging, as the former aims to decrease bias while the latter aims to decrease variance; in this applicative context of generating an Instrumental playlist from a big musical database, decreasing the bias is preferable. Among boosting algorithms, the AdaBoost classifier is known to perform well for the classification of minority tags (Foucard et al., 2012) and music (Bergstra et al., 2006). A decision tree is used as the base estimator in AdaBoost. The first reason for using decision trees lies in their logarithmic training curve; the second involves the better performance of tree-based classifiers in singing voice detection (Lehner et al., 2014; Bayle et al., 2016). We use the AdaBoost implementation provided by the Python package scikit-learn (Pedregosa et al., 2011) to guarantee reproducibility.
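For illustration, such a track-level classifier could be set up as below. The 79-dimensional feature vectors (10 histogram bins + 30 n-gram bins + 39 mean MFCC-based values) are replaced by synthetic data; scikit-learn's AdaBoost boosts shallow decision trees by default:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic stand-in for the 79 track-level features
# (10 histogram bins + 30 n-gram bins + 39 mean MFCC-based values),
# with the Instrumental class in the minority.
X, y = make_classification(n_samples=1000, n_features=79,
                           weights=[0.8, 0.2], random_state=0)

# AdaBoost re-weights misclassified examples at each boosting round,
# which helps it focus on the minority Instrumental class.
clf = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X, y)
print(round(clf.score(X, y), 3))
```

The re-weighting of hard examples is precisely the bias-reduction behaviour motivating the choice of boosting over bagging here.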
This section evaluates the performance of the proposed algorithm in the same experiment as the one conducted in Section 7. We remind the reader that we train our algorithm on the 186 tracks of and test it against the 41,491 tracks of . Our algorithm reaches a global accuracy of and a global f-score of . Table 6 displays the precision and recall of our algorithm for Instrumental classification, alongside the previous corresponding results for AllInstrumental, GA, SVMBFF, and VQMM.
As indicated in Table 6, the main difference between our algorithm and GA, SVMBFF, and VQMM is the higher precision reached for Instrumental detection. The precision of our algorithm is 0.527 (276.8%) higher than the best existing method —i.e. VQMM— and 0.715 (750.0%) higher than RCA. From a practical point of view, if GA, SVMBFF, and VQMM are used to build an Instrumental playlist, at best 30% of the retrieved tracks are true positives, i.e. Instrumentals, whereas our proposed method increases this number beyond 80%, which is noteworthy for any listener. The high precision reached cannot be imputed to an overfitting effect, because the training set is 223 times smaller than the testing one. The results of GA, SVMBFF, and VQMM might have suffered from overfitting because their experiments implied a music universe () too restricted in terms of size and representativeness of the tracks' origins. Our algorithm brings the detection of Instrumentals closer to the human-performance level than state-of-the-art algorithms.
When applying the same proposed algorithm to Songs instead of Instrumentals, our algorithm reaches a precision of 0.959 and a recall of 0.844 on Song detection, which is respectively 0.07 (7.9%) and 0.344 (68.8%) higher than RCA. In this configuration, the global accuracy and f-score reached by our algorithm are 0.829 and 0.852, respectively.
As for VQMM in Fig. 1, we cannot tune our algorithm to guarantee 100% precision. Our algorithm has only one operating point, due to the use of the AdaBoost classifier. We tried SVM and Random Forest classifiers —which have multiple operating points— but they cannot guarantee as much precision as AdaBoost does. Our algorithm in its current state performs better in Instrumental detection than state-of-the-art algorithms, but it is still impossible to guarantee a faultless playlist. As we aim to reduce the false positives to zero, the proposed classification algorithm seems to be limited by the set of features used. A benchmark of SVD methods (Lukashevich et al., 2007; Ramona et al., 2008; Regnier and Peeters, 2009; Lehner et al., 2014; Leglaive et al., 2015; Lehner et al., 2015; Nwe et al., 2004; Schlüter and Grill, 2015; Schlüter, 2016) is needed to assess the impact of additional features on the precision and the recall when used with our generalization method. Indeed, features such as the Vocal Variance (Lehner et al., 2014), the Voice Vibrato (Regnier and Peeters, 2009), the Harmonic Attenuation (Nwe et al., 2004), and Auto-Regressive Moving Average filtering (Lukashevich et al., 2007) have to be reviewed.
Apart from benchmarking features, deep learning approaches for SVD have been proposed (Kereliuk et al., 2015; Leglaive et al., 2015; Lehner et al., 2015; Schlüter and Grill, 2015; Lidy and Schindler, 2016; Pons et al., 2016). However, deep learning is still a nascent and little-understood approach in MIR (see https://github.com/ybayle/awesome-deep-learning-music) and, to the best of our knowledge, no tuning of the operating point has been performed, as it is difficult to analyse the inner layers (Woods and Bowyer, 1997; Zhao et al., 2011). Furthermore, it is difficult to fit the spectrograms of the full-length tracks of a musical database into the memory of a GPU, and thus difficult for a deep learning model to train on full-length tracks for the SIC task. Current deep learning approaches indeed require batches of tracks large enough —usually 32 (Miron et al., 2017; Oramas et al., 2017)— to guarantee a good generalization process. For instance, a neural network architecture for SVD such as the one from Schlüter and Grill (2015) takes around 240 MB in memory for a 30-second spectrogram with 40 frequency bins per track. This architecture and batch size just fit in a high-end GPU with around 8 GB of RAM. Analysing full-length tracks of more than 4 minutes would require diminishing the batch size below 4, harmfully decreasing the model's generalization. This demonstration indicates that creating a faultless Instrumental playlist with a deep learning approach is not practically feasible now, and currently the only path toward better Instrumental playlists is to enhance the input feature set of our algorithm.
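The arithmetic behind this argument can be checked directly. The 240 MB per-30-second figure is the one quoted above; treating the GPU as having exactly 8 GiB and scaling spectrogram memory linearly with track duration are assumptions:

```python
GPU_RAM_MB = 8 * 1024  # high-end GPU, assumed exactly 8 GiB here
TRACK_30S_MB = 240     # footprint quoted for a 30 s spectrogram

def max_batch_size(track_seconds):
    # Memory per track is assumed to grow linearly with duration.
    per_track_mb = TRACK_30S_MB * track_seconds / 30
    return int(GPU_RAM_MB // per_track_mb)

print(max_batch_size(30))   # around the usual batch size of 32
print(max_batch_size(300))  # 5-minute tracks: batch drops below 4
```

Under these assumptions, 30-second excerpts allow batches of roughly the customary size of 32, while full-length tracks of 4 minutes or more push the feasible batch size to 4 or fewer.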
In this study, we propose solutions toward the content-based generation of faultless Instrumental playlists. Our new approach reaches a precision of 82.5% for Instrumental detection, which is approximately three times better than state-of-the-art algorithms. Moreover, this increase in precision is reached on a bigger musical database than the ones used in previous studies.
Our study provides five main contributions. We provide the first review of SIC in the applicative context of playlist generation (Sections 3 to 7). We show in Section 8 that the use of frame features outperforms the use of global track features in the case of SIC, and thus diminishes the risk of an algorithm being a "Horse". This improvement is magnified when frame ground truths are used alongside frame features, which is the key difference between our proposed algorithm and state-of-the-art algorithms. Furthermore, our algorithm's implementation can process large musical databases, whereas the current implementations of SVMBFF, SRCAM, and VQMM cannot. Additionally, we propose in Section 8 a new track tagging method based on frame predictions that outperforms the Markov model in terms of accuracy and f-score. Finally, we demonstrate that better playlists related to a tag can be generated when the autotagging algorithm focuses only on that tag. This increase is accentuated when the tag is in the minority, which is the case for most tags and especially for Instrumentals here.
The source code is available online at https://github.com/ybayle/SMC2017.
Personalized recommendation in social tagging systems using hierarchical clustering. Proc. ACM 2nd Conf. Recomm. Syst., 2008, pp. 259–266.
Automatic outlier detection in music genre datasets. Proc. 17th Int. Soc. Music Inform. Retrieval Conf., 2016, pp. 101–107.
Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell., 2009, 31, 210–227.
Singing voice detection with deep recurrent neural networks. Proc. 40th IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 121–125.
CQT-based convolutional neural networks for audio scene classification and domestic audio tagging. Proc. IEEE Audio Acoust. Signal Process. Challenge Works. Detect. Classif. Acoustic Scenes Events, 2016, pp. 60–64.