With the rise of online streaming services, it is becoming easier for artists to share their music with the rest of the world. With catalogs that can reach up to tens of millions of tracks, one of the rising challenges faced by music streaming companies is to assimilate ever-better knowledge of their content – a key requirement for enhancing user and artist experience. From a musical perspective, one highly interesting aspect is the detection of composition similarities between tracks, often known as the cover song detection problem. This is, however, a very challenging problem from a content analysis point of view, as artists can make their own version of a composition by modifying any number of elements – instruments, harmonies, melody, rhythm, structure, timbre, vocals, lyrics, among others.
Over the years, it has become customary in the Music Information Retrieval (MIR) literature to address the cover song detection problem in what is arguably the most challenging setting. Indeed, most papers attempt to detect composition relationships between pairs of tracks based on their two audio signals only – in other words, completely out of context and without using any metadata information. While this well-defined task makes sense from an academic perspective, it might not be the optimal approach for solving the problem at an industrial scale .
The second starting point of our work is the observation, often mentioned in cognitive science, that commonly observed patterns are represented and stored in a redundant fashion in the human brain, which makes them more likely to be retrieved, recognised and identified than patterns that are observed less frequently . If true, this would apply to our assessment of composition similarities as well. The main idea behind our work is that the corpus of existing versions of a composition can serve precisely as a substitute for these multiple representations.
Following these guiding intuitions, we turn to a new use case, where we do not just have pairs but a pool of candidates that are likely to be instances of some given musical work (according, e.g., to a first metadata analysis). We compare these candidates not only to one reference version (e.g. the original track, if it exists) but also to the other candidate versions, and build a graph of all these versions to identify composition clusters. When hundreds or thousands of versions of a given work exist (which is quite common in the catalogue of a streaming company), this ensemble-based approach can result in substantial improvements on the cover detection task.
In Section 2 we review the literature on cover identification. In Section 3, we present the 1-vs-1 cover identification algorithm that we use throughout the paper, which is heavily based on . The main contribution of this paper lies in Section 4, in which we present the new use case for cover identification described in the previous paragraph. We then showcase our method with examples in Section 5 and discuss some remaining challenges in Section 6.
2 Related work
A number of approaches to cover song identification have been developed over the last decade, with varying levels of performance. Reference  introduced a first solution to this problem and has served as a starting point for many subsequent studies. The main idea is to extract a list of beat-synchronous chroma features from two input tracks and to quantify their similarity by applying dynamic programming algorithms to a cross-similarity matrix derived from these features. This algorithm was refined in  by the same authors, who added modifications such as tempo biasing to improve the results. Harmonic Pitch Class Profile (HPCP) features (chroma features) have proven very useful in cover identification [8, 16, 14, 15], as they capture musical information that is meaningful for composition. Other features have subsequently been introduced, such as self-similarity matrices (SSM) of Mel-Frequency Cepstral Coefficients (MFCC) [19, 18]. To take advantage of the complementary properties of different types of features,  further introduced a method to combine several audio features by fusing the associated cross-similarity matrices, which resulted in a significant increase in performance compared to single-feature approaches.
Having extracted audio features from the two tracks to be compared, most methods use dynamic programming (either Dynamic Time Warping or the Smith-Waterman algorithm ) to assign a score to the pair [18, 19, 8]. One drawback of these methods is that they are computationally expensive and cannot be run at scale. Other authors have therefore developed solutions that enable cover identification at scale by mapping audio features to smaller latent spaces. For instance, [1, 11] use Principal Component Analysis (PCA) to compute a condensed representation of audio features, which they use to perform a large-scale similarity search (e.g. a nearest neighbor search). In the same vein, references [9, 13] use deep neural networks to learn low-dimensional representations of chroma features.
3 Pairwise matching
As mentioned above, our ensemble-based cover identification method consists of two steps. For a given work, we proceed in two steps: (i) a pairwise (1-vs-1) comparison of all the tracks in a pool of potential candidates; (ii) a clustering of these candidates based on the results of step (i).
In this section we present the 1-vs-1 cover song identification algorithm (i) which will be used as a starting point for our ensemble-based approach, and evaluate its performance on two distinct cover datasets.
3.1 The algorithm
For the purposes of this work, any 1-vs-1 similarity measure could be used for step (i), as we are mainly interested in quantifying the impact of step (ii) on the overall performance. We have chosen to rely on an implementation of the algorithm introduced in , as it achieves the best results to date on the Covers80  and MSD (Covers1000) datasets . A high-level overview of the pipeline is shown in Figure 1. As with most algorithms presented in Section 2, it can be decomposed into two stages: it first extracts a list of meaningful audio features from the two tracks to be compared, then computes a similarity score based on these features. The details of this method are not directly relevant to our work, so we focus here on a quantitative assessment of its performance, to give the reader a concrete idea of our starting point; for the details of how the algorithm works, we refer the reader to .
3.2 Quantitative evaluation of the 1-vs-1 method
We evaluate our implementation of  on two different datasets, and compare it with the numbers reported in the original paper as well as with a publicly available implementation of  by its authors (https://github.com/ctralie/GeometricCoverSongs). To make the comparison more interpretable, we evaluate two versions of our implementation with two sets of parameters: Params1 mimics the parameters used in , and should therefore produce numbers very similar to those reported in the original paper, while Params2 uses shorter, 8-beat-long blocks.
We first compare the algorithms on the widely used Covers80 dataset  to enable comparison with other published methods. The dataset is composed of 160 tracks that are divided into two sets (A and B) of 80 tracks each, with every track in set A matching one (and only one) track in set B. For each of the 160 tracks, we compute its score with all the other 159 tracks and report the rank of its true match. Table 1 reports the Mean Rank (MR) of the true match (1 is best), the Mean Reciprocal Rank (MRR) , as well as the Recall@1 (R@1) and Recall@10 (R@10). We also compute the so-called Covers80 score by querying each track in set A against all the tracks in set B and reporting the number of matches found with rank 1 (each track from set A is thus queried against the 80 tracks from set B only, instead of all other 159 tracks). Overall, our results are close to the ones reported in  – even though we could not quite reach the numbers given in their paper.
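For concreteness, the retrieval metrics reported in Table 1 can all be derived from the list of true-match ranks. The sketch below is ours (the function name is not from the paper):

```python
import numpy as np

def retrieval_metrics(true_match_ranks):
    """Compute MR, MRR, Recall@1 and Recall@10 from the rank of each
    query's true match (ranks are 1-based; 1 is best)."""
    ranks = np.asarray(true_match_ranks, dtype=float)
    return {
        "MR": ranks.mean(),             # mean rank of the true match
        "MRR": (1.0 / ranks).mean(),    # mean reciprocal rank
        "R@1": (ranks <= 1).mean(),     # fraction of queries ranked first
        "R@10": (ranks <= 10).mean(),   # fraction within the top 10
    }

# Toy example: 4 queries whose true matches were ranked 1, 1, 3 and 20.
print(retrieval_metrics([1, 1, 3, 20]))
```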
To complement this baseline, we have created an internal dataset of 452 pairs of covers grouped into several categories, obtained by metadata filtering based on the keywords Acoustic Cover, Instrumental Cover, Karaoke, Live, Remix, Tribute, as well as some Classical and Jazz covers. Such granularity allows us to compare the performance of our algorithm across genres and cover types, providing a new perspective on the problem, as shown in Table 2. We have tested the two versions of our algorithm on the 452 positive pairs and 10,000 negative pairs selected uniformly at random. We selected the classification threshold to ensure a very low false positive rate. Results are presented in Table 1. Our algorithm reaches a higher recall than the publicly available implementation of  (as the computational time is much higher for that implementation, we only computed its false positive rate using 500 negative pairs). Note that jazz is the most challenging genre to detect, as jazz covers include a lot of improvisation and can be structurally different from their parent track (see Table 2). If we remove jazz covers from the dataset, the recall increases further with the Params2 implementation.
| # of pairs | 57 | 63 | 46 | 57 | 31 | 53 | 77 | 68 |
| Recall | 94% | 84% | 97% | 100% | 93% | 96% | 100% | 35% |
In view of these results, we will use our own implementation with Params2 throughout the rest of this paper, as it is faster and performs best on our internal dataset, which is larger and more diverse than Covers80.
3.3 Distributions of scores
Figure 2 presents the histogram of pairwise scores for all the positive and negative pairs in our internal dataset. The distribution of scores for the negative pairs is short-tailed and tightly concentrated around . This means that above , all the pairs can be matched with high confidence. The distribution of scores for the positive pairs is much wider. As we can see from the histogram, a non-negligible fraction of these pairs lies below the classification threshold (dashed vertical line) and thus cannot be detected with this 1-vs-1 method. The purpose of the next section will be to apply an ensemble method to a pool of candidate versions of a given work, to bring these undetected candidates above the threshold by exploiting the many-to-many relationships between the candidates.
4 Ensemble analysis
While the 1-vs-1 algorithm presented in Section 3 gives satisfying results overall, it still struggles with covers that differ significantly from their original track. Here we show how analyzing a large pool of candidate covers for one given reference track can improve the quality of the matching. The intuition behind this idea is that a cover version may match the reference track poorly, yet match another intermediate version which is closer to the reference. For instance, an acoustic cover can be difficult to detect on a 1-vs-1 basis, but might match a karaoke version which itself strongly matches the reference track. We therefore turn to a new use case, where we no longer compare single pairs (e.g. one reference track against one possible cover), but instead start from a pool of candidates that are all likely to be instances of some given composition (or work). Usually, this pool corresponds to candidates that have been pre-filtered according to some non-audio signal, e.g. their title, and may comprise up to a few thousand candidates, depending on the popularity of the work and the specificity of the pre-filtering step.
4.1 Computing all pairwise scores
Given a set of n candidate versions of a work, we first compare all possible pairs of candidates within the set, resulting in n(n−1)/2 distinct scores s_ij. As mentioned above, if the candidates have been pre-filtered using some metadata-matching algorithm, n typically varies from a few dozen to a few thousand.
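This step can be sketched as follows, with a generic pairwise similarity standing in for the 1-vs-1 algorithm of Section 3 (`score_fn` and the function name are ours):

```python
from itertools import combinations

def pairwise_scores(candidates, score_fn):
    """Compute the n(n-1)/2 distinct 1-vs-1 scores for a pool of n
    candidates. `score_fn` is any symmetric pairwise similarity
    measure, e.g. the algorithm described in Section 3."""
    scores = {}
    for i, j in combinations(range(len(candidates)), 2):
        scores[(i, j)] = score_fn(candidates[i], candidates[j])
    return scores
```

In practice this is the expensive step (quadratic in the pool size), which is why pre-filtering the pool by metadata matters.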
4.2 Scores to distances
The histogram of pairwise scores (Figure 2) shows that almost all negative pairs have scores between 0 and 4, while scores above 8 always correspond to positives. Scores above 8 should thus indicate a high probability of a true match regardless of the exact value, while a variation in score around 4 should have a significant impact on that probability. To account for this, we convert our scores into more meaningful distances using a logistic function, d_ij = 1 / (1 + exp(λ(s_ij − s0))), where s_ij is the score associated with pair (i, j) and d_ij is the resulting distance. The values of the slope λ and midpoint s0 were chosen to work well with the distance-collapsing algorithm introduced in the next section.
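As an illustration, the score-to-distance conversion can be sketched as below. The midpoint and slope values are illustrative placeholders (the paper's tuned values are not reproduced here):

```python
import numpy as np

def score_to_distance(s, s0=6.0, k=1.0):
    """Logistic score-to-distance mapping: high pairwise scores give
    distances near 0, low scores give distances near 1. The midpoint
    s0 and slope k are illustrative, not the paper's tuned values."""
    return 1.0 / (1.0 + np.exp(k * (np.asarray(s, dtype=float) - s0)))
```

By design, all confident matches (scores well above the midpoint) collapse to near-zero distance, while the transition region around the midpoint retains the discriminative variation.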
4.3 Collapsing the distances
Let D denote the pairwise distance matrix between all pairs of candidates (see Figure 4, top left). The idea behind the ensemble-based approach is to exploit the geometry of the data to enhance the accuracy of the classification – for example, the fact that a track can match the reference track better through intermediate tracks than directly. We use a loose version of the Floyd-Warshall algorithm  to update the distances in D, such that the new distances satisfy the triangular inequality most of the time. (The distances that would be obtained by applying the original Floyd-Warshall algorithm to D would always satisfy the triangular inequality, but the resulting configuration would be very sensitive to outliers; our method is more robust to outliers, as it requires more than one better path to be found before updating the distance between two points.) The method is presented in Algorithm 1, in which the minimum operator denotes the k-th smallest value of a vector rather than the smallest. We have found that the algorithm is slightly more robust when imposing a small penalty for using an intermediate node, whose value we set by grid-search optimization.
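Under our reading of the algorithm, the loose Floyd-Warshall update could be sketched as follows. The values of m, the penalty, and the number of passes are illustrative assumptions, not the paper's tuned parameters:

```python
import numpy as np

def collapse_distances(D, m=2, penalty=0.1, n_iter=1):
    """Loose Floyd-Warshall: shrink D[i, j] toward the m-th smallest
    one-hop path cost D[i, k] + D[k, j] + penalty over intermediate
    nodes k. Requiring the m-th smallest (m > 1) means more than one
    better path must exist before a distance is updated, which makes
    the collapse robust to outlier tracks."""
    D = np.asarray(D, dtype=float).copy()
    for _ in range(n_iter):
        # All one-hop path costs: paths[i, j, k] = D[i, k] + D[k, j] + penalty
        paths = D[:, None, :] + D.T[None, :, :] + penalty
        # m-th smallest over the intermediate node k (index m-1 after partition)
        mth_best = np.partition(paths, m - 1, axis=2)[:, :, m - 1]
        D = np.minimum(D, mth_best)
        np.fill_diagonal(D, 0.0)
    return D
```

With m = 1 and no penalty this degenerates to one pass of the classical Floyd-Warshall relaxation; m > 1 is what provides the outlier robustness discussed above.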
Figure 4 shows the distance matrix before (top left) and after (top right) updating the distances using Algorithm 1, for a set of candidate versions of Get Lucky by Daft Punk. The updated distance matrix has a more neatly defined division between clusters of tracks: one large cluster in which all tracks are extremely close to each other (the white area), a few smaller clusters (white blocks on the main diagonal), and a number of isolated tracks that match only themselves.
4.4 Hierarchical clustering
We then proceed to a clustering of the tracks using the updated distance matrix defined in Section 4.3, which we denote D'. We use hierarchical clustering, as we have no prior knowledge of the number of clusters in the graph. Figure 4 (bottom) shows the dendrogram associated with the hierarchical clustering applied to D'. In this example, if we apply a relatively selective threshold, we find one major cluster (colored in blue in Figure 4) that contains 97% of the true positives and no false positives. Most other clusters contain a single element; together, these singletons comprise all the negative tracks and the remaining 3% of the positives. If we set the clustering threshold lower, we can obtain more granular clusters within the same work.
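Using SciPy, this clustering step could be sketched as follows (the linkage method and cut threshold are illustrative choices, not specified by the paper):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_candidates(D_hat, threshold=0.5):
    """Hierarchically cluster candidates from the collapsed distance
    matrix D_hat, then cut the dendrogram at `threshold`. Returns the
    linkage matrix and one cluster label per candidate."""
    condensed = squareform(D_hat, checks=False)  # upper triangle as a vector
    Z = linkage(condensed, method="average")
    labels = fcluster(Z, t=threshold, criterion="distance")
    return Z, labels
```

Cutting the dendrogram lower yields the more granular within-work clusters mentioned above.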
4.5 Final score
In order to assign each track a final score that measures its similarity to the reference track, we use the cophenetic distance to the reference track, i.e. the distance along the dendrogram produced by the hierarchical clustering. Each track is thus assigned a final score in [0, 100], simply taken equal to 100 × (1 − cophenetic distance), such that exact matches have a score of 100.
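A sketch of this scoring step with SciPy's cophenetic-distance utilities (the average linkage method is our assumption; distances are taken to lie in [0, 1] as produced by the logistic conversion):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

def final_scores(D_hat, ref_idx=0):
    """Final score of each candidate: 100 * (1 - cophenetic distance
    to the reference track), so that an exact match scores 100."""
    Z = linkage(squareform(D_hat, checks=False), method="average")
    coph = squareform(cophenet(Z))      # back to a square matrix
    scores = 100.0 * (1.0 - coph[ref_idx])
    scores[ref_idx] = 100.0             # the reference matches itself
    return scores
```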
5 Analysis of real world examples
We now apply the above method to real-world data. Our dataset consists of 10 sets of candidates corresponding to 10 works whose versions we want to retrieve. These 10 works span multiple genres and musical styles, including Hip Hop, R&B, Rap, Pop and Jazz. For a given work, we create the set of candidates by performing a metadata search of the work's title on the whole Spotify catalogue. Across the works that we study, this produces sets of candidates whose sizes vary from a few hundred to a few thousand tracks. Each set includes a reference track, which serves as the anchor point for that composition. More details on the dataset can be found in Table 3.
| Work | # tracks | % positives | Reference artist |
| Blurred Lines | 386 | 71% | Robin Thicke |
| Bodak Yellow | 110 | 78% | Cardi B |
| Embraceable You | 1319 | 94% | Sarah Vaughan |
| Get Lucky | 657 | 83% | Daft Punk |
5.2 Outline of the analysis
For each of these works, we analyze the set of candidates following the steps outlined in the previous two sections, providing us with two sets of outputs for each work: (a) the direct score, defined as the output of the 1-vs-1 algorithm between each candidate and the reference track, as described in Section 3 (rescaled between 0 and 100); (b) the ensemble-based score, produced by the method described in Section 4 (also between 0 and 100).
In the next section we start by quantitatively evaluating our ensemble-based approach (b) against the direct approach (a), before turning to some qualitative examples.
5.3 Quantitative results
We define two different metrics to evaluate the direct and the ensemble-based methods:
Ranking metric: For each work, we pick the threshold that minimizes the number of classification errors, and report that number of errors. We call this a ranking metric because the number of errors is minimized when positives and negatives are perfectly ranked, regardless of their absolute scores. We also report the corresponding recall and false positive rates at this threshold.
Classification metric: We fix a universal classification threshold and compute the corresponding number of classification errors.
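The ranking metric above can be sketched as a sweep over the observed scores (the function name is ours):

```python
import numpy as np

def best_threshold_errors(scores, labels):
    """Ranking metric: sweep over the observed scores and return the
    threshold minimizing the total number of classification errors
    (false negatives + false positives), together with that number."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_errors, best_thr = len(labels) + 1, None
    for t in np.unique(scores):
        pred = scores >= t          # predict positive above the threshold
        errors = int((pred != labels).sum())
        if errors < best_errors:
            best_errors, best_thr = errors, float(t)
    return best_errors, best_thr
```

The classification metric reuses the same error count, but with a single fixed threshold applied to every work.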
Ranking errors – direct

| Work | Best thr. | False negatives | False positives | Both |
| Bodak Yellow | 6.1 | 6 (7.0%) | 6 (33.3%) | 12 (10.9%) |
| Embraceable You | 4 | 0 (0%) | 74 (98.7%) | 74 (5.6%) |
Ranking errors – ensemble-based

| Work | Best thr. | False negatives | False positives | Both |
| Embraceable You | 40.4 | 22 (1.8%) | 19 (25.3%) | 41 (3.1%) |
Table 4 shows the results for the ranking metric for each work in our dataset. For the optimal thresholds, we report the number of false negatives, false positives and the sum of both (i.e. the total number of classification errors). We also compute the corresponding false negative rate, false positive rate and total error rate.
Table 4 shows the ranking results for the direct approach. Interestingly, the number of false negatives tends to be higher than the number of false positives (Embraceable You is an exception, as its threshold is degenerate and all tracks are classified as matching). This is in line with the histogram in Figure 2, which shows a short-tailed distribution for the negatives and a wider distribution for the positives. Overall, the error rate remains within a few percent, corresponding to a high recall rate and a low false positive rate (except for Bodak Yellow and Embraceable You, which have a very small number of negatives to begin with – the latter case being in fact degenerate, as nearly all tracks are classified as matching).
Table 4 also shows the ranking results for our ensemble-based approach. The number of ranking errors is substantially lower than for the direct approach, for both false positives and false negatives, as the total error rate drops below 1% in most cases. Again, the main exception is Bodak Yellow, which has the smallest number of candidates (it was also a genuinely difficult example that we struggled to annotate). Embraceable You is the second most challenging work, but remarkably its threshold is no longer degenerate, meaning that the method has now found a way to separate the candidates. Notably, the false negatives no longer outnumber the false positives: the ensemble-based approach has successfully caught most of the difficult tracks that poorly matched the reference track. Among the few tracks that are still missed, several are actually very close to the threshold, and only a handful remain completely undetected (cf. Table 7).
Classification errors – direct

| Work | Thr. | False negatives | False positives | Both |
| Bodak Yellow | 12.1 | 49 (57%) | 0 (0%) | 49 (44.5%) |
| Embraceable You | 12.1 | 753 (60.5%) | 3 (4%) | 756 (57.3%) |
Classification errors – ensemble-based

| Work | Thr. | False negatives | False positives | Both |
| Embraceable You | 78.8 | 70 (5.6%) | 1 (1.3%) | 71 (5.4%) |
Table 5 shows the results for the classification metric. The universal threshold for each approach is defined as the median of the optimal thresholds obtained in the ranking experiment above. Again, we report the number of false negatives, the number of false positives and the sum of both, along with the corresponding false negative rate, false positive rate and total error rate. Here again, the results of the ensemble-based approach are overall superior to those of the direct approach, mostly due to an increase in recall. Although Table 5 is quite similar to Table 4 overall – a sign that the threshold on direct scores can be chosen in a nearly universal way – Table 5 differs considerably from Table 4 for some specific works (namely Halo, Imagine and Believer). This happens because the optimal threshold is significantly higher on these works, leaving a large number of false positives above the universal threshold of 78.8.
For each work, we can identify the cases where the ensemble-based approach has detected previously undetected tracks, and trace back the optimal path joining the reference track to the newly found track. Table 6 shows a few examples of such paths for various works. For each example, the reference track is shown at the top of the cell (depth 0) and the newly found track at the bottom (maximal depth), with the intermediate tracks that bridged the gap in between. All the examples are true positives, except for the last one (Halo Halo by Fajters), which has been erroneously matched to a karaoke version of the reference track.
| Work | Depth | Main artist | Direct score | Ensemble score |
| Imagine | 1 | Classic Gold Hits | 60.0 | 99.99 |
| Imagine | 2 | A Perfect Circle | 21.6 | 97.9 |
| Imagine | 3 | Yoga Pop Ups | 8.6 | 97.9 |
| Get Lucky | 1 | Samantha Sax | 40.4 | 99.95 |
| Get Lucky | 2 | Dallas String Quartet | 6.9 | 86.6 |
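The chains of intermediate versions in Table 6 can be illustrated with a plain shortest-path search on the distance graph. This is a simplification: the paper traces paths through the updates of Algorithm 1, whereas the sketch below simply runs Dijkstra on the raw pairwise distances:

```python
import heapq

def shortest_path(D, src, dst):
    """Dijkstra shortest path on the pairwise-distance graph, used to
    trace the chain of intermediate versions linking the reference
    track (src) to a newly detected track (dst)."""
    n = len(D)
    dist = [float("inf")] * n
    prev = [None] * n
    dist[src] = 0.0
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                     # stale heap entry
        for v in range(n):
            nd = d + D[u][v]
            if v != u and nd < dist[v]:
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], dst                 # walk predecessors back to src
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]
```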
What about the tracks that are still undetected? Table 7 shows examples of tracks that remain undetected by our ensemble-based approach for several works. No clear pattern emerges, apart from the fact that they are often in a very different musical style from the original.
| Work | Main artist - Title | Direct score | Ensemble score |
| Get Lucky | The Getup - Get Lucky | 6.4 | 24.9 |
| Halo | Polina Kermesh - Halo | 6.3 | 98.1 |
| Halo | Amanda Sense - Halo | 12.4 | 94.3 |
| Imagine | Dena De Rose - Imagine | 10.4 | 86.2 |
| Embraceable You | Earl Hines - Embraceable You | 9.6 | 36.3 |
| Embraceable You | Samina - Embraceable You | 5.8 | 17.0 |
| Heartless | Bright Light - Heartless | 11.9 | 50.3 |
| Heartless | Rains - Heartless | 6.4 | 47.7 |
| Bodak Yellow | Josh Vietti - Bodak Yellow | 5.8 | 14.2 |
| Bodak Yellow | J-Que Beenz - Bodak Yellow | 5.6 | 13.0 |
| Airplanes | Em Fresh - Airplanes | 5.5 | 66.1 |
| Airplanes | Lisa Scinta - Airplanes | 9.6 | 55.0 |
One main challenge associated with our ensemble-based approach is how to correctly handle transitivity. This issue emerges from the fact that compositions are not mutually exclusive. For example, a medley might constitute a bridge between two distinct composition groups, which our algorithm would then merge together (which is undesirable). There are probably at least two ways around this issue: one is metadata-based (i.e. identify these potential outliers from the metadata and exclude them from the graph computation), while another is to detect them directly from the graph structure (identify bridges between otherwise unrelated clusters).
In this paper, we have introduced a new formulation of the cover song identification problem: among a pool of candidates that are likely to match one given reference track, find the actual positives. We have proposed a two-step approach: a first step computes pairwise similarities between every pair of tracks in the pool of candidates (for which any known 1-vs-1 approach can be used), and a second, ensemble-based step exploits the relationships between all the candidates to produce the final results. We have shown that this second step can significantly improve performance compared to a pure 1-vs-1 approach, in particular on the ranking task, where the error rate drops from a few percent to less than 1% in general. The classification task is naturally more challenging, as the optimal threshold may vary from work to work, suggesting that the method would be best exploited as a complement to human annotation – where the human's task would mainly be to find the optimal threshold for the classification. Automating this last step turned out to be non-trivial and is left for future work.
-  Thierry Bertin-Mahieux and Daniel PW Ellis. Large-scale cover song recognition using the 2D Fourier transform magnitude. In ISMIR, pages 241–246, 2012.
-  Thierry Bertin-Mahieux, Daniel PW Ellis, Brian Whitman, and Paul Lamere. The million song dataset. In ISMIR, volume 2, page 10, 2011.
-  Albin Andrew Correya, Romain Hennequin, and Mickaël Arcos. Large-scale cover song detection in digital music libraries using metadata, lyrics and audio features. arXiv preprint arXiv:1808.10351, 2018.
-  Nick Craswell. Mean reciprocal rank. In Encyclopedia of Database Systems, page 1703. Springer, 2009.
-  Simon Durand, Juan Pablo Bello, Bertrand David, and Gaël Richard. Robust downbeat tracking using an ensemble of convolutional networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 25(1):76–89, 2017.
-  Daniel PW Ellis. The "Covers80" cover song data set. http://labrosa.ee.columbia.edu/projects/coversongs/covers80, 2007.
-  Daniel PW Ellis and C Cotton. The 2007 labrosa cover song detection system. MIREX extended abstract, 2007.
-  Daniel PW Ellis and Graham E Poliner. Identifying 'cover songs' with chroma features and dynamic programming beat tracking. In Acoustics, Speech and Signal Processing (ICASSP 2007), volume 4, pages IV–1429. IEEE, 2007.
-  Jiunn-Tsair Fang, Chi-Ting Day, and Pao-Chi Chang. Deep feature learning for cover song identification. Multimedia Tools and Applications, 76(22):23225–23238, 2017.
-  Robert W. Floyd. Algorithm 97: Shortest path. Commun. ACM, 5(6):345, June 1962.
-  Eric J Humphrey, Oriol Nieto, and Juan Pablo Bello. Data driven and discriminative projections for large-scale cover song identification. In ISMIR, pages 149–154, 2013.
-  Ray Kurzweil. How to create a mind: The secret of human thought revealed. Penguin, 2013.
-  Xiaoyu Qi, Deshun Yang, and Xiaoou Chen. Audio feature learning with triplet-based embedding network. In AAAI, pages 4979–4980, 2017.
-  Suman Ravuri and Daniel PW Ellis. Cover song detection: from high scores to general classification. In Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pages 65–68. IEEE, 2010.
-  Joan Serra, Emilia Gómez, and Perfecto Herrera. Transposing chroma representations to a common key. In IEEE CS Conference on The Use of Symbols to Represent Music and Multimedia Objects, pages 45–48, 2008.
-  Joan Serra, Emilia Gómez, Perfecto Herrera, and Xavier Serra. Chroma binary similarity and local alignment applied to cover song identification. IEEE Transactions on Audio, Speech, and Language Processing, 16(6):1138–1151, 2008.
-  Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195–197, 1981.
-  Christopher J Tralie. Early MFCC and HPCP fusion for robust cover song identification. arXiv preprint arXiv:1707.04680, 2017.
-  Christopher J Tralie and Paul Bendich. Cover song identification with timbral shape sequences. arXiv preprint arXiv:1507.05143, 2015.