I. Introduction
In the field of music content analysis, quantifying similarity between audio signals has received substantial interest [1]. Owing to the proliferation of music in digital formats, there is potential for applications of music similarity techniques in a wide range of domains. At the level of individual tracks, these domains span audio fingerprinting [2], cover song identification [3], artist identification [4, 5] and genre classification [6]. Applications can be distinguished according to their degree of specificity [1], referring to the level of granularity required for retrieving audio tracks from a collection, given a query track. For example, in audio fingerprinting, the required specificity is high, since the set of possible tracks corresponding to a particular recording is typically small, in relation to the data set. In contrast, genre classification requires low specificity, since the set of tracks sharing a common genre is potentially large, in relation to the data set.
A cover song may be defined as a rendition of a previously recorded piece of music [7]. Cover song identification is deemed to have mid-level, diffuse specificity, since cover songs may differ from the original song in various musical facets, including rhythm, tempo, melody, harmonisation, instrumentation, lyrics and musical form. Correspondingly, cover song identification remains a challenging problem [3].
In this work, we investigate methods for cover song identification that are based on quantifying pairwise predictability between sequences. From a music-psychological perspective, the significance of intrinsic predictability in musical sequences has been reflected on by Meyer [8], who considers the possibility of using Shannon's information theory [9] to quantify predictive uncertainty. Statistical learning is implicated in forming musical expectations [10]; a successful approach to modelling expectations in response to an unfolding stream of musical events involves estimating sequential statistical models and computing information-theoretic measures of predictive uncertainty [11]. As exemplified in [12], an information-theoretic approach admits a rich conceptual framework for quantifying predictive uncertainty in musical sequences. For our own purposes in cover song identification, we seek to establish whether an information-theoretic approach might be useful for determining pairwise similarity between tracks.
Based on our previous work [13], we consider an information-theoretic approach to quantifying similarity between feature vector sequences. One possible approach based on the non-Shannon information measure of Kolmogorov complexity [14], the normalised compression distance (NCD) [15], quantifies similarity between two strings in terms of joint compressibility. The NCD has been applied successfully across a range of problem domains [15, 16, 17, 18], including music content analysis [19, 20, 21, 22, 23, 24]. For our chosen task of cover song identification, we interpret the NCD as a measure of pairwise predictability. Using our information-theoretic framework, we compare the NCD to alternative predictability measures based on Shannon information. We provide an evaluation of competing information-theoretic approaches and identify issues concerning their implementation. This paper extends our previous work [13] as follows: firstly, we examine a larger set of distance measures and estimate distance measures by predicting discrete-valued sequences. Further, we incorporate the Million Song Dataset (MSD) [25] into our evaluations. Finally, we investigate combining distance measures using both our considered datasets.

The remainder of this paper is organised as follows: Section II discusses related work on audio-based cover song identification and methods for determining musical similarity. Section III introduces the pairwise similarity methods evaluated in this work. Section IV describes our experimental procedure. Finally, in Sections V and VI we present results and conclusions.
II. Related Work
II-A. Musical Similarity
Methods for characterising similarity between sequences of audio features can be distinguished based on whether the temporal order of features is discarded or retained [1]. In the former so-called 'bag-of-features' approach, a widespread method involves estimating distributions of features obtained from time-frequency representations of musical audio [26, 27, 28, 29, 5, 30, 31]. The bag-of-features approach is unable to model the temporal aspect of music, in which rhythmic, harmonic and melodic objects exhibit sequential structure and in which repetition and variation are of importance [32]. Casey and Slaney [33] emphasise the role of sequences for music similarity applications, whereas Aucouturier et al. [29] discuss the relative limitations of the bag-of-features approach in a comparison of musical and non-musical audio modelling. Sequential approaches have been utilised in music structure analysis, for identifying repeated and contrasting sequences and their boundaries within a single piece of music [34], in addition to cover song identification.
II-B. Cover Song Identification
Owing to the importance of tonal content in determining whether a song is a cover of another, recent cover song identification approaches typically extract representations of the tonal content using chroma features [35, 36]. Chroma features quantify energy distributions across octave-folded bands, using pitch classes in the chromatic scale to map frequency bands to chroma bins.
A variety of cover song identification approaches are based on aligning feature sequences. A widespread approach involves using dynamic programming to determine an optimal set of feature vector insertions, deletions and substitutions, obtained from a similarity matrix. Following Foote's [37] method of applying dynamic time warping (DTW) to a similarity matrix constructed from spectral energy features, Gómez and Herrera [38] propose a DTW approach using chroma features. Serrà et al. [7] propose to compute binarised similarity matrices, substituting DTW with an alternative local alignment approach. The cross-recurrence approaches proposed by Serrà et al. [39] extend the notion of similarity matrices considered in the preceding investigations, in that time-lagged chroma vectors are combined to form higher-dimensional temporal features. In an alternative approach, Serrà et al. [40] utilise the previously described method of representing chroma features in combination with nonlinear time series prediction techniques, using the cross-prediction error as a measure of similarity.
Using a signal processing approach, Ellis and Poliner [41] determine component-wise cross-correlation maxima as a measure of similarity between chroma features. Jensen [42] computes the Euclidean distance between two-dimensional autocorrelations of chroma sequences. More recently, Bertin-Mahieux [43] proposes a key-invariant approach based on applying the two-dimensional Fourier transform to chroma sequences.
An alternative approach involves computing similarities between discrete-valued representations of musical content. Tsai et al. [44] apply DTW to discrete-valued sequences, using spectral peak-picking for predominant melody extraction. Bello [45] and Lee [46] perform chord estimation with hidden Markov models, using mappings of model states to chords. The resulting sequences are then aligned using DTW. Martin et al. [47] heuristically select chroma bin maxima to determine triads, before locally aligning sequences. We may consider DTW-based approaches, the string-based heuristic evaluated in [47] and the cross-correlation approach evaluated in [41] as alignment techniques, in the sense that they may be used to maximise pairwise correlation between sequences.

With particular regard to this work, a number of approaches are based on applying the NCD to discrete-valued sequences. Using symbolic musical representations directly, Cilibrasi et al. [20] apply hierarchical clustering to pairwise distances between pieces of music, performing an analysis of clusters with respect to musical genres, musical works and artists. Li and Sleep apply the NCD to genre classification of symbolic musical representations [19] and musical audio [21].

For audio-based cover song identification, Ahonen [23] obtains discrete-valued representations of frame-based chroma features by applying a hidden Markov model (HMM) to perform chord transcription. Predicted chord sequences are then converted to a differential representation, before computing pairwise distances between tracks using the NCD based on different compression algorithms. Ahonen [48] further proposes to compute multiple discrete-valued representations using additional HMMs and by computing chroma differentials, before averaging separately obtained pairwise distances using the NCD based on prediction by partial matching (PPM) [49]. In addition, Ahonen [50] investigates chroma-derived representations which are compressed using Burrows-Wheeler (BW) compression [51]. Bello [24] applies the NCD to recurrence plots computed on individual tracks, as a measure of structural similarity between pieces of music. Finally, Tabus et al. [52] propose a similar approach to Ahonen based on quantising chroma-derived representations, observing that an alternative compression-based similarity measure outperforms the NCD. Additionally, Silva et al. [53] propose a measure of structural similarity based on video compression, observing superior performance using an alternative compression-based measure. Our work extends the above investigations, in that we examine and propose the use of alternative information-theoretic similarity measures to the NCD. Furthermore, we perform an extensive comparison of methods for estimating the NCD and related similarity measures, while proposing approaches which do not require quantising audio features.
A number of recent investigations are concerned with cover song identification using large-scale music collections containing millions of tracks. For such collections, it is typically infeasible to perform computationally expensive pairwise comparisons between a query and every track in the collection. Casey et al. [54] compute Euclidean distances between windowed chroma sequences. Pairwise similarity is then quantified as the number of distances falling below a threshold. Such an approach may be combined with locality-sensitive hashing [55] for retrieval with sublinear time complexity, with respect to a single query. Using a similar approach, Bertin-Mahieux and Ellis [56] propose to identify salient 'landmark' chroma vectors in individual tracks by applying a thresholding scheme. Identified landmark vectors are then encoded as integers, so that the collection may be represented as a lookup table. Given a query, the same authors envisage that obtained results are re-ranked using a computationally expensive approach, as proposed by Khadkevich and Omologo [57]. In this work, we apply such a filter-and-refine approach [58], using information-theoretic similarity measures in the refinement stage.
II-C. Information-Theoretic Methods
Information-theoretic similarity measures between time series have been proposed in a variety of domains. The idea of jointly compressing two discrete-valued sequences is due to Loewenstern et al. [59], in the context of nucleotide sequence clustering. By parsing sequences using the Lempel-Ziv (LZ) algorithm [60], Ziv and Merhav [61] propose a method for comparing sequences by compressing one sequence using a model estimated on the other sequence. An alternative approach is considered by Benedetto et al. [62] for building language trees, where sequences are jointly compressed. Cilibrasi et al. [63] motivate their approach of jointly compressing sequences as an approximation of the normalised information distance [15].
III. Approach
We denote with $X$, $Y$ two multivariate time series, each representing a sequence of feature vectors extracted from a piece of musical audio. If we assume that both $X$, $Y$ consist of independent and identically distributed realisations generated respectively by stochastic processes with densities $p_X$, $p_Y$, one possible means of quantifying dissimilarity between time series involves the Kullback-Leibler (KL) divergence, defined as

$$D_{\mathrm{KL}}(p_X \parallel p_Y) = \int p_X(v) \log \frac{p_X(v)}{p_Y(v)} \,\mathrm{d}v \qquad (1)$$

where $p_X(v)$, $p_Y(v)$ denote the probability density of observation $v$ emitted by $X$, $Y$, respectively. Viewed in terms of Shannon information and taking the logarithm to base 2, recall that the KL divergence quantifies the expected number of additional bits required to represent observations emitted by information source $X$, given an optimal code for observations emitted by information source $Y$. The KL divergence has been widely used in conjunction with a 'bag-of-features' approach for low-specificity music content analysis tasks [1].

To account for temporal structure in musical audio, we may use the NCD as a measure of musical dissimilarity between sequences of quantised feature vectors [21, 23, 52]. Given two strings $x$, $y$, the NCD is defined as

$$\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \qquad (2)$$

where $C(x)$ denotes the number of bits required to encode string $x$, using a compressor such as the LZ compression algorithm [60]. Similarly, $C(xy)$ denotes the number of bits required to encode the sequential concatenation of strings $x$, $y$. The NCD is an approximation of the normalised information distance (NID) [15], defined as
$$\mathrm{NID}(x, y) = \frac{K(x, y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}} \qquad (3)$$

where the uncomputable function $K(\cdot)$ denotes algorithmic information content (AIC), also known as Kolmogorov complexity. The AIC $K(x)$ of a given string $x$ is the length in bits of the shortest program which outputs the string and then terminates [14]. Similarly, $K(x, y)$ denotes the length of the shortest program which outputs $x$, $y$, in addition to a means of distinguishing between both output strings [14]. Thus, AIC quantifies the number of bits required to represent specified input strings, under maximally attainable compression. Furthermore, the NID characterises dissimilarity using the transformation under which input strings most closely resemble each other [15].
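To make the NCD of (2) concrete, the following sketch approximates $C(\cdot)$ with the DEFLATE compressor from the Python standard library; compressed lengths are measured in bytes rather than bits, which leaves the ratio essentially unchanged. This is an illustrative stand-in, not the compressor configuration evaluated in this work.

```python
import zlib

def compressed_length(s: bytes) -> int:
    """Length of the DEFLATE-compressed string: a proxy for C(.) in Eqn. (2)."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance, Eqn. (2)."""
    cx, cy = compressed_length(x), compressed_length(y)
    cxy = compressed_length(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because the compressor reuses its dictionary across the concatenation, jointly compressible strings yield distances near zero, whereas unrelated strings approach a distance of one.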
We are interested in examining the performance of the NCD as an approximation of the NID, where the choice of compressor determines the feature space used to compute similarities [64] in the NCD. Furthermore, note that the choice of sequential concatenation in (2) to approximate $K(x, y)$ represents an additional heuristic [15]. In the following sections, we describe our contribution: we first consider in Section III-A the NID from the perspective of Shannon information, using which we propose a modification to the NCD in Section III-B. We then propose alternative prediction-based measures of similarity in Section III-C. We detail our approach of applying such measures to continuous-valued sequences in Section III-D.
III-A. Quantifying Time Series Dissimilarity Using Shannon Information
We approach the problem of quantifying dissimilarity from the perspective of Shannon information. We assume finite-order, stationary Markov sources $X$, $Y$. We denote with $X_1^n = X_1, \ldots, X_n$ the sequence of discrete random variables emitted by source $X$ at times $1, \ldots, n$. We denote with $h(X)$, $h(X, Y)$, $h(X \mid Y)$ the entropy rate, joint entropy rate and conditional entropy rate, respectively defined as

$$h(X) = \lim_{n \to \infty} \frac{1}{n} H(X_1^n) \qquad (4)$$

$$h(X, Y) = \lim_{n \to \infty} \frac{1}{n} H(X_1^n, Y_1^n) \qquad (5)$$

$$h(X \mid Y) = \lim_{n \to \infty} \frac{1}{n} H(X_1^n \mid Y_1^n) \qquad (6)$$

The entropy rate defined in (4) quantifies the average amount of uncertainty about an emission $X_t$, while accounting for dependency between emissions $X_s$, $X_t$ for all $s \neq t$. Analogously, the joint entropy rate defined in (5) quantifies the average amount of uncertainty about the pair $(X_t, Y_t)$ emitted by sources $X$, $Y$, while in addition accounting for correlation between the sources. For the conditional entropy rate we have

$$h(X \mid Y) = \lim_{n \to \infty} \frac{1}{n} \left[ H(X_1^n, Y_1^n) - H(Y_1^n) \right] \qquad (7)$$

$$= h(X, Y) - h(Y) \qquad (8)$$

From (8) we may interpret $h(X \mid Y)$ as quantifying the average amount of uncertainty about a given emission $X_t$, while taking into account dependency between observations emitted by $X$ and given knowledge of observations emitted by $Y$.
For observations $X_1^n$ emitted from source $X$, up to an additive constant the expectation $\mathrm{E}[K(X_1^n)]$ may be approximated using the entropy [65],

$$\mathrm{E}[K(X_1^n)] \approx H(X_1^n) \qquad (9)$$

Using (4), (5), we assume further approximations

$$\mathrm{E}[K(X_1^n)] \approx n\, h(X) \qquad (10)$$

$$\mathrm{E}[K(X_1^n, Y_1^n)] \approx n\, h(X, Y) \qquad (11)$$

where $\mathrm{E}[K(X_1^n, Y_1^n)]$ denotes the expected value of $K(x, y)$ for observations emitted from sources $X$, $Y$. In terms of Shannon information, following [66] we use (6) and estimate the NID as

$$d_{\mathrm{NID}}(X, Y) = \frac{\max\{h(X \mid Y),\, h(Y \mid X)\}}{\max\{h(X),\, h(Y)\}} \qquad (12)$$
III-B. Normalised Compression Distance with Alignment
As given in (12), the NID utilises the joint entropy rate $h(X, Y)$, which accounts for correlation between sources. In contrast, the approach of compressing sequentially concatenated strings to estimate $K(x, y)$ may be inadequate for compressors based on Markov sources, since correlation is not accounted for [66]. To address this possible limitation, we propose the normalised compression distance with alignment (NCDA), defined as

$$\mathrm{NCDA}(x, y) = \frac{C(a(x, y)) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}} \qquad (13)$$

where $a(x, y)$ performs alignment as a means of maximising correlation between integer-valued strings $x$, $y$. We generate equal-length strings by padding the shorter of the two strings with the most common value of the longer string. Then, we determine the lag which maximises cross-correlation between strings, before circularly shifting one string using the obtained lag value. Finally, we interleave strings. We motivate our choice of cross-correlation by considering that cross-correlation may be computed efficiently, as a series of inner products. Hence, our choice of cross-correlation is pragmatic; an alternative approach might involve minimising the NCDA with respect to all lags, or aligning strings using an alternative algorithm.

III-C. Predictive Modelling
As previously described, the NCD and NCDA rely on determining the number of bits required to encode strings, using a specified compression algorithm. As an alternative approach, we consider the relation between predictability and compressibility [67, 68] and perform sequence prediction. We illustrate our approach for the case of discrete-valued observations. First, recall that the entropy rate is given as

$$h(X) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_1^n \in \mathcal{A}^n} p(x_1^n) \log_2 p(x_1^n) \qquad (14)$$

where $p(x_1^n)$ denotes the probability of observing $x_1^n$, with $x_1^n \in \mathcal{A}^n$ according to the alphabet $\mathcal{A}$. We may interpret the quantity $-\log_2 p(x_1^n)$ as the number of bits required to represent $x_1^n$, assuming an optimal code. $h(X)$ thus quantifies the expected number of bits required to represent a single observation emitted by $X$, while accounting for dependency between observations. Assume that we have an empirical estimate $\hat{p}$ of the distribution $p$, based on finite observations $x_1^n$. Following [69], we estimate $h(X)$ using the average log-loss $\ell(\hat{p}, x_1^n)$, defined as

$$\ell(\hat{p}, x_1^n) = -\frac{1}{n} \log_2 \hat{p}(x_1^n) \qquad (15)$$

$$= -\frac{1}{n} \sum_{i=1}^{n} \log_2 \hat{p}(x_i \mid x_1^{i-1}) \qquad (16)$$

where $\hat{p}(x_i \mid x_1^{i-1})$ denotes the estimated probability of observing $x_i$, given preceding context $x_1^{i-1}$. Using (16), we thus compute the average log-loss by evaluating the likelihood of observations under the estimated distribution $\hat{p}$, which we may conceive of as performing a series of predictions based on increasingly long contexts. Since $\hat{p}$ is an estimate of $p$, the described process is termed self-prediction [40].
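As an illustration of the self-prediction estimate (16), the sketch below accumulates prequential log-loss under a deliberately simple order-1 Markov model with Laplace smoothing; the PPM and LZ predictors used in our evaluation are more sophisticated, so treat this as a minimal stand-in.

```python
import math
from collections import defaultdict

def average_log_loss(seq, alphabet):
    """Average log-loss, Eqn. (16), under an order-1 Markov model with
    Laplace smoothing. Each symbol is predicted from counts over the
    preceding context before the counts are updated (prequential setup)."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    loss = 0.0
    context = None
    for s in seq:
        # Laplace-smoothed estimate of P(s | context).
        p = (counts[context][s] + 1) / (totals[context] + len(alphabet))
        loss -= math.log2(p)
        counts[context][s] += 1
        totals[context] += 1
        context = s
    return loss / len(seq)
```

A repetitive sequence incurs a lower average log-loss, hence a lower entropy-rate estimate, than an irregular one over the same alphabet.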
We denote with $q(x_1^n)$ the probability of observing $x_1^n$ from source $Y$. A measure of disparity between sources is the cross entropy rate $h_\times(X \parallel Y)$,

$$h_\times(X \parallel Y) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_1^n \in \mathcal{A}^n} p(x_1^n) \log_2 q(x_1^n) \qquad (17)$$

quantifying the expected number of bits required to represent observations emitted by source $X$, given an optimal code for source $Y$. We estimate $h_\times(X \parallel Y)$ by computing the average log-loss $\ell(\hat{q}, x_1^n)$ based on iterated prediction, where $\hat{q}$ denotes an estimate of $q$ based on observations emitted by $Y$. Since $X$, $Y$ represent disparate sources, the described process is termed cross-prediction [40]. Analogous to the NCD, as a symmetric distance between sources based on cross entropy, we compute the quantity

$$d_\times(X, Y) = \frac{\max\{h_\times(X \parallel Y),\, h_\times(Y \parallel X)\}}{\max\{h(X),\, h(Y)\}} \qquad (18)$$

where in (18) the denominator serves as a normalisation factor, analogous to the denominator in (2), and where we use self-prediction to estimate $h(X)$, $h(Y)$.
To obtain a prediction-based estimate of the NID in (12), we may estimate $h(X)$, $h(Y)$, again using self-prediction. Furthermore, we estimate the conditional entropy rate $h(X \mid Y)$ using the distribution $\hat{p}(x_i \mid x_1^{i-1}, y_1^n)$, referring to the estimated distribution of observations emitted by $X$, given knowledge of observations emitted by $Y$. Analogous to self-prediction and cross-prediction, we define the quantity

$$-\frac{1}{n} \sum_{i=1}^{n} \log_2 \hat{p}(x_i \mid x_1^{i-1}, y_1^n) \qquad (19)$$

We refer to the process used to compute (19) as conditional self-prediction.
[Fig. 1: Schematic depiction of (a) cross-prediction, (b) self-prediction, (c) conditional self-prediction.]
III-D. Continuous-Valued Approach
The quantities described in Section III-C may be computed using quantised feature vectors [21, 23, 31, 52]. As an alternative, we propose an approach requiring no prior quantisation. As used in [40], in our approach we utilise nonlinear time series prediction. In contrast to [40], we are concerned with evaluating distance measures which we compute as statistics of prediction errors. Therefore, we use a comparatively straightforward nearest-neighbours approach. Given the sequence of feature vectors $x_1, x_2, \ldots$, consider first the process of time-delay embedding [70], which yields the vector sequence $\tilde{x}_1, \tilde{x}_2, \ldots$, whose elements are defined as

$$\tilde{x}_t = \mathrm{vec}\left( x_{t-(m-1)\tau}, \ldots, x_{t-\tau}, x_t \right) \qquad (20)$$

According to (20), each element $\tilde{x}_t$ aggregates feature vector $x_t$ along with its preceding temporal context. The amount of temporal context is controlled by parameters $m$, $\tau$, respectively referred to as embedding dimension and time delay. Operator $\mathrm{vec}(\cdot)$ denotes vectorisation.
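The embedding of (20) can be sketched as follows, assuming features arranged as a (time, dimension) array; the function name is ours.

```python
import numpy as np

def time_delay_embed(x, m, tau):
    """Time-delay embedding, Eqn. (20): stack each feature vector with its
    m-1 predecessors at spacing tau, vectorised per time step.
    x has shape (T, d); the result has shape (T - (m-1)*tau, m*d)."""
    T = x.shape[0]
    rows = [np.concatenate([x[t - k * tau] for k in range(m - 1, -1, -1)])
            for t in range((m - 1) * tau, T)]
    return np.asarray(rows)
```

For example, with 100 beat-synchronous 12-bin chroma frames, $m = 3$ and $\tau = 2$, the embedded sequence has 96 frames of 36 dimensions each.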
Our method of predicting features is based on determining nearest neighbours in time-delay embedded space. We first illustrate our method for the case of cross-prediction, depicted schematically in Fig. 1(a). Given the embedded sequence $\tilde{x}_1, \tilde{x}_2, \ldots$, we denote with $\hat{y}_{t+\kappa}$ the estimated successor of sequence $y_1, \ldots, y_t$,

$$\hat{y}_{t+\kappa} = x_{\eta(t)+\kappa} \qquad (21)$$

where $\kappa$ denotes the predictive horizon (how far into the future we predict), and where we define $\eta(t)$ as

$$\eta(t) = \arg\max_{u}\, \rho(\tilde{x}_u, \tilde{y}_t) \qquad (22)$$

with $\rho(\cdot, \cdot)$ denoting the sample Pearson correlation coefficient between vectors. We motivate the use of correlation coefficients as an alternative to the Euclidean distance, following [71].
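A minimal sketch of this correlation-based nearest-neighbour cross-prediction, under our reading that each embedded frame of the target sequence is matched to its most correlated embedded frame of the source sequence, whose successor then serves as the prediction; the function names and this exact matching rule are assumptions for illustration.

```python
import numpy as np

def cross_predict(x_emb, y_emb, x_future, kappa=1):
    """For each embedded frame of y, find the embedded frame of x with
    maximal Pearson correlation and take x's continuation kappa steps
    ahead as the prediction of y's continuation."""
    def standardise(a):
        a = a - a.mean(axis=1, keepdims=True)
        return a / (a.std(axis=1, keepdims=True) + 1e-12)
    # Inner products of row-standardised vectors are proportional to
    # Pearson correlation coefficients, so the argmax is unchanged.
    corr = standardise(y_emb) @ standardise(x_emb).T      # shape (Ty, Tx)
    # Only match frames of x that still have a successor kappa steps ahead.
    nn = np.argmax(corr[:, : x_emb.shape[0] - kappa], axis=1)
    return x_future[nn + kappa]
```

When the two sequences coincide, every frame matches itself and the "prediction" reproduces the true successor, which makes the mechanics easy to check.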
Depicted schematically in Fig. 1(b), to perform self-prediction we set $y = x$. Since features may be slowly varying, when forming prediction $\hat{x}_{t+\kappa}$ we disregard observations in the immediate past of time step $t$. Thus we define

$$\hat{x}_{t+\kappa} = x_{\eta'(t)+\kappa} \qquad (23)$$

with $\eta'(t)$ defined as

$$\eta'(t) = \arg\max_{u \,:\, |u - t| > w}\, \rho(\tilde{x}_u, \tilde{x}_t) \qquad (24)$$

and where $w$ denotes the radius below which observations are disregarded.
Finally, to perform conditional self-prediction, we use both time-delay embedded spaces. Given predictions $\hat{x}^{(\times)}_{t+\kappa}$, $\hat{x}^{(s)}_{t+\kappa}$, respectively obtained using cross-prediction and self-prediction, we compute the linear combination

$$\hat{x}_{t+\kappa} = \lambda\, \hat{x}^{(\times)}_{t+\kappa} + (1 - \lambda)\, \hat{x}^{(s)}_{t+\kappa} \qquad (25)$$

Similar to the approach given in [72], in (25) for weighting coefficient $\lambda$ we use

$$\lambda = \frac{\epsilon_s}{\epsilon_\times + \epsilon_s} \qquad (26)$$

where $\epsilon_\times$, $\epsilon_s$ respectively denote cross-prediction and self-prediction mean squared errors. Fig. 1(c) depicts conditional self-prediction schematically.
Given the sequence of predictions $\hat{x}_1, \hat{x}_2, \ldots$, we denote with $e_t$ the rescaled prediction error, whose $i$th component is given by

$$e_{t,i} = \frac{x_{t,i} - \hat{x}_{t,i}}{\sqrt{s_i}} \qquad (27)$$

where $s_i$ denotes the sample variance of the $i$th component in $x_1, x_2, \ldots$. We contrast our approach with the component-wise normalised mean squared error (NMSE) based on cross-prediction used in [40], which may be applied as an alternative measure of dissimilarity between time series. Our approach is based on assuming that the prediction error may be represented using a normally distributed random variable $E$ with samples $e_1, e_2, \ldots$. Using the samples, we estimate the prediction error entropy parametrically. In the case of self-prediction, we assume the approximation $h(X) \approx \hat{H}(E)$; analogously, in the cases of cross-prediction and conditional self-prediction, we assume respective approximations $h_\times(X \parallel Y) \approx \hat{H}(E)$, $h(X \mid Y) \approx \hat{H}(E)$. Assuming normality, we estimate $\hat{H}(E)$ using the equation

$$\hat{H}(E) = \frac{1}{2} \log_2\!\left( (2\pi e)^d \det \hat{\Sigma} \right) \qquad (28)$$

where $\hat{\Sigma}$ denotes the sample covariance of $e_1, e_2, \ldots$ and $d$ the dimensionality of the error. In our continuous-valued approach, using the prediction methods depicted in Fig. 1, we thus estimate information-based measures as statistics of the prediction error sequence. We then substitute the obtained quantities in (12) and (18) to obtain continuous-valued, prediction-based analogues of the NID and distance $d_\times$. The continuous-valued, prediction-based approach contrasts with our discrete-valued, prediction-based methods previously described in Section III-C and our discrete-valued, compression-based method described in Section III-B.
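Under the normality assumption, the parametric entropy estimate reduces to the closed-form differential entropy of a fitted Gaussian; a sketch, with entropy measured in bits:

```python
import numpy as np

def gaussian_entropy_bits(errors):
    """Differential entropy (bits) of a multivariate normal fitted to
    prediction-error samples: 0.5 * log2((2*pi*e)^d * det(Sigma))."""
    errors = np.atleast_2d(errors)
    d = errors.shape[1]
    sigma = np.cov(errors, rowvar=False).reshape(d, d)
    sign, logdet = np.linalg.slogdet(2.0 * np.pi * np.e * sigma)
    return 0.5 * logdet / np.log(2.0)
```

Larger prediction errors inflate the fitted covariance and hence the entropy estimate, matching the intuition that poorly predictable sequences carry more information per observation.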
IV. Experimental Method
We first evaluate our proposed methods using a set of audio recordings of Jazz standards (http://www.eecs.qmul.ac.uk/~peterf/jazzdataset.html). We assume that two tracks are a cover pair if they possess identical title strings. Thus, we assume a symmetric relation when determining cover identities. The equivalence class of tracks deemed to be covers of one another is a cover set. The Jazz dataset comprises 97 cover sets, with an average cover set size of 3.06 tracks.
Furthermore, we perform a large-scale evaluation based on the MSD [25]. This dataset includes metadata and precomputed audio features for a collection of Western popular music recordings. We use a predefined evaluation set of 5,236 query tracks partitioned into 1,726 cover sets (http://labrosa.ee.columbia.edu/millionsong/secondhand), with an average cover set size of 3.03 tracks. Following [43], for each query track, we seek to identify the remaining cover set members contained in the entire track collection.
IV-A. Feature Extraction
For the Jazz dataset, as a representation of musical harmonic content, we extract 12-component beat-synchronous chroma features from audio using the method and implementation described in [41]. Assuming an equal-tempered scale, the method accounts for deviations in standard pitch from 440 Hz, by shifting the mapping of FFT bins to pitches within a range of semitones. Following chroma extraction, beat synchronisation is achieved using the method described in [73]. First, onset detection is performed by differencing a log-magnitude Mel-frequency spectrogram across time and applying half-wave rectification, before summing across frequency bands. After high-pass filtering the onset signal, a tempo estimate is formed by applying a window function to the autocorrelated onset signal and determining autocorrelation maxima. Varying the centre of the window function allows tempo estimation to incorporate a bias towards a preferred beat rate (PBR). The tempo estimate and onset signal are then used to obtain an optimal set of beat onsets, by using dynamic programming. Chroma features are averaged over beat intervals, before applying square-root compression and normalising chroma features with respect to the Euclidean norm. Based on our previous work [13], we evaluate using a PBR of 240 beats per minute (bpm).
The MSD includes 12-component chroma features alongside predicted note and beat onsets [74], which we use in our evaluations. In contrast to the beat-synchronous features obtained for the Jazz dataset, MSD chroma features are initially aligned to predicted onsets. Motivated by our choice of PBR for the Jazz dataset, we resample predicted beat onsets to a rate of 240 bpm. We then average chroma features over resampled beat intervals. Finally, we normalise features as described for the Jazz dataset.
IV-B. Key Invariance
To account for musical key variation within cover sets, we transpose chroma sequences using the optimal transposition index (OTI) method [7]. Given two chroma vector sequences, we form summary vectors $\bar{x}$, $\bar{y}$ by averaging over the entire sequences. The OTI corresponds to the number of circular shift operations applied to $\bar{y}$ which maximises the inner product between $\bar{x}$ and the shifted $\bar{y}$,

$$\mathrm{OTI}(\bar{x}, \bar{y}) = \arg\max_{j \in \{0, \ldots, 11\}} \left\langle \bar{x},\, \mathrm{circshift}(\bar{y}, j) \right\rangle \qquad (29)$$

where $\mathrm{circshift}(\bar{y}, j)$ denotes applying $j$ circular shift operations to $\bar{y}$. We subsequently shift chroma vectors by the obtained number of positions, prior to pairwise comparison.
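The OTI computation amounts to scoring all twelve circular shifts of one summary chroma vector against the other; a sketch (the shift direction convention is ours):

```python
import numpy as np

def optimal_transposition_index(x_chroma, y_chroma):
    """OTI, Eqn. (29): the number of circular shifts of the mean chroma of
    y that maximises its inner product with the mean chroma of x."""
    gx = x_chroma.mean(axis=0)            # summary vector over sequence x
    gy = y_chroma.mean(axis=0)            # summary vector over sequence y
    scores = [np.dot(gx, np.roll(gy, j)) for j in range(12)]
    return int(np.argmax(scores))
```

A sequence transposed up by three chroma bins is recovered by shifting nine positions further (i.e. twelve in total), so the shifted summary vectors coincide.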
IV-C. Quantisation
For discrete-valued similarity measures, we quantise chroma features using the $k$-means algorithm. We cluster chroma features aggregated across all tracks, where we consider a range of codebook sizes. To increase stability, we execute the $k$-means algorithm multiple times. We then select the clustering which minimises the mean squared error between data points and assigned clusters. The described quantisation method performs similarly to an alternative based on pairwise sequence quantisation; for a detailed discussion we refer to our previous work [13].
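The restart-and-select scheme can be sketched with a minimal $k$-means; this is an illustrative helper, not the implementation or parameter settings used in our experiments.

```python
import numpy as np

def kmeans_codebook(features, k, n_restarts=5, n_iter=50, seed=0):
    """Minimal k-means with restarts: run Lloyd's algorithm from several
    random initialisations and keep the codebook with lowest MSE."""
    rng = np.random.default_rng(seed)
    best_codebook, best_mse = None, np.inf
    for _ in range(n_restarts):
        centres = features[rng.choice(len(features), size=k,
                                      replace=False)].astype(float)
        for _ in range(n_iter):
            # Assign each frame to its nearest centre, then re-estimate.
            d = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
            labels = d.argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centres[j] = features[labels == j].mean(axis=0)
        d = ((features[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
        mse = d.min(axis=1).mean()
        if mse < best_mse:
            best_codebook, best_mse = centres.copy(), mse
    return best_codebook
```

Each chroma frame is subsequently replaced by the index of its nearest codebook entry, yielding the integer-valued strings used by the compression- and prediction-based measures.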
IV-D. Distance Measures
We summarise the distance measures evaluated in this work in Table I, where for each distance measure, we list our estimation methods.
We utilise the following algorithms to compute distance measures by compressing strings: prediction by partial matching (PPM) [49], Burrows-Wheeler (BW) compression [51] and Lempel-Ziv (LZ) compression [60], implemented respectively as PPMD (http://compression.ru/ds/), BZIP2 (http://bzip2.org) and ZLIB (http://zlib.org). In all cases, we set parameters to favour compression rates over computation time. To obtain strings, following quantisation we map integer codewords to alphanumeric characters.
We use the described compression algorithms to determine the length in bits of compressed strings and compute NCD and NCDA distances. In a complementary discrete-valued approach, we use string prediction instead of compression. Using average log-loss, we compute the NCDA using the formula

$$\mathrm{NCDA}(x, y) = \frac{L(a(x, y)) - \min\{L(x), L(y)\}}{\max\{L(x), L(y)\}} \qquad (30)$$

where $L(s)$ denotes the total log-loss of string $s$ under self-prediction, so that $L(a(x, y))$ is obtained by performing self-prediction on the aligned sequence $a(x, y)$. We compute a prediction-based variant of the NCD analogously, by predicting sequentially concatenated strings without performing any alignment. In addition, we use cross-prediction to estimate distance measure $d_\times$, as defined in (18). We perform string prediction using Begleiter's [69] implementations of the PPMC and LZ78 algorithms.
Note that the KL divergence given in (1) is non-symmetric. In our evaluations, we observed that computing a symmetric distance improved performance; based on the KL divergence, we compute the Jensen-Shannon divergence (JSD), defined as

$$\mathrm{JSD}(p \parallel q) = \frac{1}{2} D_{\mathrm{KL}}(p \parallel m) + \frac{1}{2} D_{\mathrm{KL}}(q \parallel m) \qquad (31)$$

where $m$ denotes the mean of $p$, $q$,

$$m = \frac{1}{2}(p + q) \qquad (32)$$

As a baseline method, we compute the JSD between symbol histograms normalised to sum to one.
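The JSD baseline between normalised symbol histograms is straightforward to compute; a base-2 sketch:

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD of Eqns. (31)-(32) between normalised histograms, in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                      # terms with a == 0 contribute zero
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike the KL divergence, the JSD is symmetric and, in base 2, bounded between 0 and 1, which makes it convenient as a pairwise distance.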
We evaluate continuous-valued prediction using fixed time-delay embedding parameters $m$, $\tau$ and predictive horizon $\kappa$, setting the exclusion radius $w$ in (24) based on preliminary analysis using separate training data. We compute distance measure $d_\times$ using cross-prediction to estimate the numerator in (18). In a complementary approach, we estimate the NID using conditional self-prediction to estimate the numerator in (12). For $d_\times$ and the NID, we use self-prediction to estimate the denominators in (18) and (12), respectively.
Finally, to compensate for cover song candidates consistently deemed similar to query tracks, we normalise pairwise distances using the method described in [75]. We apply distance normalisation as a postprocessing step, before computing performance statistics.
Table I

Distance measure | Definition | Estimation method
NCD              | Eqn. (2)   | String compression (LZ, BW, PPM); discrete prediction (LZ, PPM)
NCDA             | Eqn. (13)  | String compression (LZ, BW, PPM); discrete prediction (LZ, PPM)
$d_\times$       | Eqn. (18)  | Discrete prediction; continuous prediction
JSD              | Eqn. (31)  | Normalised symbol histograms
NID              | Eqn. (12)  | Continuous prediction
IV-E. Large-Scale Cover Song Identification
For music content analysis involving large datasets, algorithm scalability is an important issue. The approaches in this work by themselves require a linear scan through the dataset for a given query, which may be infeasible for large datasets. We use a scalable approach for our evaluations involving the MSD. Following [57] and similar to the method proposed in [76], we incorporate our methods into a twostage retrieval process. By using a metric distance to determine similarity in the first retrieval stage, we allow for the potential use of indexing or hashing schemes, as proposed in [54, 58]. We then apply nonmetric pairwise comparisons in the second retrieval stage.
In the first stage, we quantise as described in Section IVC and represent each track with a normalised codeword histogram. Given a query track, we then rank each of the candidate tracks using the L1 distance. To account for key variation, for each candidate track we minimise L1 distance across chroma rotations. We then determine the top candidate tracks, which we rerank in the second stage using our proposed methods. After both retrieval stages, we normalise pairwise distances as described in Section IVD. We report performance based on the final ranking of all candidate tracks, across query tracks.
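The filter stage can be sketched as follows. For illustration we let histograms be 12-dimensional, so that a transposition is emulated by a circular shift; real codeword histograms would require a codebook-specific rotation mapping, so this indexing is an assumption.

```python
import numpy as np

def first_stage_candidates(query_hist, candidate_hists, top_n=100):
    """Filter stage sketch: rank candidates by L1 distance between
    normalised histograms, minimised over the 12 chroma rotations, and
    return the indices of the top_n nearest candidates."""
    dists = np.array([
        min(np.abs(query_hist - np.roll(h, r)).sum() for r in range(12))
        for h in candidate_hists
    ])
    return np.argsort(dists)[:top_n]
```

Only the returned shortlist is passed to the computationally expensive pairwise measures in the refinement stage.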
IV-F. Performance Statistics
As in [24], we quantify cover song identification accuracy using mean average precision (MAP), based on ranking tracks according to distance with respect to queries. The MAP is obtained by averaging query-wise scores, where each score is the average of precision values at the ranks of relevant tracks; in our case, relevant tracks are covers of the query track. Following [24], we use the Friedman test [77] to compare accuracies among distance measures; the test ranks, for each query, the distance measures according to average precision. We combine the Friedman test with Tukey's range test [78] to adjust for Type I errors when performing multiple comparisons.
As a subsidiary performance measure, for each query we compute the precision at rank r, for r ∈ {5, 10, 20}. We subsequently average across queries to obtain the mean precision at rank r.
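The MAP and precision-at-rank statistics described above admit a compact sketch; the function names are illustrative, not from an existing library.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the ranks of relevant (cover) tracks."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, tid in enumerate(ranked_ids, start=1):
        if tid in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevance):
    """MAP: mean of per-query average precision scores.
    rankings: query id -> ranked list of track ids
    relevance: query id -> collection of cover track ids."""
    return sum(average_precision(rankings[q], relevance[q])
               for q in rankings) / len(rankings)

def precision_at_rank(ranked_ids, relevant_ids, r):
    """Fraction of the top-r results that are covers of the query."""
    relevant = set(relevant_ids)
    return sum(1 for tid in ranked_ids[:r] if tid in relevant) / r
```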
IV-G Combining Distance Measures
To determine if combining distance measures improves cover song identification accuracy, we obtain pairwise distances as described in Section IV-D. We denote by $d^{(k)}_{i,j}$ the pairwise distance between the $i$-th query track and the $j$-th result candidate, obtained using the $k$-th distance measure in our evaluation. We transform $d^{(k)}_{i,j}$ by computing the inverse rank $r^{(k)}_{i,j}$,

$$r^{(k)}_{i,j} = \frac{1}{\rho(d^{(k)}_{i,j})}, \qquad (33)$$

where $\rho(d^{(k)}_{i,j})$ denotes the rank of $d^{(k)}_{i,j}$ among all distances obtained with respect to the $i$-th query track, given the $k$-th distance measure. We apply this transformation to protect against outliers, while ensuring that the transformed values change rapidly among track pairs deemed highly similar. Note that since our transformation preserves monotonicity and MAP itself is based on ranked distances, the performance of unmixed distance measures is uninfluenced by this transformation. Finally, we combine the transformed distances $r^{(k)}_{i,j}$ by computing a weighted average of distances pooled using the min and max operators,

$$\bar{d}_{i,j} = \gamma \min_k r^{(k)}_{i,j} + (1 - \gamma) \max_k r^{(k)}_{i,j}, \qquad (34)$$

where we vary $\gamma$ in the range $[0, 1]$. We motivate our approach on the basis that inverse ranks may be interpreted as estimated probabilities of cover identity; furthermore, the min and max operators have been proposed as a means of combining probability estimates for classification [79]. In forming a linear combination, we evaluate the utility of min pooling versus max pooling. An alternative approach based on straightforward averaging did not yield any performance gain.
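The inverse-rank transform and the pooling of (34) can be sketched as follows. Since the extracted text elides the exact notation, the weighting scheme below (a gamma-weighted average of min- and max-pooled inverse ranks) is our reading of the description, and all names are illustrative.

```python
def inverse_ranks(distances):
    """Map each candidate's distance to 1/rank with respect to the query:
    the smallest distance becomes 1, the next 1/2, and so on."""
    order = sorted(range(len(distances)), key=lambda j: distances[j])
    r = [0.0] * len(distances)
    for rank, j in enumerate(order, start=1):
        r[j] = 1.0 / rank
    return r

def combine(inv_ranks_per_measure, gamma):
    """Weighted average of min- and max-pooled inverse ranks across measures.
    Note that inverse ranks behave as similarity scores (larger = more
    similar), consistent with interpreting them as cover probabilities."""
    n = len(inv_ranks_per_measure[0])
    combined = []
    for j in range(n):
        vals = [m[j] for m in inv_ranks_per_measure]
        combined.append(gamma * min(vals) + (1.0 - gamma) * max(vals))
    return combined
```

Setting gamma to 1 gives pure min pooling and 0 gives pure max pooling; intermediate values trade off the two.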
IV-H Baseline Approaches
In addition to the JSD and cross-prediction NMSE baselines, we include an evaluation of the method and implementation described in [41], based on cross-correlation. As a random baseline, we sample pairwise distances from a normal distribution.
V Results
V-A Discrete-Valued Approaches Based on Compression
In Fig. 2 (a)–(c), we examine the performance of discrete-valued NCD and NCDA distance measures, combined with LZ, BW and PPM algorithms and based on the Jazz dataset. For the LZ algorithm, NCDA yields a relative performance gain of %, averaged across codebook sizes. In contrast, for PPM, with the exception of small codebook sizes in the range , NCDA yields no consistent improvement over NCD; however, averaged across codebook sizes we obtain a mean relative performance gain of %. Finally, the effect of using NCDA is reversed for BW compression, where performance decreases by an average of %.
Examining results for the MSD in Fig. 2 (e)–(g), we observe similar qualitative results for LZ and BW algorithms. For the LZ algorithm, NCDA yields an average relative performance gain of %, whereas for BW compression we observe an average relative performance loss of %. In contrast to the Jazz dataset, for PPM we observe an average relative performance loss of %.
For both datasets, NCDA appears to be most advantageous combined with LZ compression, whereas BW yields the least advantageous result. Note that BW compression is block-based, in contrast to the LZ and PPM compressors, both of which are sequential. We attribute this observation to performance differences among compressors, since the assumptions made in Section III-B rely on the assumption of Markov sources. Noting differences in relative performance gains between datasets, following [57] we further conjecture that the chroma feature representation influences the performance of the evaluated distance measures.
We examine the performance of the JSD between normalised symbol histograms, as displayed in Fig. 2 (d), (h). Surprisingly, for the Jazz dataset and for , the JSD outperforms compression-based methods, with the maximum MAP score obtained for . This result is contrary to our expectation that NCD approaches should outperform the bag-of-features approach by accounting for temporal structure in the time series. In contrast, for the MSD and for optimal , both NCD and NCDA outperform the JSD across all evaluated compression algorithms. We attribute this disparity to differences in dataset size: for the Jazz dataset, the problem size may be sufficiently small to diminish the advantages of NCD and NCDA over the JSD.
V-B Discrete-Valued Approaches Based on Prediction
In Fig. 3, we consider the performance of distance measures based on string prediction. For the Jazz dataset, comparing log-loss estimates of NCD and NCDA using the LZ algorithm, NCDA outperforms NCD averaged across codebook sizes; we obtain a mean relative performance gain of (Fig. 3 (a)). For the PPM algorithm, although NCD maximises performance (MAP ), we obtain a mean relative performance gain of % using NCDA over NCD (Fig. 3 (b)). Importantly, for both LZ and PPM the cross-prediction distance consistently outperforms NCD and NCDA; for and combined with PPM compression, we obtain MAP . For the MSD and using LZ compression, in contrast to the Jazz dataset we observe a mean relative performance loss of % when comparing against NCDA. For both LZ and PPM, NCDA compared to NCD yields mean relative performance gains of % and %, respectively.
V-C Continuous-Valued Approaches
Table II displays the performance of continuous-valued prediction approaches. Note that for , the parameter may be set to an arbitrary integer, following (20). We first consider results obtained for the Jazz dataset (Table II (a)–(c)). Using conditional self-prediction to estimate the NID, maximised across parameters we obtain MAP . In comparison, the cross-prediction distance yields MAP . As a baseline, we determine the cross-prediction NMSE, where maximising across parameters we obtain MAP . Table II (a)–(c) displays performance against the evaluated parameter combinations. Examining results for the MSD in Table II (d)–(f), we obtain qualitatively similar results, with maximum MAP values , and for the NID, cross-prediction distance and NMSE, respectively. For both datasets, we observe that increasing the value of consistently improves performance. In contrast, we observe no such effect for the parameters .
V-D Summary of Results and Comparison to State of the Art
Fig. 4 (a), (b) displays the result of significance testing as described in Section IV-F, where we assume 95% confidence intervals and where we maximise across the evaluated parameter spaces. Table III displays a corresponding summary of MAP scores. As baselines we include Ellis and Poliner's cross-correlation approach [41], in addition to randomly sampled pairwise distances. For the MSD, when used without any further refinement method, our filtering stage based on normalised codeword histograms yields MAP 0.0056. For both the Jazz dataset and the MSD, we observe that continuous-valued approaches based on cross-prediction consistently outperform discrete-valued approaches. Moreover, with the exception of NCD combined with PPM-based string compression for the MSD, using continuous-valued cross-prediction significantly outperforms the discrete-valued approaches. For approaches based on string compression, we note that using NCDA with BW compression significantly decreases performance with respect to NCD. Similarly, using NCDA decreases MAP scores for PPM. Although we do not observe a significant performance gain using NCDA over NCD for LZ compression, performance improves consistently across datasets. For the Jazz dataset, we observe that the JSD baseline significantly outperforms the majority of string-compression approaches. In contrast, for the MSD the majority of string-compression approaches significantly outperform the JSD baseline. Whereas PPM with the cross-prediction distance consistently outperforms all discrete-valued approaches for the Jazz dataset, PPM with compression-based NCD consistently outperforms all discrete-valued approaches for the MSD and significantly outperforms the JSD baseline.
In a comparison of continuous-valued approaches, we observe that cross-prediction, using either the cross-prediction distance or the NMSE, competes with cross-correlation for the Jazz dataset. In contrast, the same cross-prediction approaches significantly outperform cross-correlation for the MSD.
Examining continuous-valued approaches further, for both the Jazz dataset and the MSD, we observe a significant disadvantage in using our conditional self-prediction based estimate of the NID, compared with cross-prediction based distances and NMSE. The relatively poor performance of the NID for the MSD suggests a limitation of our prediction approach when used with MSD chroma features. However, considering results for both datasets suggests that cross-prediction generally yields more favourable results than conditional self-prediction.
To facilitate further comparison, we consider the approaches proposed by Bertin-Mahieux and Ellis [43] and by Khadkevich and Omologo [57], who report MAP scores of 0.0295 and 0.0371, respectively. Based on such a comparison, we obtain state-of-the-art results. Note that the stated approaches do not report any distance normalisation procedure as described in Section IV-D; we found that normalisation improves our results. For the Jazz dataset and using unnormalised distances, we obtain MAP scores , , for the NMSE, cross-prediction distance and NID, respectively. For the MSD and using unnormalised distances, we obtain MAP scores , , for the NMSE, cross-prediction distance and NID, respectively.
V-E Combining Distances
Finally, using the method described in Section IV-G, we combine distances obtained using continuous-valued prediction. Fig. 5 displays MAP scores against the mixing parameter , for the Jazz dataset and the MSD. We consider combining the cross-prediction distance with the NMSE, and additionally with the NID estimate; the latter combination we evaluate with respect to the optimal mixing parameter for the former combination.
Compared to using the baseline NMSE alone, across all mixing values and for both datasets we observe that combining the NMSE with the cross-prediction distance improves performance. For the Jazz dataset, we observe a maximal MAP score of 0.496, corresponding to a gain of %. For the MSD, we observe a maximal MAP score of , corresponding to a gain of . We observe no performance gain by further combining NID estimates with the NMSE and cross-prediction distance, obtaining maximal MAP scores 0.432 and 0.0463 for the Jazz dataset and the MSD, respectively. Additional evaluations revealed no performance gain using unnormalised distances.
Table III summarises MAP scores; in Fig. 4 (c), (d) we test for differences in performance among combinations of distances based on continuous-valued prediction. Compared to using the baseline NMSE alone, combining the NMSE with the cross-prediction distance significantly improves performance for both the Jazz dataset and the MSD. In addition, Table IV reports performance in terms of mean precision at ranks 5, 10 and 20. Matching previous observations, for the Jazz dataset and the MSD, the combination of the NMSE and the cross-prediction distance consistently outperforms the remaining combinations. At rank , relative to the NMSE baseline, we obtain a performance gain of for the Jazz dataset and for the MSD.
Dataset                   Jazz                               MSD

Method                    NCDA            NCD                NCDA              NCD
PPM                       0.220 ± 0.021   0.249 ± 0.021      0.0460 ± 0.0024   0.0487 ± 0.0025
BW                        0.143 ± 0.016   0.220 ± 0.019      0.0428 ± 0.0023   0.0480 ± 0.0024
LZ                        0.196 ± 0.019   0.168 ± 0.017      0.0457 ± 0.0024   0.0438 ± 0.0023
PPM;                      0.329 ± 0.022                      0.0428 ± 0.0022
LZ;                       0.288 ± 0.021                      0.0415 ± 0.0022
JSD                       0.289 ± 0.022                      0.0412 ± 0.0023
(continuous)              0.454 ± 0.024                      0.0498 ± 0.0025
NID (continuous)          0.346 ± 0.023                      0.0303 ± 0.0020
NMSE (continuous)         0.459 ± 0.023                      0.0499 ± 0.0025
Ellis and Poliner [41]    0.465 ± 0.024                      0.0404 ± 0.0023
Random                    0.026 ± 0.004                      0.0006 ± 0.0001
& NMSE (cont.)            0.496                              0.0516 ± 0.0025
& NID & NMSE (cont.)      0.432                              0.0463 ± 0.0024
Summary of mean average precision scores. The first three rows denote compression-based approaches. Intervals are standard errors. 'Random' denotes sampling pairwise distances from a normal distribution.
Dataset              Jazz                     MSD

Rank                 5       10      20      5        10       20
                     0.185   0.113   0.065   0.0276   0.0146   0.0077
NID                  0.133   0.075   0.045   0.0147   0.0082   0.0044
NMSE                 0.193   0.116   0.067   0.0270   0.0141   0.0075
& NMSE               0.213   0.123   0.070   0.0288   0.0150   0.0079
& NID & NMSE         0.168   0.101   0.063   0.0265   0.0146   0.0076
VI Conclusions
We have evaluated measures of pairwise predictability between time series for cover song identification. First, we consider alternative distance measures to the NCD: we propose the NCDA, which incorporates a method for obtaining joint representations of time series, in addition to methods based on cross-prediction. Second, we attend to the issue of representing time series: we propose continuous-valued prediction as a means of determining pairwise similarity, where we estimate compressibility as a statistic of the prediction error. We contrast methods requiring feature quantisation against methods directly applicable to continuous-valued features.
Our findings are threefold. First, the proposed continuous-valued approach outperforms the discrete-valued approaches and competes with the evaluated continuous baseline approaches. Second, we draw attention to cross-prediction as an alternative to the NCD, where we observe superior results in both the discrete and continuous cases for Jazz cover song identification, and in the continuous case for cover song identification using the Million Song Dataset. Third, using the NCDA, we are able to mitigate differences in performance between the evaluated discrete compression algorithms. We view these three points as evidence that, for information-based measures of similarity, a continuous-valued representation may be preferable to discrete-valued chroma representations, owing to the challenge of obtaining discrete-valued representations. Further, the NCD may yield suboptimal performance compared to alternative distance measures.
We argue that due to the ubiquity of time series similarity problems, our results are relevant to application domains extending beyond the scope of this work. Finally, in the context of cover song identification, we have demonstrated stateoftheart performance using a largescale dataset. We have shown that our distances based on continuousvalued prediction may be combined to improve performance relative to the baseline.
For future work, we aim to evaluate alternative time series models to those presently considered. To this end, further investigations might involve causal state space reconstruction [80] or recurrent neural networks such as the long short-term memory architecture [81]. We also aim to evaluate ensemble techniques for combining distances in greater detail.
References
 [1] M. A. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney, “Content-based music information retrieval: current directions and future challenges,” Proc. IEEE, vol. 96, no. 4, pp. 668–696, 2008.
 [2] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of audio fingerprinting,” Journal VLSI Signal Process., vol. 41, no. 3, pp. 271–284, 2005.
 [3] J. Serrà, “Identification of versions of the same musical composition by processing audio descriptions,” Ph.D. dissertation, Universitat Pompeu Fabra, 2011.
 [4] H. Fujihara, M. Goto, T. Kitahara, and H. G. Okuno, “A modeling of singing voice robust to accompaniment sounds and its application to singer identification and vocal-timbre-similarity-based music information retrieval,” IEEE Trans. Audio, Speech, Language Process., vol. 18, no. 3, pp. 638–648, 2010.

 [5] M. I. Mandel and D. P. W. Ellis, “Song-level features and support vector machines for music classification,” in Proc. 6th Intern. Conf. Music Information Retrieval (ISMIR), 2005, pp. 594–599.
 [6] N. Scaringella, G. Zoia, and D. J. Mlynek, “Automatic genre classification of music content: a survey,” IEEE Signal Process. Magazine, vol. 23, no. 2, pp. 133–141, 2006.
 [7] J. Serrà, E. Gómez, P. Herrera, and X. Serra, “Chroma binary similarity and local alignment applied to cover song identification,” IEEE Trans. Audio, Speech, Language Process., vol. 16, no. 6, pp. 1138–1151, 2008.
 [8] L. Meyer, Music and Emotion. University of Chicago Press, 1956.
 [9] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
 [10] D. Huron, Sweet Anticipation: Music and the Psychology of Expectation. MIT press, 2006.
 [11] M. T. Pearce and G. A. Wiggins, “Auditory expectation: The information dynamics of music perception and cognition,” Topics in Cognitive Science, vol. 4, no. 4, 2012.
 [12] S. Abdallah and M. D. Plumbley, “Information dynamics: patterns of expectation and surprise in the perception of music,” Connection Science, vol. 21, no. 2–3, pp. 89–117, 2009.
 [13] P. Foster, S. Dixon, and A. Klapuri, “Identification of cover songs using information theoretic measures of similarity,” in Proc. IEEE Intern. Conf. Acoustics, Speech, and Signal Process. (ICASSP), 2013.
 [14] M. Li and P. Vitányi, An Introduction to Kolmogorov Complexity and its Applications. Springer, 2008.
 [15] M. Li, X. Chen, X. Li, B. Ma, and P. M. B. Vitányi, “The similarity metric,” IEEE Trans. Inf. Theory, vol. 50, no. 12, pp. 3250–3264, 2004.
 [16] A. Kocsor, A. Kertész-Farkas, L. Kaján, and S. Pongor, “Application of compression-based distance measures to protein sequence classification: a methodological study,” Bioinformatics, vol. 22, no. 4, pp. 407–412, 2006.
 [17] A. Bardera, M. Feixas, I. Boada, and M. Sbert, “Image registration by compression,” Information Sciences, vol. 180, no. 7, pp. 1121–1133, 2010.
 [18] S. Wehner, “Analyzing worms and network traffic using compression,” Journal of Computer Security, vol. 15, no. 3, pp. 303–320, 2007.
 [19] M. Li and R. Sleep, “Melody classification using a similarity metric based on Kolmogorov complexity,” Sound and Music Computing, pp. 126–129, 2004.
 [20] R. Cilibrasi, P. M. B. Vitányi, and R. Wolf, “Algorithmic clustering of music based on string compression,” Computer Music Journal, vol. 28, no. 4, pp. 49–67, 2004.
 [21] M. Li and R. Sleep, “Genre classification via an LZ78-based string kernel,” in Proc. 6th Intern. Conf. Music Information Retrieval (ISMIR), 2005.
 [22] M. Helén and T. Virtanen, “A similarity measure for audio query by example based on perceptual coding and compression,” in Proc. 10th Intern. Conf. Digital Audio Effects (DAFX), 2007.
 [23] T. E. Ahonen, “Measuring harmonic similarity using PPM-based compression distance,” in Proc. Workshop Exploring Musical Information Spaces (WEMIS), 2009, pp. 52–55.
 [24] J. P. Bello, “Measuring structural similarity in music,” IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 7, pp. 2013–2025, 2011.
 [25] T. Bertin-Mahieux, D. Ellis, B. Whitman, and P. Lamere, “The million song dataset,” in Proc. 12th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2011, pp. 591–596.
 [26] A. Berenzweig, B. Logan, D. P. W. Ellis, and B. Whitman, “A largescale evaluation of acoustic and subjective musicsimilarity measures,” Computer Music Journal, vol. 28, no. 2, pp. 63–76, 2004.
 [27] B. Logan and A. Salomon, “A music similarity function based on signal analysis,” in Proc. IEEE Intern. Conf. Multimedia and Expo. (ICME), 2001, pp. 745–748.
 [28] J. Aucouturier and F. Pachet, “Music similarity measures: What’s the use?” in Proc. 3rd Intern. Conf. Music Information Retrieval (ISMIR), 2002, pp. 157–163.

 [29] J. Aucouturier, B. Defreville, and F. Pachet, “The bag-of-frames approach to audio pattern recognition: A sufficient model for urban soundscapes but not for polyphonic music,” Journal Acoustical Society of America, vol. 122, pp. 881–891, 2007.
 [30] Z. Fu, G. Lu, K. M. Ting, and D. Zhang, “Music classification via the bag-of-features approach,” Pattern Recognition Letters, vol. 32, no. 14, pp. 1768–1777, 2011.

 [31] M. Helén and T. Virtanen, “Audio query by example using similarity measures between probability density functions of features,” EURASIP Journal Audio, Speech, and Music Process., vol. 2010, 2010.
 [32] F. Lerdahl and R. Jackendoff, A Generative Theory of Tonal Music. MIT Press, 1983.
 [33] M. A. Casey and M. Slaney, “The importance of sequences in musical similarity,” in Proc. IEEE Intern. Conf. Acoustics, Speech and Signal Process. (ICASSP), vol. 5, 2006.
 [34] J. Paulus, M. Müller, and A. Klapuri, “State of the art report: Audiobased music structure analysis,” in Proc. 11th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2010, pp. 625–636.
 [35] T. Fujishima, “Real-time chord recognition of musical sound: a system using common Lisp music,” in Proc. Intern. Computer Music Conf. (ICMC), 1999, pp. 464–467.
 [36] M. A. Bartsch and G. H. Wakefield, “To catch a chorus: Using chroma-based representations for audio thumbnailing,” in IEEE Workshop Applications of Signal Process. to Audio and Acoustics (WASPAA), 2001, pp. 15–18.
 [37] J. Foote, “ARTHUR: Retrieving orchestral music by long-term structure,” in Proc. Intern. Symp. Music Information Retrieval (ISMIR), 2000.
 [38] E. Gómez and P. Herrera, “The song remains the same: Identifying versions of the same piece using tonal descriptors,” in Proc. 7th Intern. Conf. Music Information Retrieval (ISMIR), 2006.
 [39] J. Serrà, X. Serra, and R. G. Andrzejak, “Cross recurrence quantification for cover song identification,” New Journal of Physics, vol. 11, no. 9, p. 093017, 2009.
 [40] J. Serrà, H. Kantz, X. Serra, and R. G. Andrzejak, “Predictability of music descriptor time series and its application to cover song detection,” IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 2, pp. 514–525, 2012.
 [41] D. P. W. Ellis and G. E. Poliner, “Identifying ‘cover songs’ with chroma features and dynamic programming beat tracking,” in Proc. IEEE Intern. Conf. Acoustics, Speech and Signal Process. (ICASSP), vol. 4, 2007, pp. 1429–1432.
 [42] J. H. Jensen, M. G. Christensen, and S. H. Jensen, “A chroma-based tempo-insensitive distance measure for cover song identification using the 2D autocorrelation function,” in Music Information Retrieval Evaluation Exchange Task Audio Cover Song Identification, 2008.
 [43] T. Bertin-Mahieux and D. P. W. Ellis, “Large-scale cover song recognition using the 2D Fourier transform magnitude,” in Proc. 13th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2012, pp. 241–246.
 [44] W. Tsai, H. Yu, and H. Wang, “A query-by-example technique for retrieving cover versions of popular songs with similar melodies,” in Proc. 6th Intern. Conf. Music Information Retrieval (ISMIR), 2005, pp. 183–190.
 [45] J. P. Bello, “Audio-based cover song retrieval using approximate chord sequences: testing shifts, gaps, swaps and beats,” in Proc. 8th Intern. Conf. Music Information Retrieval (ISMIR), 2007, pp. 239–244.
 [46] K. Lee, “Identifying cover songs from audio using harmonic representation,” in Music Information Retrieval Evaluation Exchange Task Audio Cover Song Identification, 2006.
 [47] B. Martin, D. G. Brown, P. Hanna, and P. Ferraro, “BLAST for audio sequences alignment: a fast scalable cover identification tool,” in Proc. 13th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2012.
 [48] T. Ahonen, “Combining chroma features for cover version identification,” in Proc. 11th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2010, pp. 165–170.
 [49] J. Cleary and I. Witten, “Data compression using adaptive coding and partial string matching,” IEEE Trans. Commun., vol. 32, no. 4, pp. 396–402, 1984.
 [50] T. Ahonen, “Compression-based clustering of chromagram data: New method and representations,” in Proc. 9th Intern. Symposium Computer Music Modeling and Retrieval, 2012, pp. 474–481.
 [51] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compression algorithm,” Digital Equipment Corporation, Tech. Rep., 1994.
 [52] I. Tabus, V. Tabus, and J. Astola, “Information theoretic methods for aligning audio signals using chromagram representations,” in Proc. 5th Intern. Symp. Communications Control and Signal Process. (ISCCSP), 2012, pp. 1–4.
 [53] D. Silva, H. Papadopoulos, G. Batista, and D. Ellis, “A video compressionbased approach to measure music structural similarity,” in Proc. 14th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2013, pp. 95–100.
 [54] M. Casey, C. Rhodes, and M. Slaney, “Analysis of minimum distances in high-dimensional musical spaces,” IEEE Trans. Audio, Speech, and Language Process., vol. 16, no. 5, pp. 1015–1028, 2008.
 [55] M. Slaney and M. Casey, “Locality-sensitive hashing for finding nearest neighbors,” IEEE Signal Process. Magazine, vol. 25, no. 2, pp. 128–131, 2008.
 [56] T. Bertin-Mahieux and D. Ellis, “Large-scale cover song recognition using hashed chroma landmarks,” in IEEE Workshop Applications of Signal Process. to Audio and Acoustics (WASPAA), 2011, pp. 117–120.
 [57] M. Khadkevich and M. Omologo, “Large-scale cover song identification using chord profiles,” in Proc. 14th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2013, pp. 233–238.
 [58] D. Schnitzer, A. Flexer, and G. Widmer, “A filter-and-refine indexing method for fast similarity search in millions of music tracks,” in Proc. 10th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2009, pp. 537–542.
 [59] D. Loewenstern, H. Hirsh, P. Yianilos, and M. Noordewier, “DNA sequence classification using compression-based induction,” Center for Discrete Mathematics and Theoretical Computer Science, Tech. Rep., 1995.
 [60] J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inf. Theory, vol. 23, no. 3, pp. 337–343, 1977.
 [61] J. Ziv and N. Merhav, “A measure of relative entropy between individual sequences with application to universal classification,” IEEE Trans. Inf. Theory, vol. 39, no. 4, pp. 1270–1279, 1993.
 [62] D. Benedetto, E. Caglioti, and V. Loreto, “Language trees and zipping,” Physical Review Letters, vol. 88, no. 4, p. 48702, 2002.
 [63] R. Cilibrasi and P. M. B. Vitányi, “Clustering by compression,” IEEE Trans. Inf. Theory, vol. 51, no. 4, pp. 1523–1545, 2005.

 [64] D. Sculley and C. E. Brodley, “Compression and machine learning: A new perspective on feature space vectors,” in Proc. Data Compression Conf. (DCC), 2006, pp. 332–341.
 [65] P. Grünwald and P. M. B. Vitányi, “Shannon information and Kolmogorov complexity,” arXiv e-print cs/0410002, 2004.
 [66] A. Kaltchenko, “Algorithms for estimating information distance with application to bioinformatics and linguistics,” in Proc. IEEE Canadian Conf. Electrical and Computer Engineering (CCECE), vol. 4, 2004, pp. 2255–2258.
 [67] M. Feder, N. Merhav, and M. Gutman, “Universal prediction of individual sequences,” IEEE Trans. Inf. Theory, vol. 38, no. 4, pp. 1258–1270, 1992.
 [68] M. Feder and N. Merhav, “Relations between entropy and error probability,” IEEE Trans. Inf. Theory, vol. 40, no. 1, pp. 259–266, 1994.

 [69] R. Begleiter, R. El-Yaniv, and G. Yona, “On prediction using variable order Markov models,” Journal of Artificial Intelligence Research, vol. 22, pp. 385–421, 2004.
 [70] F. Takens, “Detecting strange attractors in turbulence,” Dynamical Systems and Turbulence, pp. 366–381, 1981.
 [71] E. Gómez, “Tonal description of music audio signals,” Ph.D. dissertation, Universitat Pompeu Fabra, 2006.
 [72] P. Foster, A. Klapuri, and M. D. Plumbley, “Causal prediction of continuous-valued music features,” in Proc. 12th Intern. Society for Music Information Retrieval Conf. (ISMIR), 2011, pp. 501–506.
 [73] D. P. W. Ellis, “Beat tracking with dynamic programming,” in Music Information Retrieval Evaluation Exchange Tasks on Audio Tempo Extraction and Audio Beat Tracking, 2006.
 [74] T. Jehan, “Analyzer documentation,” The Echo Nest, Tech. Rep., 2011.
 [75] S. Ravuri and D. P. W. Ellis, “Cover song detection: from high scores to general classification,” in Proc. IEEE Intern. Conf. Acoustics Speech and Signal Process. (ICASSP), 2010, pp. 65–68.
 [76] J. Osmalskyj, S. Pierard, M. Van Droogenbroeck, and J. Embrechts, “Efficient database pruning for large-scale cover song recognition,” in Proc. IEEE Intern. Conf. Acoustics, Speech and Signal Process. (ICASSP), 2013.
 [77] M. Friedman, “The use of ranks to avoid the assumption of normality implicit in the analysis of variance,” Journal American Statistical Association, vol. 32, no. 200, pp. 675–701, 1937.
 [78] J. W. Tukey, The Problem of Multiple Comparisons. Princeton University, 1973.

 [79] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.
 [80] C. R. Shalizi and K. L. Shalizi, “Blind construction of optimal nonlinear recursive predictors for discrete sequences,” in Proc. 20th Conf. Uncertainty in Artificial Intelligence (UAI), 2004, pp. 504–511.
 [81] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
IV Experimental Method
We first evaluate our proposed methods using a set of audio recordings of Jazz standards (available at http://www.eecs.qmul.ac.uk/~peterf/jazzdataset.html). We assume that two tracks are a cover pair if they possess identical title strings; thus, we assume a symmetric relation when determining cover identities. The equivalence class of tracks deemed to be covers of one another is a cover set. The Jazz dataset comprises 97 cover sets, with an average cover set size of 3.06 tracks.
Furthermore, we perform a large-scale evaluation based on the MSD [25]. This dataset includes metadata and precomputed audio features for a collection of Western popular music recordings. We use a predefined evaluation set of 5 236 query tracks partitioned into 1 726 cover sets (http://labrosa.ee.columbia.edu/millionsong/secondhand), with an average cover set size of 3.03 tracks. Following [43], for each query track, we seek to identify the remaining cover set members contained in the entire track collection.
IV-A Feature Extraction
For the Jazz dataset, as a representation of musical harmonic content, we extract 12-component beat-synchronous chroma features from audio using the method and implementation described in [41]. Assuming an equal-tempered scale, the method accounts for deviations in standard pitch from 440 Hz by shifting the mapping of FFT bins to pitches in the range of semitones. Following chroma extraction, beat-synchronisation is achieved using the method described in [73]. First, onset detection is performed by differencing a log-magnitude Mel-frequency spectrogram across time and applying half-wave rectification, before summing across frequency bands. After high-pass filtering the onset signal, a tempo estimate is formed by applying a window function to the autocorrelated onset signal and determining autocorrelation maxima. Varying the centre of the window function allows tempo estimation to incorporate a bias towards a preferred beat rate (PBR). The tempo estimate and onset signal are then used to obtain an optimal set of beat onsets, using dynamic programming. Chroma features are averaged over beat intervals, before applying square-root compression and normalising the chroma features with respect to the Euclidean norm. Based on our previous work [13], we evaluate using a PBR of 240 beats per minute (bpm).
The MSD includes 12-component chroma features alongside predicted note and beat onsets [74], which we use in our evaluations. In contrast to the beat-synchronous features obtained for the Jazz dataset, MSD chroma features are initially aligned to predicted onsets. Motivated by our choice of PBR for the Jazz dataset, we resample predicted beat onsets to a rate of 240 bpm. We then average chroma features over the resampled beat intervals. Finally, we normalise features as described for the Jazz dataset.
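The beat-synchronisation steps shared by both pipelines (averaging chroma over beat intervals, square-root compression, Euclidean normalisation) can be sketched as follows; the data layout (chroma frames with timestamps and a list of beat times) and all names are our assumptions.

```python
from math import sqrt

def beat_synchronous_chroma(frames, frame_times, beat_times):
    """Average chroma frames over each beat interval, then apply square-root
    compression and normalise each vector to unit Euclidean norm."""
    out = []
    for start, end in zip(beat_times[:-1], beat_times[1:]):
        # Collect the chroma frames whose timestamps fall in this beat interval.
        window = [f for f, t in zip(frames, frame_times) if start <= t < end]
        if not window:
            continue
        avg = [sum(col) / len(window) for col in zip(*window)]
        compressed = [sqrt(v) for v in avg]          # square-root compression
        norm = sqrt(sum(v * v for v in compressed)) or 1.0
        out.append([v / norm for v in compressed])   # unit Euclidean norm
    return out
```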
IV-B Key Invariance
To account for musical key variation within cover sets, we transpose chroma sequences using the optimal transposition index (OTI) method [7]. Given two chroma vector sequences, we form summary vectors $\bar{u}$, $\bar{v}$ by averaging over the entire sequences. The OTI corresponds to the number of circular shift operations applied to $\bar{v}$ which maximises the inner product between $\bar{u}$ and the shifted $\bar{v}$,

$$\mathrm{OTI} = \operatorname*{arg\,max}_{i \in \{0, \ldots, 11\}} \, \langle \bar{u}, \operatorname{circshift}(\bar{v}, i) \rangle, \qquad (29)$$

where $\operatorname{circshift}(\bar{v}, i)$ denotes applying $i$ circular shift operations to $\bar{v}$. We subsequently shift chroma vectors by $\mathrm{OTI}$ positions, prior to pairwise comparison.
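A minimal sketch of the OTI computation of (29), assuming 12-bin chroma summary vectors stored as Python lists; the helper names are ours.

```python
def circular_shift(v, i):
    """Rotate a 12-bin chroma vector by i positions."""
    return v[-i:] + v[:-i] if i else list(v)

def optimal_transposition_index(u_mean, v_mean):
    """OTI: the shift of the second summary chroma vector that maximises
    its inner product with the first."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(range(12), key=lambda i: dot(u_mean, circular_shift(v_mean, i)))
```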
IV-C Quantisation
For discrete-valued similarity measures, we quantise chroma features using the k-means algorithm. We cluster chroma features aggregated across all tracks, where we consider codebook sizes in the range . To increase stability, we execute the k-means algorithm repeatedly and select the clustering which minimises the mean squared error between data points and assigned cluster centroids. The described quantisation method performs similarly to an alternative based on pairwise sequence quantisation; for a detailed discussion we refer to our previous work [13].
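To illustrate the quantisation step, the following sketch performs nearest-centroid assignment (the assignment step of k-means) against a given codebook; fitting the codebook itself would use a full k-means implementation, and all names here are illustrative.

```python
def assign_codewords(features, codebook):
    """Assign each chroma vector to its nearest codebook centroid under
    squared Euclidean distance, yielding a discrete codeword sequence."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]
```

The resulting integer sequence is what is subsequently mapped to characters for string compression and prediction.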
IV-D Distance Measures
We summarise the distance measures evaluated in this work in Table I, where for each distance measure, we list our estimation methods.
We utilise the following algorithms to compute distance measures by compressing strings: prediction by partial matching (PPM) [49], Burrows-Wheeler (BW) compression [51] and Lempel-Ziv (LZ) compression [60], implemented respectively as PPMD (http://compression.ru/ds/), BZIP2 (http://bzip2.org) and ZLIB (http://zlib.org). In all cases, we set parameters to favour compression rate over computation time. To obtain strings, following quantisation we map integer codewords to alphanumeric characters.
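For illustration, the compression-based NCD can be computed directly from compressed string lengths. The sketch below uses Python's zlib module as a stand-in for the tools listed above; the NCD formula shown is the standard one due to Cilibrasi and Vitányi:

```python
import zlib

def c(s: bytes) -> int:
    """Compressed length in bytes, favouring compression rate (level 9)."""
    return len(zlib.compress(s, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance from compressed lengths of
    x, y and their concatenation."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```
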
We use the described compression algorithms to determine the length in bits of compressed strings and compute NCD and NCDA distances. In a complementary discrete-valued approach, we use string prediction instead of compression. Using average log-loss, we compute NCDA using the formula
$$\mathrm{NCD}_{A}(x, y) = \frac{\ell(a(x, y)) - \min\{\ell(x), \ell(y)\}}{\max\{\ell(x), \ell(y)\}}, \tag{30}$$

where $\ell(a(x, y))$ is the average log-loss obtained from performing self-prediction on the aligned sequence $a(x, y)$, and $\ell(x)$, $\ell(y)$ denote average log-losses obtained from self-prediction on $x$, $y$. We compute a prediction-based variant of NCD analogously, by predicting sequentially concatenated strings without performing any alignment. In addition, we use cross-prediction to estimate the distance measure defined in (18). We perform string prediction using Begleiter's implementations of the PPMC and LZ78 algorithms [69].
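The prediction-based quantities can be made concrete with a toy sequential model. The sketch below uses a first-order Markov model with additive smoothing in place of the PPMC and LZ78 predictors; the function names and the smoothing parameter are our own illustrative choices:

```python
import math
from collections import Counter, defaultdict

def markov_logloss(train, test, alphabet, alpha=0.5):
    """Average log-loss (bits per symbol) of test under a first-order
    Markov model fitted on train, with additive smoothing alpha."""
    counts = defaultdict(Counter)
    for a, b in zip(train, train[1:]):
        counts[a][b] += 1
    loss = 0.0
    for a, b in zip(test, test[1:]):
        n = sum(counts[a].values())
        p = (counts[a][b] + alpha) / (n + alpha * len(alphabet))
        loss -= math.log2(p)
    return loss / max(1, len(test) - 1)

def self_logloss(x, alphabet):
    """Self-prediction: fit and evaluate on the same string."""
    return markov_logloss(x, x, alphabet)

def cross_logloss(x, y, alphabet):
    """Cross-prediction: fit on x, evaluate on y."""
    return markov_logloss(x, y, alphabet)
```

A string with the same transition structure as the training string yields a low average log-loss, whereas a string with unseen transitions is penalised heavily.
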
Note that the KL divergence given in (1) is non-symmetric. In our evaluations, we observed that computing a symmetric distance improved performance; based on the KL divergence $D_{\mathrm{KL}}$, we compute the Jensen-Shannon divergence (JSD), defined as

$$\mathrm{JSD}(p, q) = \tfrac{1}{2} D_{\mathrm{KL}}(p \parallel m) + \tfrac{1}{2} D_{\mathrm{KL}}(q \parallel m), \tag{31}$$

where $m$ denotes the mean of $p$ and $q$,

$$m = \tfrac{1}{2}(p + q). \tag{32}$$
As a baseline method, we compute the JSD between symbol histograms normalised to sum to one.
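The JSD baseline between normalised symbol histograms can be computed as follows (a minimal sketch; base-2 logarithms are assumed, and terms with zero probability mass contribute zero to the KL sum):

```python
import numpy as np

def kl(p, q):
    """KL divergence in bits; terms with p == 0 contribute zero."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q):
    """Jensen-Shannon divergence between two distributions."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def histogram_jsd(codes_x, codes_y, k):
    """Baseline distance: JSD between normalised symbol histograms of
    two quantised sequences, for codebook size k."""
    hx = np.bincount(codes_x, minlength=k) / len(codes_x)
    hy = np.bincount(codes_y, minlength=k) / len(codes_y)
    return jsd(hx, hy)
```
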
We evaluate continuous-valued prediction using a range of time-delay embedding parameters, setting the exclusion radius in (24) based on preliminary analysis using separate training data. We compute the cross-prediction distance measure by using cross-prediction to estimate the numerator in (18). In a complementary approach, we estimate the NID using conditional self-prediction to estimate the numerator in (12). For the cross-prediction distance and the NID, we use self-prediction to estimate the denominators in (18) and (12), respectively.
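A minimal sketch of the continuous-valued approach: time-delay embedding followed by nearest-neighbour cross-prediction, with the NMSE as error measure. For simplicity, the sketch omits the exclusion radius, uses a single nearest neighbour, and the parameter values are illustrative:

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Time-delay embedding of a 1-D series into dim-dimensional
    vectors with delay tau."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i:i + n] for i in range(0, dim * tau, tau)], axis=1)

def cross_prediction_nmse(x, y, dim=3, tau=5, horizon=1):
    """Predict y one step ahead using nearest neighbours found in the
    embedding of x; return the normalised mean squared error."""
    ex = delay_embed(x, dim, tau)[:-horizon]
    ey = delay_embed(y, dim, tau)[:-horizon]
    targets_x = x[(dim - 1) * tau + horizon:]
    targets_y = y[(dim - 1) * tau + horizon:]
    preds = []
    for v in ey:
        j = np.linalg.norm(ex - v, axis=1).argmin()  # nearest neighbour in x
        preds.append(targets_x[j])
    mse = np.mean((np.array(preds) - targets_y) ** 2)
    return mse / np.var(targets_y)
```

When the two series share the same underlying dynamics, cross-prediction error is low; for unrelated series, the NMSE approaches or exceeds one.
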
Finally, to compensate for cover song candidates consistently deemed similar to query tracks, we normalise pairwise distances using the method described in [75]. We apply distance normalisation as a post-processing step, before computing performance statistics.
Distance measure           | Definition | Estimation method
NCD                        | Eqn. 2     | String compression (LZ, BW, PPM); discrete prediction (LZ, PPM)
NCDA                       | Eqn. 13    | String compression (LZ, BW, PPM); discrete prediction (LZ, PPM)
Cross-prediction distance  | Eqn. 18    | Discrete prediction; continuous prediction
JSD                        | Eqn. 31    | Normalised symbol histograms
NID                        | Eqn. 12    | Continuous prediction
IV-E Large-scale Cover Song Identification
For music content analysis involving large datasets, algorithm scalability is an important issue. By themselves, the approaches in this work require a linear scan through the dataset for a given query, which may be infeasible for large datasets. For our evaluations involving the MSD, we therefore use a scalable approach. Following [57] and similar to the method proposed in [76], we incorporate our methods into a two-stage retrieval process. By using a metric distance to determine similarity in the first retrieval stage, we allow for the potential use of indexing or hashing schemes, as proposed in [54, 58]. We then apply non-metric pairwise comparisons in the second retrieval stage.
In the first stage, we quantise features as described in Section IV-C and represent each track with a normalised codeword histogram. Given a query track, we then rank each of the candidate tracks using the L1 distance. To account for key variation, for each candidate track we minimise the L1 distance across chroma rotations. We then determine the top-ranked candidate tracks, which we re-rank in the second stage using our proposed methods. After both retrieval stages, we normalise pairwise distances as described in Section IV-D. We report performance based on the final ranking of all candidate tracks, across query tracks.
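The first retrieval stage can be sketched as follows. For illustration we assume 12-bin histograms whose bins rotate circularly with musical key, so that key invariance reduces to minimising over 12 rotations; the function names and shortlist size are hypothetical, and the paper's codeword histograms handle rotation at the codebook level:

```python
import numpy as np

def l1_key_invariant(h_query, h_cand):
    """L1 distance minimised over 12 circular rotations, assuming
    histogram bins that rotate with musical key (a simplification)."""
    return min(np.abs(h_query - np.roll(h_cand, r)).sum() for r in range(12))

def shortlist(h_query, candidate_hists, top_n=100):
    """First retrieval stage: rank candidates by key-invariant L1
    distance and return indices of the top_n closest tracks."""
    d = np.array([l1_key_invariant(h_query, h) for h in candidate_hists])
    return np.argsort(d)[:top_n]
```

The returned shortlist would then be re-ranked with the non-metric pairwise comparisons of the second stage.
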
IV-F Performance Statistics
As used in [24], we quantify cover song identification accuracy using mean average precision (MAP), based on ranking tracks according to distance with respect to queries. The MAP is obtained by averaging per-query average precision scores, where we may interpret each score as the average of precision values at the ranks of relevant tracks; in our case, relevant tracks are covers of the query track. Following [24], we use the Friedman test [77] to compare accuracies among distance measures. The Friedman test is based on ranking, across queries, each distance measure according to average precision. We combine the Friedman test with Tukey's range test [78] to adjust for Type I errors when performing multiple comparisons.
As a subsidiary performance measure, for each query we compute the precision at rank $r$, for a range of ranks. We subsequently average across queries to obtain the mean precision at rank $r$.
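These performance statistics can be computed as follows; `ranked_relevance` is a hypothetical binary list marking which tracks in a query's ranking are covers of that query:

```python
import numpy as np

def average_precision(ranked_relevance):
    """Average precision for one query: the mean of precision values
    at the ranks of relevant (cover) tracks."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at * rel).sum() / rel.sum())

def mean_average_precision(all_ranked_relevance):
    """MAP: average of per-query average precision scores."""
    return float(np.mean([average_precision(r) for r in all_ranked_relevance]))

def precision_at_rank(ranked_relevance, r):
    """Precision among the top r ranked tracks for one query."""
    return float(np.asarray(ranked_relevance[:r], dtype=float).mean())
```
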
IV-G Combining Distance Measures
To determine if combining distance measures improves cover song identification accuracy, we obtain pairwise distances as described in Section IV-D. We denote with $d_{i,j,k}$ the pairwise distance between the $i$th query track and the $j$th result candidate, obtained using the $k$th distance measure in our evaluation. We transform $d_{i,j,k}$ by computing the inverse rank

$$\delta_{i,j,k} = \rho_{i,j,k}^{-1}, \tag{33}$$

where $\rho_{i,j,k}$ denotes the rank of $d_{i,j,k}$ among all distances obtained with respect to query track $i$, given the $k$th distance measure. We apply this transformation to protect against outliers, while ensuring that the transformed value changes rapidly among track pairs deemed highly similar. Note that since our distance transformation preserves monotonicity and MAP itself is based on ranked distances, the performance of unmixed distance measures is uninfluenced by this transformation. Finally, we combine distances by computing a weighted average of inverse ranks pooled using min and max operators,

$$\delta_{i,j} = \lambda \min_{k} \delta_{i,j,k} + (1 - \lambda) \max_{k} \delta_{i,j,k}, \tag{34}$$

where we vary $\lambda$ in the range $[0, 1]$. We motivate our approach on the basis that we may interpret inverse ranks as estimated probabilities of cover identities; furthermore, the min and max operators have been proposed as a means of combining probability estimates for classification [79]. In forming a linear combination, we evaluate the utility of min pooling versus max pooling. An alternative approach based on straightforward averaging did not yield any performance gain.
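One reading of the transformation and combination rule can be sketched as follows. Since the source equations are not fully recoverable here, the min/max pooling of inverse ranks below is our interpretation of the description above, with a higher combined score indicating greater similarity:

```python
import numpy as np

def inverse_ranks(distances):
    """Per-measure inverse ranks: distances has shape
    (n_measures, n_candidates); rank 1 is the closest candidate."""
    ranks = distances.argsort(axis=1).argsort(axis=1) + 1
    return 1.0 / ranks

def combine(distances, lam):
    """Weighted average of min- and max-pooled inverse ranks across
    measures; lam in [0, 1] trades min pooling against max pooling."""
    inv = inverse_ranks(distances)
    return lam * inv.min(axis=0) + (1 - lam) * inv.max(axis=0)
```
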
IV-H Baseline Approaches
In addition to the JSD and cross-prediction NMSE baselines, we include an evaluation of the cross-correlation based method and implementation described in [41]. As a random baseline, we sample pairwise distances from a normal distribution.
V Results
V-A Discrete-Valued Approaches Based on Compression
In Fig. 2 (a)–(c), we examine the performance of discrete-valued NCD and NCDA distance measures, combined with the LZ, BW and PPM algorithms, for the Jazz dataset. For the LZ algorithm, NCDA yields a relative performance gain, averaged across codebook sizes. In contrast, for PPM, with the exception of small codebook sizes, NCDA yields no consistent improvement over NCD; however, averaged across codebook sizes we obtain a mean relative performance gain. Finally, the effect of using NCDA is reversed for BW compression, where performance decreases on average.
Examining results for the MSD in Fig. 2 (e)–(g), we observe qualitatively similar results for the LZ and BW algorithms. For the LZ algorithm, NCDA yields an average relative performance gain, whereas for BW compression we observe an average relative performance loss. In contrast to the Jazz dataset, for PPM we also observe an average relative performance loss.
For both datasets, NCDA appears most advantageous combined with LZ compression, whereas BW compression yields the least advantageous result. Note that BW compression is block-based, in contrast to the LZ and PPM compressors, both of which are sequential. We attribute this observation to performance differences among compressors, since the assumptions made in Section III-B rely on assuming Markov sources. Noting differences in relative performance gains between datasets, following [57] we further conjecture that the chroma feature representation influences the performance of the evaluated distance measures.
We examine the performance of the JSD between normalised symbol histograms, as displayed in Fig. 2 (d), (h). Surprisingly, for the Jazz dataset and across a range of codebook sizes, the JSD outperforms the compression-based methods. This result is contrary to our expectation that NCD approaches should outperform the bag-of-features approach by accounting for temporal structure in time series. In contrast, for the MSD and for optimal codebook sizes, both NCD and NCDA outperform the JSD across all evaluated compression algorithms. We attribute this disparity to differences in dataset size, where for the Jazz dataset the problem size may be sufficiently small to amortise the advantages of using NCD and NCDA compared to the JSD.
V-B Discrete-Valued Approaches Based on Prediction
In Fig. 3, we consider the performance of distance measures based on string prediction. For the Jazz dataset, comparing log-loss estimates of NCD and NCDA using the LZ algorithm, NCDA outperforms NCD when averaged across codebook sizes, yielding a mean relative performance gain (Fig. 3 (a)). For the PPM algorithm, although NCD maximises performance, we obtain a mean relative performance gain using NCDA over NCD when averaging across codebook sizes (Fig. 3 (b)). Importantly, for both LZ and PPM, the cross-prediction distance consistently outperforms NCD and NCDA, with maximum performance obtained in combination with PPM. For the MSD and using LZ compression, in contrast to the Jazz dataset, we observe a mean relative performance loss when comparing the cross-prediction distance with NCDA. For both LZ and PPM, NCDA compared to NCD yields mean relative performance gains.
V-C Continuous-Valued Approaches
Table II displays the performance of continuous-valued prediction approaches against the evaluated parameter combinations. Note that following (20), one of the embedding parameters may be set to an arbitrary integer. We first consider results obtained for the Jazz dataset (Table II (a)–(c)), where we compare MAP scores, maximised across parameters, for the NID estimated using conditional self-prediction, the cross-prediction distance, and the baseline cross-prediction NMSE. Examining results for the MSD in Table II (d)–(f), we obtain qualitatively similar results for the NID, cross-prediction distance and NMSE. For both datasets, we observe that increasing the value of one embedding parameter consistently improves performance; in contrast, we observe no such effect for the remaining parameters.
V-D Summary of Results and Comparison to State of the Art
Fig. 4 (a), (b) displays the results of significance testing as described in Section IV-F, where we assume 95% confidence intervals and where we maximise across evaluated parameter spaces. Table III displays a corresponding summary of MAP scores. As baselines, we include Ellis and Poliner's cross-correlation approach [41], in addition to randomly sampled pairwise distances. For the MSD, when used without any further refinement method, our filtering stage based on normalised codeword histograms yields MAP 0.0056. For both the Jazz dataset and the MSD, we observe that continuous-valued approaches based on cross-prediction consistently outperform discrete-valued approaches. Moreover, with the exception of NCD combined with PPM-based string compression for the MSD, continuous-valued cross-prediction significantly outperforms discrete-valued approaches. For approaches based on string compression, we note that using NCDA with BW compression significantly decreases performance with respect to NCD. Similarly, using NCDA decreases MAP scores for PPM. Although we do not observe a significant performance gain using NCDA over NCD for LZ compression, performance improves consistently across datasets. For the Jazz dataset, we observe that the JSD baseline significantly outperforms the majority of string-compression approaches. In contrast, for the MSD the majority of string-compression approaches significantly outperform the JSD baseline. Whereas PPM with the cross-prediction distance consistently outperforms all discrete-valued approaches for the Jazz dataset, PPM with compression-based NCD consistently outperforms all discrete-valued approaches for the MSD and significantly outperforms the JSD baseline.
In a comparison of continuous-valued approaches, we observe that cross-prediction using either the cross-prediction distance or the NMSE competes with cross-correlation for the Jazz dataset. In contrast, the same cross-prediction approaches significantly outperform cross-correlation for the MSD.
Examining continuous-valued approaches further, for both the Jazz dataset and the MSD, we observe a significant disadvantage in using our conditional self-prediction based estimate of the NID, compared to cross-prediction based distances and the NMSE. The relatively poor performance of the NID for the MSD suggests a limitation of our prediction approach when used with MSD chroma features. However, considering results for both datasets suggests that cross-prediction generally yields more favourable results than conditional self-prediction.
To facilitate further comparison, we consider the approaches proposed by Bertin-Mahieux and Ellis [43] and by Khadkevich and Omologo [57], who report MAP scores of 0.0295 and 0.0371, respectively. Based on such a comparison, we obtain state-of-the-art results. Note that the stated approaches do not report any distance normalisation procedure as described in Section IV-D; we found that normalisation improves our results: for both the Jazz dataset and the MSD, using unnormalised distances yields lower MAP scores for the NMSE, the cross-prediction distance and the NID.
V-E Combining Distances
Finally, using the method described in Section IV-G, we combine distances obtained using continuous-valued prediction. Fig. 5 displays MAP scores against the mixing parameter $\lambda$, for the Jazz dataset and the MSD. We consider combining the NMSE with the cross-prediction distance, and additionally combining both with the NID; we evaluate the latter combination with respect to the optimal $\lambda$ for the former combination.
Compared to using the baseline NMSE alone, across all $\lambda$ and for both datasets, we observe that combining the NMSE with the cross-prediction distance improves performance: for the Jazz dataset, we observe a maximal MAP score of 0.496. For the MSD, we likewise observe a gain in maximal MAP score. We observe no performance gain by further combining NID estimates with the NMSE and cross-prediction distance, obtaining maximal MAP scores of 0.432 and 0.0463 for the Jazz dataset and MSD, respectively. Additional evaluations revealed no performance gain using unnormalised distances.
Table III summarises MAP scores; in Fig. 4 (c), (d) we test for differences in performance among combinations of distances based on continuous-valued prediction. Compared to using the baseline NMSE alone, combining the NMSE with the cross-prediction distance significantly improves performance for both the Jazz dataset and the MSD. In addition, Table IV reports performance in terms of mean precision at selected ranks. Matching previous observations, for both the Jazz dataset and the MSD, the combination of NMSE and cross-prediction distance consistently outperforms the remaining combinations, yielding a performance gain relative to the NMSE baseline for both datasets.
Dataset |             Jazz              |               MSD
Method  | NCDA          | NCD           | NCDA            | NCD
PPM     | 0.220 ± 0.021 | 0.249 ± 0.021 | 0.0460 ± 0.0024 | 0.0487 ± 0.0025
BW      | 0.143 ± 0.016 | 0.220 ± 0.019 | 0.0428 ± 0.0023 | 0.0480 ± 0.0024
LZ      |               |               |                 |