The "Horse" Inside: Seeking Causes Behind the Behaviours of Music Content Analysis Systems

by   Bob L. Sturm, et al.

Building systems that possess the sensitivity and intelligence to identify and describe high-level attributes in music audio signals continues to be an elusive goal, but one that surely has broad and deep implications for a wide variety of applications. Hundreds of papers have so far been published toward this goal, and great progress appears to have been made. Some systems produce remarkable accuracies at recognising high-level semantic concepts, such as music style, genre and mood. However, it might be that these numbers do not mean what they seem. In this paper, we take a state-of-the-art music content analysis system and investigate what causes it to achieve exceptionally high performance in a benchmark music audio dataset. We dissect the system to understand its operation, determine its sensitivities and limitations, and predict the kinds of knowledge it could and could not possess about music. We perform a series of experiments to illuminate what the system has actually learned to do, and to what extent it is performing the intended music listening task. Our results demonstrate how the initial manifestation of music intelligence in this state-of-the-art can be deceptive. Our work provides constructive directions toward developing music content analysis systems that can address the music information and creation needs of real-world users.





1 Introduction

A significant amount of research in the disciplines of music content analysis and content-based music information retrieval (MIR) is plagued by an inability to distinguish between solutions and “horses” [Gouyon et al. (2013), Urbano et al. (2013), Sturm (2014a), Sturm (2014b)]. In its most basic form, a “horse” is a system that appears as if it is solving a particular problem when it actually is not [Sturm (2014a)]. This was exactly the case with Clever Hans [Pfungst (1911)], a real horse that was claimed to be capable of doing arithmetic and other feats of abstract thought. Clever Hans appeared to answer complex questions posed to him, but he had actually learned to respond to involuntary cues of his many inquisitors confounded with the tapping of his hoof the correct number of times. The “trick” evaded discovery for a few reasons: 1) the cues were nearly undetectable; and 2) in light of undetected cues, the demonstration was thought by many to constitute valid evidence for the claim that the horse possessed such abilities. It was not until controlled experiments were designed and implemented that his true abilities were discovered [Pfungst (1911)].

If the aim of a music content analysis system is to enhance the connection between users (e.g., private listener, professional musician, scholar, journalist, family, organisation and business) and music (e.g., recordings in the format of a score and audio recording) and information about music (e.g., artist, tempi, instrumentation and title) [Casey et al. (2008)] – and to do so at a far lower cost than that required of human labor – then the system must operate with characteristics and criteria relevant to the information needs of users. For instance, a relevant characteristic for generating tempo information is periodic onsets; an irrelevant characteristic is instrumentation. If the aim of a music content analysis system is to facilitate creative pursuits, such as composing or performing music in particular styles [Dubnov et al. (2003), Dubnov and Surges (2014)], then it must operate with characteristics and criteria relevant to the creative needs of users. For instance, a relevant criterion for a Picardy third is the suggestion of a minor resolution; an irrelevant criterion is avoidance of parallel fifths. The importance of “relevant criteria” in music content analysis is evinced by frustration surrounding what has been termed the “semantic gap”: a chasm of disconnection between accessible but low-level features and high-level abstract ideas [Aucouturier (2009), Wiggins (2009), Turnbull et al. (2008)].

A music content analysis system’s reproduction of dataset ground truth is, by and large, considered valid evidence that the system is using relevant characteristics and criteria, or possesses “musical knowledge,” or has learned to listen to music in a way that is meaningful with respect to some music listening task. In one of the most cited papers in MIR, Tzanetakis and Cook (2002) train and test several systems with what would become the most-used public benchmark dataset in music genre recognition [Sturm (2014b)]. Since these systems reproduced an amount of ground truth inconsistent with that expected when choosing classes randomly, Tzanetakis and Cook concluded that the features “provide some information about musical genre and therefore musical content in general,” and even that the systems’ performances are comparable to that of humans [Tzanetakis and Cook (2002)]. Tsunoo et al. (2009) conclude from such evidence that the features they propose “have enough information for genre classification because [classification accuracy] is significantly above the baselines of random classification.” Measuring the reproduction of the ground truth of a dataset is typical when developing content analysis systems. For instance, Song et al. (2012) and Su et al. (2014) perform a large number of computational experiments to find the “most relevant” features, “optimal” parameters, “best” classifiers, and combinations thereof, all defined with respect to the reproduction of the ground truth.

The measurement of reproduced ground truth has been thought to be objective. Gouyon et al. (2004) avoid the “pitfall” of subjective evaluation of rhythm descriptors by “measuring their rate of success in genre classification experiments.” Lidy and Rauber (2008) argue that such “directly measured” numbers “[facilitate] (1) the comparison of feature sets and (2) the assessment of the suitability of particular classifiers for specific feature sets.” This is also echoed by Tzanetakis and Cook (2002) (see also the lecture by G. Tzanetakis, “UVic MIR Course”, 2014). During the 10-year life-span of MIREX – an established annual event that facilitates the exchange and scientific evaluation of new techniques for a wide variety of music content analysis tasks [Downie (2004, 2008); Downie et al. (2010); Cunningham et al. (2012)] – thousands of systems have been ranked according to the amount of ground truth they reproduce. Several literature reviews, e.g., Scaringella et al. (2006); Fu et al. (2011); Humphrey et al. (2013), tabulate results of many published experiments, and make conclusions about which features and classifiers are “useful” for listening tasks such as music genre and mood classification. Bergstra et al. (2006) remark on the progress up to that time, “Given the steady and significant improvement in classification [accuracy], we wonder if automatic methods are not already more efficient at learning genres than some people.” Seven years later, Humphrey et al. (2013) surmise from the plateauing of such numbers that progress in MIR has stalled. However, could it be that progress was never made at all? Might it be that the precise measurement of reproduced ground truth is not a reliable reflection of the “intelligence” so hoped for?

Figure 1: Figure of merit (FoM) of the music content analysis system DeSPerF-BALLROOM, the cause of which we seek in this article. Column is ground truth label, and row is class selected by system. Off diagonals are confusions. Precision is the right-most column, F-score is the bottom row, recall is the diagonal, and normalised accuracy (mean recall) is at bottom-right corner.
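The layout described in the caption of Fig. 1 can be made concrete with a small sketch. It assumes only the convention stated there (columns are ground truth, rows are the classes selected by the system); the function name `figures_of_merit` and the three-class matrix are hypothetical and are not taken from Fig. 1.

```python
import numpy as np

def figures_of_merit(confusion):
    """confusion[i, j] counts test items with true label j that were given label i."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)
    recall = correct / confusion.sum(axis=0)     # normalise each column (ground truth)
    precision = correct / confusion.sum(axis=1)  # normalise each row (selected class)
    f_score = 2.0 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "f_score": f_score,
        "normalised_accuracy": recall.mean(),    # mean recall over classes
    }

# Hypothetical three-class confusion matrix, for illustration only.
toy = np.array([[8, 1, 0],
                [2, 9, 1],
                [0, 0, 9]])
fom = figures_of_merit(toy)
```

The sketch does not guard against a class that is never selected or never present, in which case the corresponding ratio is undefined.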

Consider the systems reproducing the most ground truth in the 2013 MIREX edition of the “Audio Latin Music Genre classification task” (ALGC). The aim of ALGC is to compare music content analysis systems built from algorithms submitted by participants in the task of classifying the music genres of recordings in the benchmark Latin Music Dataset (LMD) [Silla et al. (2008)]. In ALGC, participants submit their feature extraction and machine learning algorithms, a MIREX organiser then uses these to build music content analysis systems, applies them to subsets of LMD, and computes a variety of figures of merit (FoM) based on the reproduction of ground truth. In ALGC of 2013, the most ground truth (accuracy of ) was reproduced by systems built using deep learning [Pikrakis (2013)]. Figure 1 shows the FoM of the system resulting from using the same winning algorithms, but training and testing it with the public benchmark BALLROOM dataset [Dixon et al. (2004)]. (LMD is not public.)

Of little doubt is that the classification accuracy of this system – DeSPerF-BALLROOM – greatly exceeds that expected when selecting labels of BALLROOM randomly. The system has clearly learned something. Now, what is that something? What musical characteristics and criteria – “musical knowledge” – is this system using? How do the internal models of the system reflect the music “styles” in BALLROOM? Are the labels of BALLROOM even related to “style”? What has this system actually learned to do with BALLROOM? The success of DeSPerF-BALLROOM for the analytic or creative objectives of music content analysis turns on the cause of Fig. 1. How is the cause relevant to a user’s music information or creation needs? Is the system actually fit to enhance the connections between users, music, and information about music? Or is it as Clever Hans, only appearing to be intelligent?

In this article, it is the cause of Fig. 1 with which we are principally concerned. We seek to answer what this system has learned about the music it is classifying, its musical intelligence, i.e., its decision machinery involving high-level acoustic and musical characteristics of the “styles” from which the recordings in BALLROOM appear to be sampled. Broader still, we seek to encourage completely new methods for evaluating any music content analysis system with respect to its objective. It would have been a simple matter if Hans could have been asked how he was accomplishing his feat; but the nature of his “condition” allowed only certain questions to be asked. In the end, it was not about finding the definitive set of questions that accurately measured his mental brawn, but of thinking skeptically and implementing appropriately controlled experiments designed to test hypotheses like, “Clever Hans can solve problems of arithmetic.” One faces the same problem in evaluating music content analysis systems: the kinds of questions that can be asked are limited. For DeSPerF-BALLROOM in Fig. 1, a “question” must come in the form of a 220,500-dimensional vector (10 second monophonic acoustic music signal uniformly sampled at 22050 Hz). Having the system try to classify the Waltz recordings that are thought to be the hardest in some way will not illuminate much about the criteria it is using, the sanity of its internal models, or the causes of the FoM in Fig. 1. Ways forward are given by adopting and adapting Pfungst’s approach to testing Clever Hans [Pfungst (1911)], and above all not dispensing with skepticism.

Teaching a machine to listen to music, to automatically recognise music style or genre, are achievements so great that they require extraordinary and valid evidence. That these tasks defy the explicit definition necessary to the formal nature of algorithms produces great pause in accepting Fig. 1 as evidence that DeSPerF-BALLROOM is, unlike Clever Hans, not a “horse.” In Section 2, we provide a brief but explicit definition of the problem of music content analysis, and what a music content analysis system is. In Section 3, we dissect DeSPerF-based systems, describing in detail their construction and operation. In Section 4, we analyse the methods of teaching and testing encompassed by the BALLROOM dataset. We are then prepared for the series of experiments in Section 5 that seek to explain Fig. 1. We discuss our results more broadly in Section 6. We make available a reproducible research package with which one may generate all figures and tables in this article: (Made anonymous for the time being.)

2 The problem of music content analysis

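Since this article is concerned with algorithms defined in no uncertain terms by a formal language and posed to solve some problem of music content analysis, we must define what all these things are. Denote the music universe $\Omega$, the music recording universe $\mathcal{R}_\Omega$ (notated or performed), vocabularies $\mathcal{F}$ (features) and $\mathcal{V}$ (tokens), and define the Boolean semantic rules $A'(f)$ and $A(s)$, where $f$ is a sequence of features from $\mathcal{F}$ and $s$ is a sequence of tokens from $\mathcal{V}$. Define the semantic universe built from $\mathcal{V}$ and $A$:

$$\mathcal{S}_{\mathcal{V},A} := \{ s \in \mathcal{V}^n \mid n \in \mathbb{N} \wedge A(s) \}. \tag{1}$$

The semantic feature universe $\mathcal{S}_{\mathcal{F},A'}$ is similarly built using $\mathcal{F}$ and $A'$. Define a use case as the specification of $\Omega$, $\mathcal{R}_\Omega$, $\mathcal{S}_{\mathcal{V},A}$, and a set of success criteria.

A music universe $\Omega$ is the set of intangible music – whatever that is [Goehr (1994)] – from which the tangible music recording universe $\mathcal{R}_\Omega$ is produced. This distinction is important because the real world contains only tangible records of music. One can point to a score of Beethoven’s 5th, but not to Beethoven’s 5th. Perhaps one wishes to say something about Beethoven’s 5th, or about a recording of Beethoven’s 5th. These are categorically different. The definition of $\Omega$ specifies the music in a use case, e.g., “music people call ‘disco’.” The definition of $\mathcal{R}_\Omega$ includes the specification of the dimensions of the tangible material, e.g., “30 second audio recording uniformly sampled at 44.1 kHz of an element of $\Omega$.” The definition of $\mathcal{S}_{\mathcal{V},A}$ provides the semantic space in which elements of $\mathcal{R}_\Omega$ are described. Finally, the success criteria of a use case specify requirements for music content analysis systems to be deemed successful.

A music content analysis system is a map from $\mathcal{R}_\Omega$ to $\mathcal{S}_{\mathcal{V},A}$:

$$\mathscr{S} : \mathcal{R}_\Omega \to \mathcal{S}_{\mathcal{V},A} \tag{2}$$

which itself is a composition of two maps, $\mathscr{E} : \mathcal{R}_\Omega \to \mathcal{S}_{\mathcal{F},A'}$ and $\mathscr{C} : \mathcal{S}_{\mathcal{F},A'} \to \mathcal{S}_{\mathcal{V},A}$. The map $\mathscr{E}$ is commonly known as a “feature extractor,” taking $\mathcal{R}_\Omega$ to $\mathcal{S}_{\mathcal{F},A'}$; the map $\mathscr{C}$ is commonly known as a “classifier” or “regression function,” mapping $\mathcal{S}_{\mathcal{F},A'}$ to $\mathcal{S}_{\mathcal{V},A}$. The problem of music content analysis is to build a system $\mathscr{S}$ that meets the success criteria of a use case. A typical procedure for building an $\mathscr{S}$ is to seek a way to reproduce all the ground truth of a recorded music dataset, defined as an indexed sequence of tuples sampled in some way from the population $\mathcal{R}_\Omega \times \mathcal{S}_{\mathcal{V},A}$, i.e.,

$$\mathcal{D} := \big( (r_i, s_i) \in \mathcal{R}_\Omega \times \mathcal{S}_{\mathcal{V},A} : i \in \mathcal{I} \big) \tag{3}$$

where $\mathcal{I}$ indexes the dataset. We call $(s_i)_{i \in \mathcal{I}}$ the ground truth of $\mathcal{D}$.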
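The structure just defined, a system as a feature extractor followed by a classifier, evaluated by how much dataset ground truth it reproduces, can be sketched in a few lines. All names here (`compose`, `reproduced_ground_truth`, the toy extractor, classifier and dataset) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Recording = Sequence[float]
Features = List[List[float]]
Label = str

def compose(extract: Callable[[Recording], Features],
            classify: Callable[[Features], Label]) -> Callable[[Recording], Label]:
    """Build a system as the composition classify(extract(recording))."""
    def system(recording: Recording) -> Label:
        return classify(extract(recording))
    return system

def reproduced_ground_truth(system: Callable[[Recording], Label],
                            dataset: Sequence[Tuple[Recording, Label]]) -> float:
    """Fraction of dataset tuples for which the system output equals the ground truth."""
    hits = sum(1 for recording, truth in dataset if system(recording) == truth)
    return hits / len(dataset)

# Toy extractor and classifier: mean absolute amplitude, thresholded at 0.5.
def toy_extract(recording: Recording) -> Features:
    return [[sum(abs(x) for x in recording) / len(recording)]]

def toy_classify(features: Features) -> Label:
    return "loud" if features[0][0] > 0.5 else "quiet"

toy_system = compose(toy_extract, toy_classify)
toy_dataset = [([0.9, -0.8], "loud"), ([0.1, 0.0], "quiet"), ([0.7, 0.6], "quiet")]
score = reproduced_ground_truth(toy_system, toy_dataset)
```

Note that `reproduced_ground_truth` only compares outputs with labels. It says nothing about which characteristics of the recordings the system used to produce those outputs.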

As a concrete example, take the Shazam music content analysis system [Wang (2003)]. One can define its use case as follows. The music universe and the music recording universe are defined entirely from the digitised music recordings in the Shazam database. The music universe is defined as the set of music exactly as it appears in specific recordings. The music recording universe is defined by all 10-second audio recordings of elements of the music universe. The semantic universe is defined as a set of single tokens, each token consisting of an artist name, song title, album title, and other metadata. The Shazam music content analysis system first maps a 10-second audio recording to an element of the semantic feature universe consisting of many tuples of time-frequency anchors. The classifier then finds matching time-frequency anchors in a database of all time-frequency anchors from the music recording universe, and finally picks an element of the semantic universe. The success criteria might include making correct mappings (retrieving the correct song and artist name of the specific music heard) in adverse recording conditions, or increased revenue from music sales.

3 DeSPerF-based Music Content Analysis Systems

In the following subsections, we dissect DeSPerF-BALLROOM, first analysing its feature extraction, and then its classifier. This helps determine its sensitivities and limitations. The feature extraction of DeSPerF-based systems maps the music recording universe to the semantic feature universe, using spectral periodicity features (SPerF), first proposed by Pikrakis (2013). Its classifier maps the semantic feature universe to the semantic universe using deep neural networks (DNN). In the case of DeSPerF-BALLROOM in Fig. 1, the vocabulary of tokens is {“Cha cha”, “Jive”, “Quickstep”, “Rumba”, “Samba”, “Tango”, “Waltz”}.

3.1 Feature extraction

SPerF describe temporal periodicities of modulation sonograms. The hope is that SPerF reflect, or are correlated with, high-level musical characteristics such as tempo, meter and rhythm Pikrakis (2013). The feature extraction is defined by six parameters: . It takes an element of and partitions it into multiple signal segments of duration seconds (s) which hop by s. Each signal segment is divided into frames of duration s with a hop of s. From the ordered frames of a segment, a sequence of the first Mel-frequency cepstral coefficients (MFCCs) are computed, which we call the segment modulation sonogram


where is a vector of MFCCs extracted from the frame spanning time , and is the index of the last vector.

The MFCCs of a frame are computed by a modification of the approach of Slaney (1998). The magnitude discrete Fourier transform (DFT) of a Hamming-windowed frame is weighted by a “filterbank” of 64 triangular filters, the centre frequencies of which are spaced by one semitone. Figure 2 shows these filters. Each filter is weighted inversely proportional to its bandwidth. The lowest centre frequency is 110 Hz, and the highest is 4.43 kHz. Irregularities in filter shape at low frequencies arise from the uniform resolution of the DFT and the frame duration (100 ms for the system of Fig. 1). Finally, the discrete cosine transform (DCT) of the rectified filterbank output is taken, and the leading MFCCs (13 for the system of Fig. 1) are selected to form the MFCC vector of the frame. The period corresponding to the $k$th MFCC is $128/k$ semitones for $k > 0$, and infinite for $k = 0$. The first MFCC ($k = 0$) is related to the mean energy over all 64 semitones. The third MFCC is related to the amount of energy of a component with a period of the entire filterbank. And the eleventh MFCC is related to the amount of energy of a component with a period of about an octave.

Figure 2: MFCC filterbank used by the feature extraction of DeSPerF-based systems.

For each lag , define the two lagged modulation sonograms


starts from the beginning of the segment, and ends at the segment’s conclusion. A lag corresponds to a time-shift of s between the sonograms. Now, define the mean distance between these modulation sonograms at lag


where is the Euclidean norm, and stacks the ordered elements of sequence into a column vector. The sequence is then filtered, , where


and adapting around the end points of (shortening its support to a minimum of two). This sequence approximates the second derivative of . Finally, a SPerF of an audio segment is created by the sigmoid normalisation of :


where is the mean of and is its standard deviation.

The output of the feature extraction is a sequence of SPerFs (9), each element of which is computed from one segment of the recording . In this case, the feature vocabulary is defined . The semantic rule is defined , where is the duration of the recording from in seconds. Together, these define . Table 3.1 summarises the six parameters of the feature extraction and their interpretation, as well as the values used in the system of Fig. 1.

Parameters of the feature extraction algorithm of the DeSPerF-based system of Fig. 1

Parameter | Value | Interpretation
Segment duration | 10 s | Segment duration for computing SPerF from a recording. Limits time over which repetitive musical events can be detected.
Segment hop | 1 s | Hop of each segment along a recording. Successive SPerF become more redundant as the hop shrinks.
Frame duration | 100 ms | Duration of frame in a segment in which MFCCs are computed.
Frame hop | 5 ms | Hop of each frame along a segment. The dimensionality of the segment modulation sonogram becomes larger as the hop shrinks.
Number of MFCCs | 13 | Number of MFCCs to compute for a frame, implemented in Slaney (1998). Limits resolution of spectral structures considered in SPerF computation.
Maximum lag | 800 | Maximum lag to consider. The time lag is the lag index times the frame hop, so 800 lags span 4 s. Limits time over which repeated musical events can be detected.

We can relate characteristics of a SPerF to low-level characteristics of a segment of a recording. For instance, if the segment modulation sonogram is periodic, then at lags that are multiples of its period the mean distance sequence should be small, the filtered sequence should be large and positive (the mean distance is convex around these lags), and the SPerF should be close to one. If the segment is such that the mean distance is approximately constant over its support, then the filtered sequence will be approximately zero, and the SPerF will show no pronounced peaks. This is the case if a recording is silent or is not periodic within the maximum lag of 4 s. If a SPerF is approximately zero at a lag, then the filtered sequence is very negative there, and there is a large distance between lagged modulation sonograms around that lag.
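The chain from modulation sonogram to SPerF can be illustrated with a minimal numerical sketch. Several details are assumptions, not the authors' implementation: the mean distance is taken as the average per-frame Euclidean distance, the second-derivative filter is the three-point difference [1, -2, 1] with edge replication at the end points, and the sigmoid normalisation uses the mean and standard deviation of the filtered sequence. The function name `sperf_sketch` and the impulse-train sonogram are hypothetical.

```python
import numpy as np

def sperf_sketch(sonogram, max_lag):
    """sonogram: array of shape (frames, coefficients).
    Returns (d, d2, sperf); index i of each array corresponds to lag i + 1."""
    n_frames = sonogram.shape[0]
    d = np.empty(max_lag)
    for lag in range(1, max_lag + 1):
        # Distance between the sonogram and a copy of itself shifted by `lag` frames.
        diff = sonogram[:n_frames - lag] - sonogram[lag:]
        d[lag - 1] = np.linalg.norm(diff, axis=1).mean()
    # Three-point second difference; the end points reuse the edge values (assumption).
    padded = np.pad(d, 1, mode="edge")
    d2 = padded[:-2] - 2.0 * padded[1:-1] + padded[2:]
    # Sigmoid normalisation by the mean and standard deviation of d2.
    sperf = 1.0 / (1.0 + np.exp(-(d2 - d2.mean()) / d2.std()))
    return d, d2, sperf

# Toy sonogram: one coefficient carries an impulse every 50 frames.
period = 50
sonogram = np.zeros((400, 13))
sonogram[::period, 0] = 1.0
d, d2, sperf = sperf_sketch(sonogram, max_lag=120)
```

With this toy input, the mean distance vanishes at lags of 50 and 100 frames, the second difference is large and positive there, and the normalised value approaches one, which matches the qualitative behaviour described above. A constant input would make `d2.std()` zero, a case this sketch does not handle.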

Moving to higher-level characteristics, we can see that if the recording has a repeating timbral structure within the segment duration of 10 s, and if these repetitions occur within 4 s, then the SPerF should have peaks around those lags corresponding to the periodicity of those repetitions. The mean difference between lags of successive peaks might then be related to the mean tempo of music in the segment, or at least the periodic repetition of some timbral structure. If periodicities at longer time-scales exist in the SPerF, then these might be relatable to the meter of the music in the segment, or at least a longer time-scale repetition of some timbral structure.

Figure 3: Examples of SPerF (9) extracted from BALLROOM Waltz recording Albums-Chrisanne1-08.

Figure 3 shows several SPerF extracted from recording of BALLROOM. The SPerF shows a short-term periodicity of about s, and about s between each of the first three highest peaks. The tempo of the music in this recording is about 180 beats per minute (BPM), and it has a triple meter. The few SPerF that do not follow the main trend are from the introduction of the recording, during which there are not many strong and regular onsets.

3.2 Classification

From a recording , extracts a sequence of SPerF, , where is a vectorised SPerF (9). The classifier maps to by a cascade of steps. At step , the th SPerF has been transformed iteratively by


where is a real matrix, is a real vector, and



produces a vector of posterior probabilities over

by a softmax output


where is an appropriately sized vector of all ones.

The cascade from to is also known as a deep neural network (DNN), with (12) being interpreted as posterior probabilities over the sample space defined by . If all elements of are the same, then the DNN has no “confidence” in any particular element of given the observation . If all but one element of are zero, then the DNN has the most confidence that points only to a specific element of . Finally, maps the sequence of posterior probabilities to by majority vote, i.e.,


where if is the element of associated with the largest value in , and zero otherwise.
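The cascade and the vote can be summarised in a short sketch. It assumes logistic sigmoid activations in the hidden layers, which is an assumption about the elided activation function, and uses tiny hand-made weights rather than the parameters of the system of Fig. 1. The names `posteriors`, `classify` and the toy labels are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def posteriors(x, weights, biases):
    """Forward pass: sigmoid hidden layers (assumed), then a softmax output."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)
    z = weights[-1] @ h + biases[-1]
    e = np.exp(z - z.max())  # subtracting the maximum leaves the softmax unchanged
    return e / e.sum()

def classify(sperfs, weights, biases, labels):
    """Each SPerF votes for its most probable label; the majority wins."""
    votes = np.zeros(len(labels), dtype=int)
    for x in sperfs:
        votes[np.argmax(posteriors(x, weights, biases))] += 1
    return labels[int(np.argmax(votes))], votes

# Toy network: 2 inputs, 2 hidden units, 3 output labels.
toy_weights = [10.0 * np.eye(2),
               np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])]
toy_biases = [np.zeros(2), np.zeros(3)]
toy_labels = ["A", "B", "C"]
toy_sperfs = [[1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]]
label, votes = classify(toy_sperfs, toy_weights, toy_biases, toy_labels)
```

Ties in the vote are resolved here by `np.argmax`, which returns the first maximum; how the original system resolves ties is not stated in the text.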

The classifier of the system of Fig. 1 has layers, with the matrices and biases being: ; ; ; ; ; ; and finally . The set of parameters are found by using a training dataset and deep learning [Deng and Yu (2014)]. For the system of Fig. 1 this is done by adapting the code produced by Salakhutdinov and Hinton, which trains a deep autoencoder for handwritten digit recognition. This code for DeSPerF is provided by A. Pikrakis.

Interpreting these parameters is not straightforward, save those at the input to the first hidden layer, i.e., the weight matrix and bias vector of the first layer. The weights describe what information of a SPerF is passed to the hidden layers of the DNN. The $k$th element of the affine transformation of a SPerF by these parameters is the input to the $k$th neuron in the first hidden layer. Hence, this neuron receives the product of the $k$th row of the weight matrix with the SPerF. When those vectors point in nearly the same direction, this value will be positive; when they point in nearly opposite directions, this product will be negative; and when they are nearly orthogonal, the product will be close to zero. We might then interpret each row of the weight matrix as being exemplary of some structures in SPerF that the DNN has determined to be important for discriminating among its labels.

Figure 4(a) shows for DeSPerF-BALLROOM the ten rows of the first-layer weight matrix with the largest Euclidean norm. Many of them bear resemblance to the kinds of structures seen in the SPerF in Fig. 3. We can determine the bandwidth of the input to the first hidden layer by looking at the Hann-windowed rows of this matrix in the frequency domain. Figure 4(b) shows the sum of the magnitude responses of each row of the matrix for the system of Fig. 1. We see that the majority of energy of a SPerF transmitted into the hidden layers of its DNN is concentrated at frequencies below 10 Hz.

Figure 4: (a) The rows of the first-layer weight matrix with the largest Euclidean norm from the system of Fig. 1. (b) Combined magnitude response of all Hann windowed weights of the first layer of the DNN of the same system.
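The bandwidth analysis behind Fig. 4(b) can be sketched as follows. The sketch assumes that a SPerF is sampled along the lag axis at 200 Hz, which follows from one lag per 5 ms frame hop; the function name `summed_magnitude_response` and the two sinusoidal "weight rows" are hypothetical stand-ins for the trained weights, which are not reproduced here.

```python
import numpy as np

def summed_magnitude_response(weights, lag_rate_hz):
    """Sum over rows of the magnitude spectra of Hann-windowed weight rows."""
    n_lags = weights.shape[1]
    windowed = weights * np.hanning(n_lags)
    response = np.abs(np.fft.rfft(windowed, axis=1)).sum(axis=0)
    freqs = np.fft.rfftfreq(n_lags, d=1.0 / lag_rate_hz)
    return freqs, response

lag_rate_hz = 200.0                      # assumed: one lag every 5 ms
lags_s = np.arange(800) / lag_rate_hz    # 800 lags spanning 4 s
# Hypothetical weight rows: a weak 2 Hz and a strong 5 Hz modulation.
toy_weights = np.vstack([0.5 * np.sin(2 * np.pi * 2.0 * lags_s),
                         1.0 * np.sin(2 * np.pi * 5.0 * lags_s)])
freqs, response = summed_magnitude_response(toy_weights, lag_rate_hz)
peak_hz = freqs[np.argmax(response)]
low_share = (response[freqs < 10.0] ** 2).sum() / (response ** 2).sum()
```

Under the 200 Hz assumption the frequency axis runs from 0 to 100 Hz, so a modulation at 10 Hz corresponds to a period of 100 ms along the lag axis.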

The magnitude of the product of the $k$th row of the first-layer weight matrix and a SPerF is proportional to the product of their Euclidean norms; and the bias of the $k$th neuron – the $k$th element of the first-layer bias vector – pushes its output (10) to saturation. A large positive bias pushes the output toward its upper saturation value, and a large negative bias pushes it toward its lower one. Figure 5(a) shows the Euclidean norms of all rows of the first-layer weight matrix for the classifier of the system of Fig. 1, sorted by descending norm. Figure 5(b) shows the bias of these neurons in the same order. We immediately see from this that the inputs to almost half of the neurons in the first hidden layer will have energies that are more than 20 dB below the neurons receiving the most energy, and that they also display very small biases. This suggests that about half of the neurons in the first hidden layer might be inconsequential to the system’s behaviour. In fact, when we neutralise the 250 neurons in the first hidden layer of DeSPerF-BALLROOM having the smallest norm weights (by setting to zero the corresponding columns in the weight matrix of the next layer), its FoM is identical to Fig. 1. A possible explanation for this is that the DNN has more parameters than are necessary to map its input to its target.

(a) Norms of rows
(b) Biases
Figure 5: Characteristics of the parameters of the first layer of the DNN in the system of Fig. 1.
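The neutralisation experiment can be sketched as below. The sketch assumes that a neuron of the first hidden layer corresponds to a row of the first-layer weight matrix and to a column of the next layer's weight matrix; the function name `neutralise_weakest` and the small matrices are hypothetical.

```python
import numpy as np

def neutralise_weakest(first_weights, next_weights, n_neutralise):
    """Zero the next-layer columns fed by the first-layer neurons of smallest row norm."""
    norms = np.linalg.norm(first_weights, axis=1)
    weakest = np.argsort(norms)[:n_neutralise]
    pruned = next_weights.copy()
    pruned[:, weakest] = 0.0
    # Row norms relative to the largest, in dB.
    level_db = 20.0 * np.log10(norms / norms.max())
    return pruned, weakest, level_db

# Toy layers: 4 first-layer neurons with 3 inputs each, 2 next-layer neurons.
toy_first = np.array([[3.0, 0.0, 0.0],
                      [0.1, 0.0, 0.0],
                      [0.0, 2.0, 0.0],
                      [0.0, 0.0, 0.01]])
toy_next = np.array([[1.0, 2.0, 3.0, 4.0],
                     [5.0, 6.0, 7.0, 8.0]])
pruned, weakest, level_db = neutralise_weakest(toy_first, toy_next, n_neutralise=2)
```

Re-running the test set through a network with the pruned matrix, and comparing the resulting FoM with the original, is the kind of check described in the paragraph above.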

3.3 Sensitivities and limitations

From our analyses of the components of DeSPerF-based systems, we can infer the sensitivities and limitations of its feature extraction with respect to mapping recordings to features, and of its classification mapping features to labels. All of these limitations naturally restrict the music universe, music recording universe, semantic universe, and success criteria of a use case to which the system can be applied. First, the MFCC filterbank (Fig. 2) means this mapping is independent of any information outside of the frequency band the filterbank covers (centre frequencies from 110 Hz to 4.43 kHz). This could exclude most of the energy of bass drum kicks, cymbal hits and crashes, shakers, and so on. Figure 6 shows for segments of four recordings from BALLROOM that a large amount of energy can exist outside this band. If the information relevant for solving a problem of music content analysis is outside this band, then a DeSPerF-based system may not be successful.

(a) Chacha Media-106105.wav
(b) Jive Media-104116.wav
(c) Quickstep Media-105717.wav
(d) Samba Media-104104.wav
Figure 6: Magnitude spectra of two bars extracted from recordings in BALLROOM. Dashed lines show bandwidth of filterbank in Fig. 2.

Second, since the segment modulation sonograms (4) consist of only the first 13 MFCCs, their bandwidth is restricted to about 0.09 cycles per semitone, with a cepstral analysis resolution of not less than about 11 semitones. (The period of the $k$th DCT function is $128/k$ semitones, so the highest retained function, $k = 12$, has a period of about 10.7 semitones.) Spectral structures smaller than about 11 semitones will thus not be present in a segment modulation sonogram. If the information relevant for solving a problem of music content analysis is contained only in spectral structures smaller than about 11 semitones (e.g., harmonic relationships of partials), then a DeSPerF-based system may not be successful.

Third, the computation of the mean distance between lagged modulation sonograms (7) destroys the quefrency information in the modulation sonograms. In other words, there exist numerous modulation sonograms (4) that will produce the same mean distance sequence (7). This implies that SPerF (9) are to a large extent invariant to timbre and pitch, and thus DeSPerF-based systems should not be sensitive to timbre and pitch, as long as the “important information” remains in the frequency band of the filterbank mentioned above. This again restricts the kinds of problems of music content analysis for which a DeSPerF-based system could be successful.

Fourth, since the mean distance between lagged modulation sonograms (7) is uniformly sampled at a rate of 200 Hz (one lag per 5 ms frame hop), the frequency of repetition that can be represented in a SPerF (9) is limited to the bandwidth from 0 to 100 Hz. Furthermore, all repetitions at higher frequencies will be aliased to that band. From our analysis of the front end of the DNN, we see from Fig. 4(b) that DeSPerF-BALLROOM is most sensitive to modulations in SPerF below 10 Hz. In fact, the FoM of DeSPerF-BALLROOM in Fig. 1 does not change when we filter all input SPerF with a zero-phase lowpass filter having a -3 dB frequency of 10.3 Hz. This implies that DeSPerF-BALLROOM is not sensitive to SPerF modulations above 10 Hz, which entails periods in SPerF of 100 ms or more. Hence, DeSPerF-BALLROOM may have little sensitivity to periodicities in SPerF that are shorter than 100 ms.

Finally, since each segment of a recording is of duration 10 s for the system of Fig. 1, a SPerF can only contain events repeating within that duration. Since the largest lag considered is 4 s, this limits the duration of the periodic structures a SPerF can capture. For instance, if a periodic pattern of interest is of duration of one bar of music, then a SPerF may only describe it if it repeats at least twice within 4 s. For two consecutive repetitions, this implies that the tempo must be greater than 120 BPM for a 4/4 time signature, 90 BPM for 3/4, and 180 BPM for 6/8. If a repeated rhythm occurs over two bars, then a SPerF may only contain it if at least four bars occur within 4 s, or as long as the tempo is greater than 240 BPM for a 4/4 time signature, 180 BPM for 3/4, and 360 BPM for 6/8.
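The tempo bounds quoted above follow from simple arithmetic, sketched here. The function name `min_tempo_bpm` and its parameters are hypothetical; the sketch counts six beats per bar for 6/8, as the figures in the text imply, and returns the bound that the tempo must exceed.

```python
def min_tempo_bpm(beats_per_bar, bars_per_pattern=1, repetitions=2, max_lag_s=4.0):
    """Tempo at which `repetitions` of the pattern exactly fill `max_lag_s` seconds."""
    beats_needed = beats_per_bar * bars_per_pattern * repetitions
    return 60.0 * beats_needed / max_lag_s

# One-bar patterns repeating twice within the 4 s maximum lag.
one_bar = {meter: min_tempo_bpm(beats) for meter, beats in
           [("4/4", 4), ("3/4", 3), ("6/8", 6)]}
# Two-bar patterns repeating twice, i.e., four bars within 4 s.
two_bar = {meter: min_tempo_bpm(beats, bars_per_pattern=2) for meter, beats in
           [("4/4", 4), ("3/4", 3), ("6/8", 6)]}
```

Changing `max_lag_s` shows how directly these bounds depend on the maximum lag parameter of the feature extraction.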

3.4 Conclusion

We have now dissected the system in Fig. 1. We know that the DeSPerF-based systems are sensitive to temporal events that repeat within a specific frequency band and particular time window. This limits what DeSPerF-BALLROOM can be using to produce the FoM in Fig. 1. For instance, because of its lack of spectral resolution, it cannot be using melodies or harmonies to recognise elements of its semantic universe. Because it marginalises the quefrency information, it cannot be discriminating based on instrumentation. It seems that the only knowledge a DeSPerF-based system can be using must be temporal in nature within a 10-second window. Before we can go further, we must develop an understanding of how DeSPerF-BALLROOM was trained and tested, and thus what Fig. 1 might mean. In the next section, we analyse the teaching and testing materials used to produce DeSPerF-BALLROOM, and its FoM in Fig. 1.

4 The Materials of Teaching and Testing

What is in the benchmark dataset BALLROOM? What problem does it pose? What is the task to reproduce its ground truth? What is the goal or hope of training a music content analysis system with it? We now analyse the BALLROOM dataset used to train and test DeSPerF-BALLROOM, and how it has been used to teach and test other music content analysis systems.

4.1 The contents and use of the Ballroom dataset

The dataset BALLROOM consists of 698 half-minute music audio recordings downloaded in Real Audio format around 2004 from an on-line resource about Standard and Latin ballroom dancing Dixon et al. (2004). Each excerpt comes from the “beginning” of a music track, presumably ripped from a CD by an expert involved with the website. Dixon et al. (2004) call the labels of the dataset both “style” and “genre,” and allude to each excerpt being “reliably labelled” in one of eight ways. Table 4.1 shows the distribution of the number of excerpts over the labels of BALLROOM (we combine excerpts labelled “Viennese Waltz” and “Waltz” into one), as well as the 70/30 distribution of recordings we used for training and testing DeSPerF-BALLROOM in Fig. 1.

BALLROOM train/test partition used to train and test DeSPerF-BALLROOM in Fig. 1

Label        Train No. (%)    Test No. (%)     Total No. (%)
ChaCha        78 (15.85)       33 (16.02)      111 (15.90)
Jive          42  (8.54)       18  (8.74)       60  (8.60)
Quickstep     58 (11.79)       24 (11.65)       82 (11.75)
Rumba         69 (14.02)       29 (14.08)       98 (14.04)
Samba         61 (12.40)       25 (12.14)       86 (12.32)
Tango         61 (12.40)       25 (12.14)       86 (12.32)
Waltz        123 (25.00)       52 (25.24)      175 (25.07)
Total        492 (70.49)      206 (29.51)      698 (100)

All appearances of BALLROOM Dixon et al. (2004) in published classification experiments, along with the highest reported accuracy (normalised or not).

Reference                                        Highest Acc. Reported (%)
Dixon et al. (2004)                              96
Gouyon et al. (2004)                             90.1
ISMIR2004                                        82
Gouyon and Dixon (2004), Gouyon (2005)           82.1
Lidy and Rauber (2005)                           84.24
Peeters (2005)                                   90.4
Flexer et al. (2006)                             66.9
Lidy (2006)                                      82
Lidy et al. (2007)                               90.4
Lidy and Rauber (2008)                           90.0
Holzapfel and Stylianou (2008)                   85.5
Holzapfel and Stylianou (2009)                   86.9
Pohle et al. (2009)                              89.2
Lidy et al. (2010)                               87.97
Mayer et al. (2010)                              88
Seyerlehner (2010), Seyerlehner et al. (2010)    90
Peeters (2011)                                   96.1
Tsunoo et al. (2011)                             77.2
Schindler and Rauber (2012)                      67.3
Pikrakis (2013)                                  85
Sturm et al. (2014)                              88.8

Thus far, BALLROOM has appeared in the evaluations of at least 24 conference papers, journal articles, and PhD dissertations Dixon et al. (2004); Flexer et al. (2006); Gouyon et al. (2004); Gouyon and Dixon (2004); Gouyon (2005); Holzapfel and Stylianou (2008, 2009); Lidy and Rauber (2005); Lidy (2006); Lidy et al. (2007); Lidy and Rauber (2008); Lidy et al. (2010); Mayer et al. (2010); Peeters (2005, 2011); Pikrakis (2013); Pohle et al. (2009); Schindler and Rauber (2012); Schlüter and Osendorfer (2011); Schnitzer et al. (2011, 2012); Seyerlehner (2010); Seyerlehner et al. (2010); Seyerlehner et al. (2012); Tsunoo et al. (2011). Twenty of these works use it in the experimental design Classify Sturm (2014c), which is the comparison of ground truth to the output of a music content analysis system. Table 4.1 shows the highest accuracies reported in the publications using BALLROOM this way. Four others Schlüter and Osendorfer (2011); Schnitzer et al. (2011, 2012); Seyerlehner et al. (2012) use BALLROOM in the experimental design Retrieve Sturm (2014c), which is the task of retrieving music signals from the training set given a query. The dataset was also used for the Rhythm Classification Train-test Task of ISMIR2004, and so sometimes appears as ISMIRrhythm.

4.2 Some tasks posed by the Ballroom dataset

Dixon et al. (2004) and Gouyon et al. (2004) pose one task of BALLROOM as to extract and learn “repetitive rhythmic patterns” from recorded music audio indicating the correct label. Motivating their work and the creation of the dataset, Dixon et al. (2004) propose the hypothesis: “rhythmic patterns are not randomly distributed amongst musical genres, but rather they are indicative of a genre.” While “rhythm” is an extraordinarily difficult thing to define Gouyon (2005), examples illuminate what Dixon et al. (2004) and Gouyon et al. (2004) intend. For instance, they give one “rhythmic pattern” typical of Cha cha and Rumba as one bar of three crotchets followed by two quavers. Auditioning the Cha cha recordings reveals that this pattern does appear but that it can be quite difficult to hear through the instrumentation. In fact, this pattern is also apparent in many of the Tango recordings (notated in Fig. 7(a)). We find that major differences between recordings of the two labels are instrumentation, the use of accents, and syncopated accompaniment. It should be noted that much of the “rhythmic information” in excerpts of several labels of BALLROOM is contributed by instruments other than percussion, such as the piano and guitar in Cha cha, Rumba, Jive, Quickstep, and Tango; brass sections, woodwinds and electric guitar in Jive and Quickstep; and vocals and orchestra in Waltz.

(a) Tango excerpt Albums-Ballroom Classics4-07 and Media-105705
(b) Samba excerpt Albums-Latin Jam-06 and Media-103901
(c) Cha cha excerpts Albums-Latin Jam3-02 and Albums-Latin Jam4-06
(d) Rumba excerpt Media-105614
(e) Jive excerpt Albums-Fire-12
(f) Quickstep excerpt Albums-AnaBelen Veneo-11
Figure 7: Some of the characteristic patterns found in excerpts of BALLROOM.

Figure 7 shows examples of the rhythmic patterns appearing in BALLROOM. By “rhythmic pattern” we mean a combination of metrical structure, and relative timing and accents in a combination of voices. Many Cha cha recordings feature a two bar pattern with a strong cowbell on every beat, a guiro on one and three, and syncopated piano and/or brass with notes held over the bars (notated in Fig. 7(c)). On the other hand, Rumba recordings sound much slower and sparser than those of Cha cha, often featuring only guitar, clave, conga, shakers, and the occasional chime glissando (Fig. 7(d)). Rhythmic patterns heard in Jive and Quickstep recordings involve swung notes, notated squarely in Fig. 7(e) and Fig. 7(f). We find no Waltz recordings to have duple or quadruple meter.

Even though this dataset was explicitly created for the task of learning “repetitive rhythmic patterns,” it actually poses other tasks. In fact, a music content analysis system need not know one thing about rhythm to reproduce the ground truth in BALLROOM. One such task is the identification of instruments. For instance, bandoneon only appears in Tango recordings. Jive and Quickstep recordings often feature toms and brass, but the latter also has woodwinds. Rumba and Waltz recordings feature string orchestra, but the former also has chimes and conga. Cha cha recordings often have piano, along with guiro and cowbell. Finally, Samba recordings feature instruments that do not occur in any other recordings, such as pandeiro, repinique, whistles, and cuica. Hence, a system completely naive to rhythm could reproduce the ground truth of BALLROOM just by recognising instruments. This clearly solves a completely different problem from that posed by Dixon et al. (2004) and Gouyon et al. (2004). It is aligned more with the task posed by Lidy and Rauber (2008): “to extract suitable features from a benchmark music collection and to classify the pieces of music into a given list of genres.”

Figure 8: Distribution of the tempi of recordings (dots) in BALLROOM, assembled from onset data of Krebs et al. (2013). For each label: red solid line is median tempo; red dotted lines are half and double median tempo; upper and lower blue lines are official tempos for acceptable dance competition music by the World Sport Dance Federation (2014) (see Table 4.2); black dots are recordings in training dataset used to build DeSPerF-BALLROOM, and grey dots are recordings in the test dataset to compute its FoM in Fig. 1.

There exists yet another way to reproduce the ground truth of BALLROOM. Figure 8 shows the distribution of tempi. We immediately see a strong correlation between tempo and label. This was also noted by Gouyon et al. (2004). To illustrate the strength of this relationship, we construct a music content analysis system using simple nearest neighbour classification Hastie et al. (2009) with tempo alone. Figure 9(a) shows the FoM of this system using the same training and testing partition of BALLROOM as in Fig. 1. Clearly, this system produces a significant amount of ground truth, but suffers from a confusion predictable from Fig. 8 – which curiously does not appear in Fig. 1. If we modify annotated tempi by the following factors: Cha cha ; Jive ; Quickstep ; Rumba ; Samba ; Tango ; and Waltz (keeping Viennese Waltz the same), then the new system produces the FoM in Figure 9(b). Hence, “teaching” the system to “tap its foot” half as fast for some labels, and twice as fast for others, ends up reproducing a similar amount of ground truth to DeSPerF-BALLROOM in Fig. 1.
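
As a minimal sketch of the idea, single nearest neighbour classification on tempo alone can be written in a few lines. The tempi, labels and the multiplier in the example are made-up values chosen only to illustrate the mechanism. They are not the annotated tempi of BALLROOM, nor the factors used to produce Fig. 9(b). The helper names are hypothetical.

```python
def scale_tempi(examples, multipliers):
    """Multiply each annotated tempo by the factor assigned to its label."""
    return [(tempo * multipliers.get(label, 1.0), label)
            for tempo, label in examples]


def nearest_neighbour_label(tempo, training):
    """Label of the training example whose tempo is closest to `tempo`."""
    return min(training, key=lambda example: abs(example[0] - tempo))[1]


def accuracy(training, testing):
    """Fraction of test examples whose label is reproduced."""
    hits = sum(nearest_neighbour_label(tempo, training) == label
               for tempo, label in testing)
    return hits / len(testing)


# Illustrative (tempo in BPM, label) pairs; NOT taken from BALLROOM.
train = [(100, "Rumba"), (106, "Rumba"), (102, "Samba"),
         (104, "Samba"), (128, "Tango")]
test = [(100.4, "Samba"), (105.7, "Rumba"), (103.2, "Rumba"), (130, "Tango")]

# Hypothetical multiplier: "tap the foot" half as fast for one label.
factors = {"Samba": 0.5}

print(accuracy(train, test))
print(accuracy(scale_tempi(train, factors), scale_tempi(test, factors)))
```

In this toy example the overlapping Rumba and Samba tempi are separated once one label is halved, which is the kind of effect the multipliers have on the confusions in Fig. 9(a).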

(a) Annotated Tempo
(b) Annotated Tempo with Multipliers
Figure 9: FoM of single nearest neighbour classifiers using just tempo for classification of excerpts in BALLROOM. Interpretation as in Fig. 1.

While such a foot-tapping system can reproduce the labels of BALLROOM, the particular problem it is actually solving is not aligned with that of detecting “repetitive rhythmic patterns” Dixon et al. (2004); Gouyon et al. (2004). The system of Fig. 9 is also not solving the problem posed by Lidy and Rauber (2008) as long as “genre” is not so strongly characterised by tempo. Of course, there are official tempos set by the World Sport Dance Federation (2014) for music to be acceptable for dance competitions (see Fig. 8 and Table 4.2), but arguably these rules are created to balance skill and competition difficulty, and are not derived from surveys of musical practice, and certainly are not prescriptions for the composition and performance of music in these styles. In fact, Fig. 8 shows several BALLROOM recordings do not satisfy these criteria.

Ballroom dance music tempo regulations of the World Sport Dance Federation (2014).

Dance Style       Tempo regulation            Scale factor
                  bars/min (beats/min)        from mean tempo
Cha-Cha-Cha       30 - 32 (120 - 128)         0.969 - 1.033
Jive              42 - 44 (168 - 176)         0.977 - 1.024
Quickstep         50 - 52 (200 - 208)         0.981 - 1.020
Rumba             25 - 27 (100 - 108)         0.963 - 1.040
Samba             50 - 52 (100 - 104)         0.981 - 1.020
Tango             31 - 33 (124 - 132)         0.970 - 1.032
Viennese Waltz    58 - 60 (174 - 180)         0.983 - 1.017
Waltz             28 - 30 (84 - 90)           0.967 - 1.036

Reproducing the ground truth of BALLROOM by performing any of the tasks above – discrimination by “rhythmic patterns,” instrumentation, and/or tempo – clearly involves using high level acoustic and musical characteristics. However, there are yet other tasks that a system might be performing to reproduce the ground truth of BALLROOM, and ones with no clear relationship to music listening. For instance, if we use single nearest neighbour classification with features composed of only the variance and mean of a SPerF, and the number of times it passes through , then with majority voting this system obtains a classification accuracy of over – far above that expected by random classification. It is not clear what task this system is performing, and how it relates to high-level acoustic and musical characteristics. Hence, this fourth approach to reproducing the ground truth of BALLROOM solves an entirely different problem from the previous three: “to classify the music documents into a predetermined list of classes” Lidy and Rauber (2005), i.e., by any means possible.

4.3 Conclusion

Though the explicit and intended task of BALLROOM is to recognise and discriminate between rhythmic patterns, we see that there actually exist many other tasks a system could be performing in reproducing the ground truth. The common experimental approach in music content analysis research, i.e., that used to produce the FoM in Fig. 1, has no capacity to distinguish between any of them. Just as in the case for the demonstrations of Clever Hans, were a music content analysis system actually recognising characteristic rhythms of some of the labels of BALLROOM, its FoM might pale in comparison to that of a system with no idea at all about rhythm (Fig. 9). Figure 1 gives no evidence at all for claims that DeSPerF-BALLROOM is identifying waltz by recognising its characteristic rhythmic patterns, tempo, instrumentation, and/or any other factor. From our analysis of DeSPerF-based systems, however, we can rule out instrument recognition since such knowledge is outside its purview. Nonetheless, what exact task DeSPerF-BALLROOM is performing, the cause of Fig. 1, remains to be seen. The experiments in the next section shed light on this.

5 Seeking the “Horse” Inside the Music Content Analysis System

It is obvious that DeSPerF-BALLROOM knows something about the recordings in BALLROOM; otherwise its FoM in Fig. 1 would not be so significantly different from chance. As discussed in the previous section, this might be due to the system performing any of a number of tasks, whether by identifying rhythms, detecting tempo, or using the distributions of statistics with completely obscured relationships to music content. In this section, we describe several experiments designed to explain Fig. 1.

5.1 Experiment 1: The nature of the cues

We first seek the nature of the cues used by DeSPerF-BALLROOM to reproduce the ground truth. We watch how its behaviour changes when we modify the input along two orthogonal dimensions: frequency and time. We transform recordings of the test dataset by pitch-preserving time stretching, and time-preserving pitch shifting. (We use the rubberband library to achieve these transformations with minimal change in recording quality, and have auditioned several of the transformations to confirm.) We seek the minimum scalings to make the system obtain a perfect classification accuracy, or one consistent with random classification (14.3%). To “inflate” the FoM, we take each test recording for which DeSPerF-BALLROOM is incorrect and transform it using a scale that increments by until the system is no longer incorrect. To “deflate” the FoM, we take each test recording for which DeSPerF-BALLROOM is correct and transform it using a scale that increments by until it is no longer correct. A pitch-preserving time stretching of scale increases the recording duration by , or decreases the tempo of the music in the recording (if it has a tempo) by . A time-preserving pitch shifting of scale increases all pitches in a recording by .
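
The inflation and deflation procedures can be sketched as a simple search over increasing scale changes. This is only an illustration under stated assumptions: `classify` and `transform` are hypothetical stand-ins for DeSPerF-BALLROOM and for the time stretching or pitch shifting step, the step size and maximum change are left as parameters, and trying both directions at each step is an assumption rather than a detail confirmed by the text.

```python
from typing import Callable, Optional


def minimal_scale(recording, true_label: str,
                  classify: Callable[[object], str],
                  transform: Callable[[object, float], object],
                  step: float, max_change: float,
                  want_correct: bool) -> Optional[float]:
    """Smallest scale change that makes the system correct (inflation,
    want_correct=True) or incorrect (deflation, want_correct=False).
    Returns None if no scale within `max_change` achieves this."""
    k = 1
    while k * step <= max_change + 1e-12:
        # Assumption: try lengthening and shortening at each increment.
        for scale in (1.0 + k * step, 1.0 - k * step):
            is_correct = classify(transform(recording, scale)) == true_label
            if is_correct == want_correct:
                return scale
        k += 1
    return None
```

As a toy usage example, a "recording" can be represented by its tempo alone, with time stretching by a scale dividing the tempo by that scale, and a made-up classifier that answers "Tango" at or above 125 BPM. A correctly classified 124 BPM Cha cha is then deflated by the first step that pushes it over the boundary.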

(a) Time-preserving frequency scaling
(b) Frequency-preserving time scaling
Figure 10: Changes to F-score of each label of BALLROOM as a function of the scaling of a transformation. Solid lines: deflation procedure. Dashed lines: inflation procedure. Note the difference in scales on the x-axis.

Figure 10 shows the results. As expected from our analysis in Section 3.3, time-preserving pitch shifting of the test recordings has little effect on the FoM, even up to changes of . In stark contrast is the effect of pitch-preserving time stretching, where the F-score of DeSPerF-BALLROOM in each label quickly decays for scales of at most . That scale is equivalent to lengthening or shortening a 30 s recording by only 1.5 s. Figure 11 shows the new tempi of the test recordings after these procedures, i.e., when the normalised classification accuracy is either perfect or no better than random. We see in most cases that the tempo changes are very small. The tempi of the 16 test recordings initially misclassified move toward the median tempo of each class. Figure 11(b) shows that the opposite occurs in deflation for the 190 test recordings initially classified correctly.

(a) Inflation
(b) Deflation
Figure 11: Vertical lines point from original tempo of a BALLROOM test recording (grey dot) to its tempo after transformation by pitch-preserving time stretching of at most %. Interpretation as in Fig. 10.

The effects of these transformations clearly show that the nature of the cues DeSPerF-BALLROOM uses to reproduce ground truth is temporal, and that its performance is completely disrupted by minor changes in music tempo. The mean tempo change of the 12 BALLROOM Cha cha excerpts in Fig. 11(b) is an increase of BPM, which situates all of them on the cusp of the Cha cha cha competition dance tempo regulation (Table 4.2). Most of these transformed recordings are then classified by the system as Tango. In light of this, it is problematic to claim, e.g., DeSPerF-BALLROOM has such a high precision in identifying Cha cha (Fig. 1) because its internal model of Cha cha embodies “typical rhythmic patterns” of cha cha. Something else is at play.

5.2 Experiment 2: System dependence on the rate of onsets

The results of the previous experiment suggest that if the internal models of DeSPerF-BALLROOM have anything to do with rhythmic patterns, they are such that minor changes to tempo produce major confusion. We cannot say that the specific temporal cue used by DeSPerF-BALLROOM is tempo – however that is defined – alone or in combination with other characteristics, such as accent and meter. Indeed, comparing Fig. 1 with Fig. 9 motivates the hypothesis that DeSPerF-BALLROOM is using tempo, but reduces confusions by halving or doubling tempo based on something else. In this experiment, we investigate the inclinations of DeSPerF-BALLROOM to classify synthetic recordings exhibiting unambiguous onset rates. We synthesise each recording in the following manner. We generate one realisation of a white noise burst with duration 68 ms, windowed by half of a Hann window (attack and smooth decay). The burst has a bandwidth covering the bandwidth of the filterbank in DeSPerF-BALLROOM (Section 3.1). We synthesise a recording by repeating the same burst (no change in its amplitude) at a regular periodic interval (reciprocal of onset rate), and finally add white Gaussian noise at an SNR of 60 dB (to avoid producing features that are not numbers). We create 200 recordings in total, with onset rates logarithmically spaced from 50 to 260 onsets per minute. Finally, we record the output of the system for each recording, as well as the mean DNN output posterior (12) over all segments.
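
The synthesis steps above can be sketched with the Python standard library alone. Several details are assumptions made for illustration and are not given in the text: the sampling rate, the recording length, the use of full-band Gaussian white noise for the burst, and the choice of the decaying half of the Hann window. The function names are hypothetical.

```python
import math
import random


def onset_rates(n=200, lo=50.0, hi=260.0):
    """n onset rates (onsets per minute), logarithmically spaced from lo to hi."""
    ratio = (hi / lo) ** (1.0 / (n - 1))
    return [lo * ratio ** i for i in range(n)]


def make_burst(sample_rate, duration_s=0.068, rng=random):
    """White noise burst shaped by the decaying half of a Hann window."""
    n = int(round(duration_s * sample_rate))
    # Decaying half of a Hann window: starts at 1, falls smoothly towards 0.
    return [rng.gauss(0.0, 1.0) * 0.5 * (1.0 + math.cos(math.pi * i / n))
            for i in range(n)]


def synthesise(onsets_per_min, sample_rate=8000, length_s=10.0,
               snr_db=60.0, seed=0):
    """Repeat one burst at a regular interval, then add white Gaussian noise."""
    rng = random.Random(seed)
    burst = make_burst(sample_rate, rng=rng)
    signal = [0.0] * int(length_s * sample_rate)
    period = 60.0 / onsets_per_min * sample_rate  # samples between onsets
    k = 0
    while int(round(k * period)) < len(signal):
        start = int(round(k * period))
        for i, value in enumerate(burst):
            if start + i < len(signal):
                signal[start + i] += value  # same burst, same amplitude
        k += 1
    # Scale the noise so that signal power / noise power matches snr_db.
    power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(power / 10.0 ** (snr_db / 10.0))
    return [x + rng.gauss(0.0, noise_std) for x in signal]
```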

(a) Output of DeSPerF-BALLROOM for generated recordings exhibiting different onset rates
(b) Estimated conditional probability distribution of onset rate/tempo conditioned on classification/label
Figure 12: Results from testing DeSPerF-BALLROOM using synthetic recordings having different onset rates. (a) A black circle is a recording with an onset rate (y-axis), classified by DeSPerF-BALLROOM with mean posterior (legend). We plot the halves and doubles of the onsets as well as grey circles of the same size. (b) Parzen window estimate of probability distributions of onset rate conditioned on system output (black), and tempo of training excerpts conditioned on label (grey) with halving and doubling.

Figure 12 shows the results of this experiment. Each black circle in Fig. 12(a) represents a recording with some onset rate (y-axis), classified by the system in some way (grouped in classes and ordered by increasing onset rate) with a mean posterior (size of circle). Figure 12(b) shows estimates of the conditional distributions of onset rate given the classification by using Parzen windowing with the posteriors as weights. We also show the estimate of the conditional distribution of tempo given the BALLROOM label from the training data, and include a halving and doubling of tempo (grey). We can clearly see ranges of onset rates to which the system responds confidently in its mapping. Comparing the two conditional distributions, we see some that align very well. All octaves of the tempo of Jive, Quickstep and Tango overlap the ranges of onsets that are confidently so classified by DeSPerF-BALLROOM. For Samba, however, only the distribution of half the tempo overlaps the Samba-classified synthetic recordings at low onset rates; for Cha cha and Rumba, it is the distributions of double the tempo that overlap the Cha cha- or Rumba-classified synthetic recordings at high onset rates. These are some of the tempo multiples used to produce the FoM in Fig. 9(b) by single nearest neighbour classification. These results point to the hypothesis that DeSPerF-BALLROOM is using a cue to “hear” an input recording at a “tempo” that best separates it from the other labels. Of interest is whether that cue has to do with meter and/or rhythm, and how the system’s internal models reflect high level attributes of the styles in BALLROOM. We explore these in the next three experiments.
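
A weighted Parzen window estimate of the kind described for Fig. 12(b) can be sketched as follows. The Gaussian kernel and the bandwidth are assumptions, since the text does not state which kernel or width was used. The function name is hypothetical, and the data in the example are made up.

```python
import math


def parzen_density(points, weights, grid, bandwidth):
    """Weighted Gaussian Parzen window estimate evaluated at each grid value.

    `points` are onset rates assigned to one class, and `weights` are the
    corresponding mean posteriors."""
    norm = sum(weights) * bandwidth * math.sqrt(2.0 * math.pi)
    return [
        sum(w * math.exp(-0.5 * ((g - p) / bandwidth) ** 2)
            for p, w in zip(points, weights)) / norm
        for g in grid
    ]


# Made-up example: two recordings given the same class, with different confidence.
grid = [0.5 * i for i in range(601)]  # 0 to 300 onsets per minute
density = parzen_density([100.0, 120.0], [0.9, 0.5], grid, bandwidth=5.0)
```

Weighting by the posterior means that a confidently classified recording contributes more mass to the estimate than one classified with low confidence.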

5.3 Experiment 3: System output dependence on the rate of onsets and periodic stresses

In this experiment, we watch how the system’s behaviour changes when the input exhibits repeating structures that have a period encompassing several onsets. We perform this experiment in the same manner as the previous one. We synthesise each recording in the same way, but stress every second, third or fourth repetition of the white noise burst. We create a stress in two different ways. In the first, each stressed onset has an amplitude four times that of an unstressed onset. In the second, all unstressed onsets are produced by a highpass filtering of the white noise burst (passband frequency 1 kHz). We create 200 recordings in total for each of the stress periods, and each kind of stress, with onset rates logarithmically spaced from 50 to 260 onsets per minute. Finally, we record the output of the system for each recording, as well as the mean DNN output posterior (12) for all segments.
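
The first kind of stress amounts to a per-onset gain pattern applied to the repeated burst, sketched below. Whether the very first onset is a stressed one is an assumption, and the function name is hypothetical.

```python
def onset_gains(n_onsets, stress_period, stress_gain=4.0):
    """Amplitude gain for each onset: every `stress_period`-th onset is
    stressed (starting with the first, by assumption); the rest are unstressed."""
    return [stress_gain if i % stress_period == 0 else 1.0
            for i in range(n_onsets)]


print(onset_gains(6, 3))
```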

(a) System output for generated recordings exhibiting different onset rates and stress periods
(b) Estimated conditional probability distribution of onset rate/tempo conditioned on classification/label and stress period
Figure 13: Results from testing DeSPerF-BALLROOM using recordings generated with different onset rates and stress periods (legend). Compare with Fig. 12. Horizontal dashed lines signify changes in class across stress period. As in Fig. 12(a), we show halving and doubling of onset rates.

Figure 13 shows results quite similar to the previous experiment. The results of both stress kinds are nearly the same, so we show only one of them. The dashed horizontal lines in Fig. 13(a) show some classifications of recordings with the same onset rate are different across the stress periods we test. Figure 13(b) shows the appearance of density in the conditional probability distribution of the onset rate in Waltz around the tempo distribution observed in the training dataset of label Waltz (80-90 BPM), which is not apparent in Fig. 12(b). Could these changes be due to the system preferring Waltz for recordings exhibiting a stress period of 3? Figure 14 shows this to not be the case. We see no clear indication that DeSPerF-BALLROOM favours particular classes for each stress period independent of the onset rate for the different kinds of stresses. For instance, we see no strong inclination of DeSPerF-BALLROOM to classify recordings with a stress period of 3 as Waltz. Most classifications are the same across the stress periods.

(a) Stress by amplitude
(b) Stress by amplitude and bandwidth
Figure 14: Dependency of system output (y-axis) on stress period for two different kinds of stresses across all onset rates tested. The weight of a line shows the proportion observed of a specific transition in classification for recordings generated with the same onset rate. The transition pattern observed most in both cases (24 times) is Cha cha (stress period 2), Cha cha (3), Cha cha (4).

5.4 Experiment 4: Manipulation of the tempo

The previous experiments clearly show the inclination of DeSPerF-BALLROOM to classify in confident ways recordings exhibiting specific onset rates independent of repeated structures of longer periods. This leads to the prediction that any input recording can be time-stretched to elicit any desired response from the system, e.g., we can make the system choose “Tango” by time stretching any input recording to have a tempo of 130 BPM. To test this prediction, we first observe how the system output changes when we apply frequency-preserving time stretching to the entire BALLROOM test dataset with scales from to , incrementing by steps of size . For a recording with a tempo of 120 bpm, a scaling of amounts to a change of bpm. We then search for tempi where DeSPerF-BALLROOM classifies all test recordings the same way.

Figure 15: The percentage of the BALLROOM test dataset classified by the system in a number of different ways (numbered) as a function of the maximum scale of frequency-preserving time stretching. For example, with scalings in , half of all test recordings are classified 3 different ways.

Figure 15 shows the percentage of the test dataset classified in a number of different ways as a function of the amount of frequency-preserving time stretching. With scalings between , DeSPerF-BALLROOM classifies about 80% of the test dataset with 3-6 different classes. With scalings between , it classifies 90% of the test recordings into 3-7 different classes. Figure 16 shows the confusion table of DeSPerF-BALLROOM tested with all time-stretched test recordings. We see most Waltz recordings (66%) are classified as Waltz; however, the majority of recordings of all other labels are classified other ways. In the case of the Rumba recordings, DeSPerF-BALLROOM classifies over 20% of them as Waltz when time stretched by at most a scale of . This entails reducing their median tempo from 100 BPM (Fig. 8) to 87, and increasing it up to 117 BPM.

Figure 16: As in Fig. 1, but for all test recordings time-stretched with 32 scales in . For instance, about 47% of all Cha cha recordings time stretched by 32 scales in are classified as Cha cha, but about 6.5% of them are classified as Waltz.

We do not find tempi at which the system outputs the same specific class for all test recordings. However, we do see the following outcomes, in order of increasing tempo:

  1. DeSPerF-BALLROOM chooses Rumba for all Cha cha, Rumba, and Tango recordings time stretched to have a tempo in the range BPM;

  2. DeSPerF-BALLROOM chooses Tango for all Cha cha, Jive and Tango recordings time stretched to have a tempo in the range BPM;

  3. DeSPerF-BALLROOM chooses Waltz for all Cha cha and Rumba recordings time stretched to have a tempo in the range BPM;

  4. DeSPerF-BALLROOM chooses Samba for all Cha cha and Jive recordings time stretched to have a tempo in the range BPM;

  5. DeSPerF-BALLROOM chooses Waltz for all Cha cha and Tango recordings time stretched to have a tempo in the range BPM;

  6. DeSPerF-BALLROOM chooses Cha cha for all Jive and Quickstep recordings time stretched to have a tempo in the range BPM.

Clear from this is that all Cha cha test recordings can be classified by DeSPerF-BALLROOM as Rumba, Samba, Tango or Waltz simply by changing their tempo to be in specific ranges. This is strong evidence against the claim that the very high precision of DeSPerF-BALLROOM in Cha cha (Fig. 1) is caused by its ability to recognise rhythmic patterns characteristic of Cha cha.

5.5 Experiment 5: Hiring the system to compose

The previous experiments have shown the strong reliance of DeSPerF-BALLROOM upon cues of a temporal nature, its inclinations toward choosing particular classes for recordings exhibiting different onset rates (one basic form of tempo), the seeming class-irrelevance of larger scale stress periods (one basic form of meter), and how it can be made to choose four other classes for any Cha cha test recording simply by changing only its tempo. It is becoming more apparent that, though its FoM in Fig. 1 is excellent, we do not expect DeSPerF-BALLROOM to be of any use for identifying whether the music in any recording has a particular rhythmic pattern that exists in BALLROOM – unless one defines “rhythmic pattern” in a very limited way, or claims the labels of BALLROOM are not what they seem, e.g., “Samba” actually means “any music having a tempo of 100-104 BPM.”

We now consider whether DeSPerF-BALLROOM is able to help compose rhythmic patterns characteristic of the labels in BALLROOM. We address this in the following way. We randomly produce a large number of rhythmic patterns, and synthesise recordings from them using real audio samples of instruments typical to recordings in BALLROOM. More specifically, for each of four voices, we generate a low-level beat structure by sampling a Bernoulli random variable four times for each beat in each measure (semiquaver resolution). The parameter of the Bernoulli random variable for an onset is , where a is an onset. Each onset is either stressed or unstressed with equal probability. We select a tempo sampled from a uniform distribution over a specific range, then synthesise repetitions of the two measures in each voice to make a recording of 15 s. Finally, we select as most class-representative those recordings for which the classification of DeSPerF-BALLROOM is the most confident (12), and inspect how the results exemplify rhythms in BALLROOM. This is of course a brute force approach. We could use more sophisticated approaches to generate compositions, such as Markov chains, e.g., Pachet (2003); Thomas et al. (2013); but the aim of this experiment is not to produce interesting music, but to see whether the models of DeSPerF-BALLROOM can confidently detect rhythmic patterns characteristic to BALLROOM.
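
The random pattern generation described above can be sketched as follows. The onset probability `onset_prob` is a placeholder assumption, because the Bernoulli parameter is not recoverable from the text here. The function name and the representation of a pattern as lists of slots are likewise illustrative.

```python
import random


def random_pattern(n_voices=4, n_measures=2, beats_per_measure=4,
                   onset_prob=0.25, rng=None):
    """For each voice, a list of semiquaver slots holding None (silence),
    'stressed' or 'unstressed'. `onset_prob` is an assumed value."""
    rng = rng or random.Random()
    slots = n_measures * beats_per_measure * 4  # four semiquaver slots per beat
    pattern = []
    for _ in range(n_voices):
        voice = []
        for _ in range(slots):
            if rng.random() < onset_prob:  # Bernoulli trial: onset or not
                # Stressed or unstressed with equal probability.
                voice.append(rng.choice(("stressed", "unstressed")))
            else:
                voice.append(None)
        pattern.append(voice)
    return pattern


pattern = random_pattern(rng=random.Random(1))
```

Each such pattern would then be rendered with instrument samples at a tempo drawn uniformly from the chosen range, repeated to fill the recording, and passed to the system for classification.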

To evaluate the internal model of the system for Jive, we perform the above with audio samples of instruments typical to Jive: kick, snare, tom, and hat. Furthermore, we restrict the meter to be quadruple, make sure a stressed kick occurs on the first beat of each measure, and set the tempo range to BPM. These are conditions most advantageous to the system, considering what it has learned about Jive in BALLROOM. Of 6020 synthetic recordings produced this way, DeSPerF-BALLROOM classifies 447 with maximum confidence. Of these, 128 are classified as Jive, 122 are classified as Waltz, 79 as Tango, and the remainder in the four other classes. Figure 17 shows four of them selected at random. Even with these favourable settings, it is difficult to hear in any of the recordings similarity to the rhythmic patterns of which they are supposedly representative. We find similar outcomes for the other labels of BALLROOM. In general, we find it incredibly difficult to coax anything from DeSPerF-BALLROOM that resembles the rhythmic patterns in BALLROOM.

(a) Pattern 3338, classified with maximum confidence as Jive
(b) Pattern 2982, classified with maximum confidence as Quickstep
(c) Pattern 5519, classified with maximum confidence as Tango
(d) Pattern 2684, classified with maximum confidence as Waltz
Figure 17: Some rhythmic patterns classified most confidently in the given class by DeSPerF-BALLROOM.

6 Discussion

To explain Fig. 1, to seek the cause of the behaviour of DeSPerF-BALLROOM, we have dissected the system, analysed its training and testing dataset, and conducted several experiments. We see from the first experiment that the performance of DeSPerF-BALLROOM relies critically on cues of a temporal nature. The results of the second experiment reveal the inclinations of the system to confidently label in particular ways recordings that all exhibit, arguably, the same and most simple rhythmic pattern but with different onset rates. It also suggests that DeSPerF-BALLROOM is somehow adjusting its perception of tempo, or something highly correlated with tempo, for recordings of some labels in BALLROOM. The results of the third experiment show how little the system’s behaviour changes when we introduce longer-period repetitions in the recordings – a basic form of meter. The independent variable of onset rate appears to trump the influence of the stress pattern. The fourth experiment shows how the system selects many classes for music exhibiting the same repetitive rhythmic patterns, just with different tempi. We also find some narrow tempo ranges in which the system classifies in the same way all test recordings of one label. Finally, the last experiment shows that the system confidently produces rhythmic patterns that do not clearly reflect those heard in BALLROOM. All of this points to the conclusion that Fig. 1 is not caused by, and does not reflect, an intelligence about rhythmic patterns. The task DeSPerF-BALLROOM is performing is not the identification of rhythmic patterns heard in music recordings. Instead, Fig. 1 appears to be caused by the exploitation of some cue highly related to the confounding of tempo with label in BALLROOM, which the system has through no fault of its own learned from its teaching materials. In summary, DeSPerF-BALLROOM is identifying rhythmic patterns as well as Clever Hans was solving arithmetic.

One can of course say Table 4.2 is proof that tempo is extremely relevant for ballroom dance music classification. Supported by such formal rules, as well as the increased reproduction of ground truth observed in BALLROOM when tempo is used as a feature, Dixon et al. (2004) write, “tempo is one of the most important features in determining dance genre” Dixon et al. (2003); Gouyon et al. (2004). Hence, one is tempted to claim that though the system uses some cue highly related to tempo, it makes little difference. There are four problems with this claim. First, one can argue that tempo and rhythm are intimately connected, but in practice they seem to be treated separately. For instance, the rhythmic pattern features proposed by Dixon et al. (2004) are tempo invariant. In their work on measuring rhythmic similarity, Holzapfel and Stylianou (2008) use dynamic time warping to compare rhythms independent of tempo (further refined in Holzapfel and Stylianou (2009)). Second, Table 4.2 describes eligibility for music to be allowed in a competition of particular dance styles, and not for music or its rhythmic patterns to be given a stylistic label. Indeed, Fig. 8 shows that several recordings in BALLROOM break the criteria set forth by World Sport Dance Federation (2014). Third, this claim moves the goal line after the fact. Section 4 shows that though BALLROOM poses many different tasks, the task originally intended by Dixon et al. (2004) is to extract and learn “repetitive rhythmic patterns” from recorded music audio, and not to classify ballroom dance music. Finally, the claim that tempo is extremely relevant for ballroom dance music classification works against the aims of developing music content analysis systems. If the information or composition needs of a user involve rhythmic patterns characteristic of ballroom dance music styles, then DeSPerF-BALLROOM will contribute little of value despite its impressive and human-like FoM in Fig. 1. The hope is that DeSPerF-BALLROOM has learned to model rhythmic patterns. The reality is that it is not recognising rhythmic patterns.

Automatically constructing a working model, or theory, that explains a collection of real-world music examples has been called “a great intellectual challenge” with major repercussions Dubnov et al. (2003). As observed by Eigenfeldt et al. 2013; 2013a; 2013b, applying a machine learning algorithm to learn relationships among and rules of the music in a dataset (corpus) is in the most abstract sense automated meta-creation: a machine learns the “rules from which to generate new art” Eigenfeldt et al. (2014). This same sentiment is echoed in other domains, such as computer vision Dosovitskiy et al. (2014); Nguyen et al. (2015), written language Shannon and Weaver (1998); Ghedini et al. (2015), and the recent “zero resource speech challenge,” in which a machine listening system must learn basic elements of spoken natural language, e.g., phonemes and words. In fact, the automatic modelling of music style is a pursuit far older and more successful in the symbolic domain than in the domain of audio signal processing Hiller and Isaacson (1959); Cope (1991); Roads (1996); Dubnov et al. (2003); Pachet (2003, 2011); Collins (2010); Argamon et al. (2010); Dubnov and Surges (2014); Eigenfeldt (2012); Eigenfeldt and Pasquier (2013b); Eigenfeldt (2013). One reason for the success of music style emulation in the symbolic domain is that notated music is automatically on a plane more meaningful than samples of an audio signal, or features derived from such basic representations. It is closer to “the musical surface” Dubnov et al. (2003); Dubnov and Surges (2014). In his work on the algorithmic emulation of electronic dance music, Eigenfeldt (2013) highlights some severe impediments arising from working with music audio recordings: reliability, interpretability, and usability. They found that the technologies offered so far by content-based music information retrieval do not yet provide suitably rich and meaningful representations from which a machine can learn about music. Eigenfeldt (2013) thus bypasses these problems by sacrificing scalability, and approaching the automated style modelling of electronic dance music in the symbolic domain by first transcribing by hand a corpus of dance music Eigenfeldt and Pasquier (2013b); Eigenfeldt (2013).

Another reason why the pursuit of style detection, understanding, and emulation in the symbolic domain has seen substantial success whereas that in the audio domain has not is the relevance of evaluation practices in each domain. A relevant evaluation of success toward the pursuit of music understanding is how well a system can create “new art” that reflects its training Eigenfeldt et al. (2014). As with the “continuator” Pachet (2003) – where a computer agent “listens” to the performance of a musician, and then continues where the musician leaves off – the one being emulated becomes the judge. This is also the approach used by Dannenberg et al. (1997) in their music style recognition system, which sidesteps the thorny issue of having to define what is being emulated or recognised. Unfortunately, much research in developing music content analysis systems has approached the evaluation of such technologies in ways that, while convenient, widely accepted, and precise, are not relevant. In essence, the proof of good pudding is in its eating, not in the fact that its ingredients were precisely measured.

Among the nearly 500 publications about the automatic recognition of music genre or style Sturm (2014c), only a few works evaluate the internal models learned by a system by looking at the music it composes. Cruz and Vidal-Ruiz (2003) construct a system that attempts to learn language models from notated music melodies in a variety of styles (Gregorian, Baroque, Ragtime). They implement these models as finite state automata, and then use them to generate exemplary melodies in each style. As in Fig. 17, Cruz and Vidal-Ruiz (2003) provide examples of the produced output, and reflect on the quality of the results (which they expand upon in a journal article Cruz and Vidal (2008)). In the audio domain, Sturm (2012) employs a brute force approach to exploring the sanity of the learned models of two different state-of-the-art music content analysis systems producing high FoM in a benchmark music genre dataset. He generates random recordings from sample loops, has each system classify them, and keeps only those made with high confidence. From a listening experiment, he finds that people cannot identify the genres of those representative excerpts. (Footnote 10: It is entirely likely that I have missed relevant references from the symbolic domain for genre/style recognition/emulation.) In a completely different domain, similar approaches have recently been used to test the sanity of the internal models of high-performing image content recognition systems Szegedy et al. (2014); Dosovitskiy et al. (2014); Nguyen et al. (2015).
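The brute force probing attributed above to Sturm (2012) can be pictured with a short sketch. It is only an illustration of the generate, classify, and keep-if-confident loop; the names `probe_confident_inputs`, `make_random_input`, `toy_classify` and the list `LOOPS` are hypothetical stand-ins, not part of any system discussed in this article, and the toy classifier has nothing to do with music.

```python
import random


def probe_confident_inputs(classify, make_random_input, n_trials=1000,
                           threshold=0.9, seed=0):
    """Generate random inputs and keep those the system labels confidently.

    `classify` maps an input to (label, confidence in [0, 1]);
    `make_random_input` maps a random.Random instance to an input.
    Both are supplied by the caller and are assumptions of this sketch.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_trials):
        x = make_random_input(rng)
        label, confidence = classify(x)
        if confidence >= threshold:
            kept.append((x, label, confidence))
    return kept


# Hypothetical stand-ins so that the sketch runs end to end.
LOOPS = ["kick", "snare", "hat", "bass", "guitar", "strings"]


def make_random_input(rng):
    # A "recording" is reduced to a random combination of four loop names.
    return tuple(rng.choice(LOOPS) for _ in range(4))


def toy_classify(x):
    # Toy rule: confidence is the share of the most frequent loop.
    top = max(sorted(set(x)), key=x.count)
    return "class-" + top, x.count(top) / len(x)


if __name__ == "__main__":
    kept = probe_confident_inputs(toy_classify, make_random_input,
                                  n_trials=500, threshold=0.75)
    print(len(kept), "inputs kept for auditioning")
```

The kept inputs are the material one would then audition or hand to listeners; the loop itself says nothing about whether the confident labels make musical sense.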

The results of our analysis and experiments with DeSPerF-BALLROOM clearly do not support rejecting the hypothesis that this system is a “horse” with respect to identifying rhythmic patterns; but what about the DeSPerF-based systems that reproduced the most ground truth in the 2013 MIREX edition of the “Audio Latin Music Genre classification task” (ALGC)? Can we now conclude that their winning performance was not caused by “musical intelligence,” but by the exploitation of some tempo-like cue? In the case of the LMD dataset used in ALGC, the task appears to be “musical genre classification” Silla et al. (2008). Silla et al. (2008) reference Fabbri (1999) to define “genre:” “a kind of music, as it is acknowledged by a community for any reason or purpose or criteria.” In particular to LMD, the community acknowledging these “kinds” of music was represented by two “professional teachers with over ten years of experience in teaching ballroom and Brazilian cultural dances” Silla et al. (2008). These professionals selected commercial recordings of music “that they judged representative of a specific genre, according to how that musical recording is danced.” The appendix to Silla et al. (2008) gives characteristics of the music genres in LMD, many of which should be entirely outside the purview of any audio-based system, e.g., aspects of culture, topic, geography, and dance moves. We cannot say what the cue in LMD is – and tempo currently does not appear to be a confound Esparza et al. (2014) – but the default position in light of the poor evidence contributed by the amount of ground truth reproduced must be that the system is not yet demonstrated to possess the “intelligence” relevant for a specific task. Valid experiments are needed to claim otherwise Urbano et al. (2013).

The task of creating Fig. 7 was laborious. Identifying these rhythmic patterns relies on experience in listening to mixtures of voices and separating instruments, listening comparatively to collections of music recordings, memory, expectation, musical practice, physicality, and so on. Constructing an artificial system that can automatically do something like this for an arbitrarily large collection of music audio recordings will surely produce major advances in machine listening and creativity Dubnov et al. (2003). In proportion, evidence for such abilities must be just as outstanding – much more so than achieving 100% on the rather tepid multiple choice exam. It is of course the hope that DeSPerF-BALLROOM has learned from a collection of music recordings general models of the styles tersely represented by the labels; and indeed, “One of machine learning’s main purposes is to create the capability to sensibly generalize” Dubnov et al. (2003). The results in Fig. 1 just do not provide valid evidence for such a conclusion; they do not even provide evidence that such capabilities are within reach. Similarly, we are left to question all results in Table 4.1: which of these are “horses” like DeSPerF-BALLROOM, and which are solutions, for identifying rhythmic patterns? What problem is each actually solving, and how is it related to music? Which can be useful for connecting users with music and information about music? Which can facilitate creative pursuits? Returning to the formalism presented in Section 2, for what use cases can each system actually be of benefit? One might say that any system using musically interpretable features is likely a solution. For instance, the features employed by Dixon et al. (2004) are essentially built from bar-synchronised decimated amplitude envelopes, and are interpretable with respect to the rhythmic characteristics of the styles in BALLROOM. However, as seen at the end of Section 3.1, SPerF are musically interpretable as well. One must look under the hood, and design, implement and analyse experiments that have the validity to test to the objective.

Ascribing too much importance to the measurement and comparison of the amounts of ground truth reproduced – a practice that appears in a vast majority of publications in music genre recognition Sturm (2013, 2014c) – is an impediment to progress. Consider a system trained and tested in BALLROOM that has actually learned to recognise rhythmic patterns characteristic of waltz, but has trouble with any rhythmic patterns not in triple meter. Auditioning BALLROOM demonstrates that all observations not labeled Waltz have a duple or quadruple meter. If such a system correctly classifies all Waltz test recordings based on rhythmic patterns, but chooses randomly for all others, we expect its normalised accuracy to be about %. This is double that expected of a random selection, but far below the accuracy seen in Fig. 1. It is thus not difficult to believe the low-performing system would be tossed for DeSPerF-BALLROOM, or even let pass through peer review, even though it is actually the case that the former system is addressing the task of rhythmic pattern recognition, while the latter is just a “horse.” Such a warning has been given before: “an improved general music similarity algorithm might even yield lower accuracies” Pohle et al. (2009). No system should be left behind because of invalid experiments.
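The "double that expected of a random selection" figure in the waltz example can be checked with a small calculation. The sketch below assumes that normalised accuracy means the mean of the per-class recalls, and that "chooses randomly" means a uniform choice among the non-Waltz labels; both are assumptions of this sketch, as are the function names. The number of classes is left as a parameter rather than fixed.

```python
from fractions import Fraction


def normalised_accuracy(recalls):
    """Mean of per-class recalls (assumed meaning of normalised accuracy)."""
    return sum(recalls) / len(recalls)


def waltz_only_system(num_classes):
    """Expected recalls of the hypothetical system in the text.

    Recall is 1 for Waltz; every other recording gets a label drawn
    uniformly from the remaining num_classes - 1 labels (an assumption).
    """
    others = num_classes - 1
    return [Fraction(1)] + [Fraction(1, others)] * others


def chance_level(num_classes):
    """Expected normalised accuracy of uniform random labelling."""
    return Fraction(1, num_classes)


if __name__ == "__main__":
    for k in range(2, 11):
        acc = normalised_accuracy(waltz_only_system(k))
        print(k, acc, chance_level(k))
```

Under these assumptions the expected normalised accuracy is 2/K for K classes, exactly twice the 1/K of random selection, whatever K is.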

Many interesting questions arise from our work. What will happen when SPerF are made tempo-invariant? What will happen if the tempo confounding in BALLROOM is removed? One can imagine augmenting the training dataset by performing many different pitch-preserving time stretching transformations, or making all recordings have the same tempo. Will the resulting system then learn to identify repetitive rhythmic patterns? Or will it only appear so by use of another cue? Another question is what the DNN contributes. In particular, DNN have been claimed to be able to learn to “listen” to music in a hierarchical fashion Hamel and Eck (2010); Humphrey et al. (2013); Deng and Yu (2014). If a DNN-based system is actually addressing the task of identifying rhythmic patterns, how does this hierarchical listening manifest? Is it over beats, figures, and bars? This also brings up the question, “why learn at all?” Should we expect the system to acquire what is readily available from experts? Why not use expert knowledge, or at least leverage automated learning with an expert-based system? Finally, the concept of meta-creation motivates new evaluation methods Thomas et al. (2013), both in determining the sanity of a system’s internal models and in meaningfully comparing these models. Meta-creation essentially motivates the advice of hiring the system to do the accounting in order to reveal the “horse.” Valid evaluation approaches will undoubtedly require more effort on the part of the music content analysis system developer, but validity is simply non-negotiable.
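The two manipulations suggested above, time-stretch augmentation and tempo equalisation, both reduce to choosing a playback-rate factor per recording. The following sketch computes only those factors; it assumes a tempo annotation in beats per minute is available for each recording, and it leaves the actual pitch-preserving stretching to whatever tool is used. All function names and the example tempi are hypothetical.

```python
def stretch_rate(source_bpm, target_bpm):
    """Rate factor moving a recording from source_bpm to target_bpm.

    A rate above 1 makes the recording faster (shorter); below 1, slower.
    """
    if source_bpm <= 0 or target_bpm <= 0:
        raise ValueError("tempi must be positive")
    return target_bpm / source_bpm


def augmentation_rates(source_bpm, target_bpms):
    """Rates that render one recording at several target tempi."""
    return [stretch_rate(source_bpm, t) for t in target_bpms]


def equalising_rates(tempi_bpm, common_bpm):
    """Rates that bring every recording in a collection to one tempo."""
    return {name: stretch_rate(bpm, common_bpm)
            for name, bpm in tempi_bpm.items()}


if __name__ == "__main__":
    # Made-up tempi, for illustration only.
    print(augmentation_rates(120, [60, 120, 180]))
    print(equalising_rates({"rec_a": 90, "rec_b": 180}, 120))
```

Either set of rates removes tempo as a reliable cue only to the extent that the tempo annotations are accurate, which is itself an assumption.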

7 Conclusion

The first supplement in Pfungst (1911) describes the careful and strict methods used to teach the horse Clever Hans over the course of four years to read letters and numerals, and then to solve simple problems of arithmetic. When Clever Hans had learned these basics, had time “to discover a great deal for himself,” and began to give solutions to unique problems that were not part of his training, his handler believed “he had succeeded in inculcating the inner meaning of the number concepts, and not merely an external association of memory images with certain movement responses” Pfungst (1911). Without knowing the story of Clever Hans, it seems quite reasonable to conclude that since it is highly unlikely for DeSPerF-BALLROOM to achieve the FoM in Fig. 1 by luck alone, then it must have learned rhythmic patterns in the recorded music in BALLROOM. As in the case of Clever Hans’s tutor, there are four problems with such a conclusion.

First, this unjustifiably anthropomorphises the system of Fig. 1. For instance, someone who does not know better might believe that a stereo system must be quite a capable musician because they hear it play music. There is no evidence that the criteria and rules used by this system – the ones completely obfuscated by the cascade of compressed affine linear transformations described in Section 3.2 – are among those that a human uses to discriminate between and identify style in music listening. Second, one makes the assumption that the semantics of the labels of the dataset refer to some quality called “style” or “rhythmic pattern.” This thus equates, “learning to map statistics of a sampled time series to tokens,” and “learning to discriminate between and identify styles that manifest in recorded music.” Third, underpinning this conclusion is the assumption that the tutoring was actually teaching the skills desired. In the case of the system of Fig. 1, the tutoring actually proceeds by asking the DNN a question (inputting an element of the training dataset together with its ground truth label), comparing its output to the target (the standard basis vector with a one in the row associated with that label and zero everywhere else), then adapting all of its parameters in an optimal direction toward that target, and finally repeating. While this “pedagogy” is certainly strict and provably optimal with respect to specific objectives Deng and Yu (2014); Hastie et al. (2009), its relationship to “learning to discriminate between and identify styles” is not clear. Repeatedly forcing Hans to tap his hoof twice is not so clearly teaching him about the “inner meaning of the number concept” 2. Fourth, and most significantly, this conclusion implicitly and incorrectly assumes that the results of Fig. 1 have only two possible explanations: luck or “musical intelligence.” The story of Clever Hans shows just how misguided such a belief can be.
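The tutoring loop described in the third point (ask, compare with a one-hot target, adjust, repeat) can be sketched in a few lines. This is a deliberately reduced stand-in and not the system of Fig. 1: it swaps the deep network for a single affine layer followed by a softmax, and it assumes a cross-entropy loss with plain stochastic gradient descent, neither of which is specified in this section. All names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label, num_classes):
    """Standard basis vector: one in the row of the label, zero elsewhere."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def forward(weights, biases, x):
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


def train_step(weights, biases, x, label, lr=0.5):
    """One lesson: compare output with the target, nudge every parameter."""
    p = forward(weights, biases, x)
    t = one_hot(label, len(biases))
    for k in range(len(biases)):
        err = p[k] - t[k]  # cross-entropy gradient with respect to score k
        biases[k] -= lr * err
        for j in range(len(x)):
            weights[k][j] -= lr * err * x[j]
    return -math.log(p[label])  # loss measured before the update


def train(examples, num_classes, num_features, epochs=50, lr=0.5):
    weights = [[0.0] * num_features for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, label in examples:
            train_step(weights, biases, x, label, lr)
    return weights, biases
```

Nothing in this loop refers to rhythm, style, or music: it rewards only agreement between output and token, so any input statistic that separates the tokens will serve.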

The usefulness of any music content analysis system depends on what task it is actually performing, what problem it is actually solving. BALLROOM at first appears to explicitly pose a clear problem; but we now see that there exist several ways to reproduce its ground truth – each of which involves a different task, e.g., rhythmic pattern recognition, tempo detection, instrument recognition, and/or ones that have no concrete relationship to music. We cannot tell which task DeSPerF-BALLROOM is performing just from looking at Fig. 1. While comparing the output of a music content analysis system to the ground truth of a dataset is convenient, it simply does not distinguish between “horses” and solutions Sturm (2013, 2014a). It does not produce valid evidence of intelligence. That is, we cannot know whether the system is giving the right answers for the wrong reasons. Just as Clever Hans appeared to be solving problems of arithmetic – what can be more explicit than asking a horse to add 1 and 1? – the banal task he was actually performing, unbeknownst to many save himself, was “make the nice man feed me.” The same might be true, metaphorically speaking, for the systems in Table 4.1.

Thank you to Aggelos Pikrakis, Corey Kereliuk, Jan Larsen, and the anonymous reviewers. I dedicate this article to the memory of Alan Young (1919-2016), principal actor of the TV show, “Mr. Ed.”


  • Argamon et al. (2010) S. Argamon, K. Burns, and S. Dubnov (Eds.). 2010. The Structure of Style: Algorithmic Approaches to Understanding Manner and Meaning. Springer.
  • Aucouturier (2009) J.-J. Aucouturier. 2009. Sounds like teen spirit: Computational insights into the grounding of everyday musical terms. In Language, Evolution and the Brain: Frontiers in Linguistic Series, J. Minett and W. Wang (Eds.). Academia Sinica Press.
  • Bergstra et al. (2006) J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl. 2006. Aggregate features and AdaBoost for music classification. Machine Learning 65, 2-3 (June 2006), 473–484.
  • Casey et al. (2008) M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, and M. Slaney. 2008. Content-based Music Information Retrieval: Current Directions and Future Challenges. Proc. IEEE 96, 4 (Apr. 2008), 668–696.
  • Collins (2010) Nick Collins. 2010. Introduction to Computer Music. Wiley.
  • Cope (1991) D. Cope. 1991. Computers and Musical Style. Oxford University Press.
  • Cruz and Vidal-Ruiz (2003) P.P. Cruz and E. Vidal-Ruiz. 2003. Modeling musical style using grammatical inference techniques: a tool for classifying and generating melodies. In Proc. WEDELMUSIC. 77–84.
  • Cruz and Vidal (2008) Pedro P. Cruz and Enrique Vidal. 2008. Two Grammatical Inference Applications in Music Processing. Applied Artificial Intell. 22, 1/2 (2008), 53–76.
  • Cunningham et al. (2012) S. J. Cunningham, D. Bainbridge, and J. S. Downie. 2012. The Impact of MIREX on Scholarly Research. In Proc. ISMIR. 259–264.
  • Dannenberg et al. (1997) R. B. Dannenberg, B. Thom, and D. Watson. 1997. A Machine Learning Approach to Musical Style Recognition. In Proc. ICMC. 344–347.
  • Deng and Yu (2014) L. Deng and D. Yu. 2014. Deep Learning: Methods and Applications. Now Publishers.
  • Dixon et al. (2004) S. Dixon, F. Gouyon, and G. Widmer. 2004. Towards characterisation of music via rhythmic patterns. In Proc. ISMIR. 509–517.
  • Dixon et al. (2003) S. Dixon, E. Pampalk, and G. Widmer. 2003. Classification of Dance Music by Periodicity Patterns. In Proc. ISMIR.
  • Dosovitskiy et al. (2014) Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox. 2014. Learning to Generate Chairs with Convolutional Neural Networks. CoRR abs/1411.5928 (2014).
  • Downie et al. (2010) J. Downie, Andreas Ehmann, Mert Bay, and M. Jones. 2010. The Music Information Retrieval Evaluation eXchange: Some Observations and Insights. In Advances in Music Information Retrieval, Zbigniew Ras and Alicja Wieczorkowska (Eds.). Springer Berlin / Heidelberg, 93–115.
  • Downie (2004) J. S. Downie. 2004. The scientific evaluation of music information retrieval systems: Foundations and future. Computer Music Journal 28, 2 (2004), 12–23.
  • Downie (2008) J. S. Downie. 2008. The music information retrieval evaluation exchange (2005–2007): A window into music information retrieval research. Acoustical Science and Tech. 29, 4 (2008), 247–255.
  • Dubnov et al. (2003) S. Dubnov, G. Assayag, O. Lartillot, and G. Bejerano. 2003. Using machine-learning methods for musical style modeling. Computer 36, 10 (2003), 73–80.
  • Dubnov and Surges (2014) S. Dubnov and G. Surges. 2014. Digital Da Vinci. Springer, Chapter Delegating Creativity: Use of Musical Algorithms in Machine Listening and Composition, 127–158.
  • Eigenfeldt (2012) A. Eigenfeldt. 2012. Embracing the Bias of the Machine: Exploring Non-Human Fitness Functions. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment.
  • Eigenfeldt (2013) A. Eigenfeldt. 2013. The Human Fingerprint in Machine Generated Music. In Proc. xCoAx.
  • Eigenfeldt and Pasquier (2013a) A. Eigenfeldt and P. Pasquier. 2013a. Considering vertical and horizontal context in corpus-based generative electronic dance music. In Proc. Int. Conf. Computational Creativity.
  • Eigenfeldt and Pasquier (2013b) A. Eigenfeldt and P. Pasquier. 2013b. Evolving Structures for Electronic Dance Music. In Proc. Conf. Genetic and Evolutionary Computation. 319–326.
  • Eigenfeldt et al. (2014) A. Eigenfeldt, M. Thorogood, J. Bizzocchi, P. Pasquier, and T. Calvert. 2014. Video, Music and Sound Metacreation. In Proc. xCoAx. Porto, Portugal.
  • Esparza et al. (2014) T. M. Esparza, J. P. Bello, and E. J. Humphrey. 2014. From genre classification to rhythm similarity: Computational and musicological insights. J. New Music Research (2014).
  • Fabbri (1999) F. Fabbri. 1999. Browsing Musical Spaces: Categories and the musical mind. In Proc. Int. Association for the Study of Popular Music.
  • Flexer et al. (2006) A. Flexer, F. Gouyon, S. Dixon, and G. Widmer. 2006. Probabilistic Combination of Features for Music Classification. In Proc. ISMIR. Victoria, BC, Canada, 111–114.
  • Fu et al. (2011) Z. Fu, G. Lu, K. M. Ting, and D. Zhang. 2011. A Survey of Audio-Based Music Classification and Annotation. IEEE Trans. Multimedia 13, 2 (Apr. 2011), 303–319.
  • Ghedini et al. (2015) F. Ghedini, F. Pachet, and P. Roy. 2015. Multidisciplinary Contributions to the Science of Creative Thinking. Springer, Chapter Creating Music and Texts with Flow Machines.
  • Goehr (1994) L. Goehr. 1994. The Imaginary Museum of Musical Works: An Essay in the Philosophy of Music. Oxford University Press.
  • Gouyon (2005) F. Gouyon. 2005. A computational approach to rhythm description — Audio features for the computation of rhythm periodicity functions and their use in tempo induction and music content processing. Ph.D. Dissertation. Universitat Pompeu Fabra.
  • Gouyon and Dixon (2004) F. Gouyon and S. Dixon. 2004. Dance Music Classification: A Tempo-Based Approach. In Proc. ISMIR. 501–504.
  • Gouyon et al. (2004) F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer. 2004. Evaluating rhythmic descriptors for musical genre classification. In Proc. Int. Audio Eng. Soc. Conf. 196–204.
  • Gouyon et al. (2013) F. Gouyon, B. L. Sturm, J. L. Oliveira, N. Hespanhol, and T. Langlois. 2013. On evaluation in music autotagging research. (submitted) (2013).
  • Hamel and Eck (2010) P. Hamel and D. Eck. 2010. Learning features from music audio with deep belief networks. In Proc. ISMIR. 339–344.
  • Hastie et al. (2009) T. Hastie, R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2 ed.). Springer-Verlag.
  • Hiller and Isaacson (1959) L. Hiller and L. Isaacson. 1959. Experimental Music: Composition with an Electronic Computer. Greenwood Press.
  • Holzapfel and Stylianou (2008) A. Holzapfel and Y. Stylianou. 2008. Rhythmic similarity of music based on dynamic periodicity warping. In Proc. ICASSP. 2217–2220.
  • Holzapfel and Stylianou (2009) A. Holzapfel and Y. Stylianou. 2009. A Scale Based Method for Rhythmic Similarity of Music. In Proc. ICASSP. Taipei, Taiwan, 317–320.
  • Humphrey et al. (2013) E. J. Humphrey, J. P. Bello, and Y. LeCun. 2013. Feature Learning and Deep Architectures: New Directions for Music Informatics. J. Intell. Info. Systems 41, 3 (2013), 461–481.
  • Krebs et al. (2013) F. Krebs, S. Böck, and G. Widmer. 2013. Rhythmic Pattern Modeling for Beat and Downbeat Tracking in Musical Audio. In Proc. ISMIR.
  • Lidy (2006) T. Lidy. 2006. Evaluation of New Audio Features and Their Utilization in Novel Music Retrieval Applications. Master’s thesis. Vienna University of Tech., Vienna, Austria.
  • Lidy et al. (2010) T. Lidy, R. Mayer, A. Rauber, P. P. de Leon, A. Pertusa, and J. Quereda. 2010. A Cartesian Ensemble of Feature Subspace Classifiers for Music Categorization. In Proc. ISMIR. 279–284.
  • Lidy and Rauber (2005) T. Lidy and A. Rauber. 2005. Evaluation of Feature Extractors and Psycho-Acoustic Transformations for Music Genre Classification. In Proc. ISMIR. 34–41.
  • Lidy and Rauber (2008) Thomas Lidy and Andreas Rauber. 2008. Classification and Clustering of Music for Novel Music Access Applications. In Machine Learning Techniques for Multimedia, Matthieu Cord and Pádraig Cunningham (Eds.). Springer Berlin / Heidelberg, 249–285.
  • Lidy et al. (2007) T. Lidy, A. Rauber, A. Pertusa, and J. M. Iñesta. 2007. Improving Genre Classification by Combination of Audio and Symbolic Descriptors Using a Transcription System. In Proc. ISMIR. Vienna, Austria, 61–66.
  • Mayer et al. (2010) R. Mayer, A. Rauber, P. J. Ponce de León, C. Pérez-Sancho, and J. M. Iñesta. 2010. Feature selection in a cartesian ensemble of feature subspace classifiers for music categorisation. In Proc. ACM Int. Workshop Machine Learning and Music. 53–56.
  • Nguyen et al. (2015) A. Nguyen, J. Yosinski, and J. Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proc. CVPR. 427–436.
  • Pachet (2003) Francois Pachet. 2003. The Continuator: Musical Interaction with Style. Journal of New Music Research 32, 3 (2003), 333–341.
  • Pachet (2011) F. Pachet. 2011. Musical Metadata and Knowledge Management. In Encyclopedia of Knowledge Management, David G. Schwartz and Dov Te’eni (Eds.). IGI Global, 1192–1199.
  • Peeters (2005) G. Peeters. 2005. Rhythm classification using spectral rhythm patterns. In Proc. ISMIR.
  • Peeters (2011) G. Peeters. 2011. Spectral and Temporal Periodicity Representations of Rhythm for the Automatic Classification of Music Audio Signal. IEEE Trans. Audio, Speech, Lang. Process. 19, 5 (July 2011), 1242–1252.
  • Pfungst (1911) O. Pfungst. 1911. Clever Hans (The horse of Mr. Von Osten): A contribution to experimental animal and human psychology. Henry Holt, New York.
  • Pikrakis (2013) A. Pikrakis. 2013. A deep learning approach to rhythm modeling with applications. In Proc. Int. Workshop Machine Learning and Music.
  • Pohle et al. (2009) T. Pohle, D. Schnitzer, M. Schedl, P. Knees, and G. Widmer. 2009. On rhythm and general music similarity. In Proc. ISMIR.
  • Roads (1996) C. Roads. 1996. Computer Music Tutorial. The MIT Press.
  • Scaringella et al. (2006) N. Scaringella, G. Zoia, and D. Mlynek. 2006. Automatic Genre Classification of Music Content: A Survey. IEEE Signal Process. Mag. 23, 2 (Mar. 2006), 133–141.
  • Schindler and Rauber (2012) A. Schindler and A. Rauber. 2012. Capturing the temporal domain in Echonest Features for improved classification effectiveness. In Proc. Adaptive Multimedia Retrieval.
  • Schlüter and Osendorfer (2011) J. Schlüter and C. Osendorfer. 2011. Music Similarity Estimation with the Mean-Covariance Restricted Boltzmann Machine. In Proc. ICMLA.
  • Schnitzer et al. (2011) Dominik Schnitzer, Arthur Flexer, Markus Schedl, and Gerhard Widmer. 2011. Using Mutual Proximity to Improve Content-Based Audio Similarity. In Proc. ISMIR. 79–84.
  • Schnitzer et al. (2012) D. Schnitzer, A. Flexer, M. Schedl, and G. Widmer. 2012. Local and global scaling reduce hubs in space. J. Machine Learning Res. 13 (2012), 2871–2902.
  • Seyerlehner (2010) K. Seyerlehner. 2010. Content-based Music Recommender Systems: Beyond Simple Frame-level Audio Similarity. Ph.D. Dissertation. Johannes Kepler University, Linz, Austria.
  • Seyerlehner et al. (2012) K. Seyerlehner, M. Schedl, R. Sonnleitner, D. Hauger, and B. Ionescu. 2012. From improved auto-taggers to improved music similarity measures. In Proc. Adaptive Multimedia Retrieval. Copenhagen, Denmark.
  • Seyerlehner et al. (2010) K. Seyerlehner, G. Widmer, and T. Pohle. 2010. Fusing block-level features for music similarity estimation. In Proc. DAFx. 1–8.
  • Shannon and Weaver (1998) C. E. Shannon and W. Weaver. 1998. The Mathematical Theory of Communication. University of Illinois Press.
  • Silla et al. (2008) C. N. Silla, A. L. Koerich, and C. A. A. Kaestner. 2008. The Latin Music Database. In Proc. ISMIR. 451–456.
  • Slaney (1998) M. Slaney. 1998. Auditory Toolbox. Technical Report. Interval Research Corporation.
  • Song et al. (2012) Y. Song, S. Dixon, and M. Pearce. 2012. Evaluation of Musical Features for Emotion Classification. In Proc. ISMIR.
  • Sturm (2012) B. L. Sturm. 2012. Two Systems for Automatic Music Genre Recognition: What Are They Really Recognizing?. In Proc. ACM MIRUM Workshop. 69–74.
  • Sturm (2013) B. L. Sturm. 2013. Classification Accuracy Is Not Enough: On the Evaluation of Music Genre Recognition Systems. J. Intell. Info. Systems 41, 3 (2013), 371–406.
  • Sturm (2014a) B. L. Sturm. 2014a. A simple method to determine if a music information retrieval system is a “horse”. IEEE Trans. Multimedia 16, 6 (2014), 1636–1644.
  • Sturm (2014b) B. L. Sturm. 2014b. The State of the Art Ten Years After a State of the Art: Future Research in Music Information Retrieval. J. New Music Research 43, 2 (2014), 147–172.
  • Sturm (2014c) B. L. Sturm. 2014c. A Survey of Evaluation in Music Genre Recognition. In Adaptive Multimedia Retrieval: Semantics, Context, and Adaptation, A. Nürnberger, S. Stober, B. Larsen, and M. Detyniecki (Eds.), Vol. LNCS 8382. 29–66.
  • Sturm et al. (2014) B. L. Sturm, C. Kereliuk, and A. Pikrakis. 2014. A Closer Look at Deep Learning Neural Networks with Low-level Spectral Periodicity Features. In Proc. Int. Workshop on Cognitive Info. Process. 1–6.
  • Su et al. (2014) Li Su, C.-C.M. Yeh, Jen-Yu Liu, Ju-Chiang Wang, and Yi-Hsuan Yang. 2014. A Systematic Evaluation of the Bag-of-Frames Representation for Music Information Retrieval. IEEE Trans. Multimedia 16, 5 (Aug. 2014), 1188–1200.
  • Szegedy et al. (2014) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. 2014. Intriguing properties of neural networks. In Proc. ICLR.
  • Thomas et al. (2013) N. G. Thomas, P. Pasquier, A. Eigenfeldt, and J. B. Maxwell. 2013. A methodology for the comparison of melodic generation models using meta-melo. In Proc. ISMIR.
  • Tsunoo et al. (2009) E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama. 2009. Audio Genre Classification Using Percussive Pattern Clustering Combined with Timbral Features. In Proc. ICME.
  • Tsunoo et al. (2011) E. Tsunoo, G. Tzanetakis, N. Ono, and S. Sagayama. 2011. Beyond Timbral Statistics: Improving Music Classification Using Percussive Patterns and Bass Lines. IEEE Trans. Audio, Speech, and Lang. Process. 19, 4 (May 2011), 1003–1014.
  • Turnbull et al. (2008) D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet. 2008. Semantic annotation and retrieval of music and sound effects. IEEE Trans. Audio, Speech, Lang. Process. 16 (2008).
  • Tzanetakis and Cook (2002) G. Tzanetakis and P. Cook. 2002. Musical genre classification of audio signals. IEEE Trans. Speech Audio Process. 10, 5 (July 2002), 293–302.
  • Urbano et al. (2013) J. Urbano, M. Schedl, and X. Serra. 2013. Evaluation in music information retrieval. J. Intell. Info. Systems 41, 3 (Dec. 2013), 345–369.
  • Wang (2003) A. Wang. 2003. An industrial strength audio search algorithm. In Proc. Int. Soc. Music Info. Retrieval.
  • Wiggins (2009) G. A. Wiggins. 2009. Semantic gap?? Schemantic Schmap!! Methodological Considerations in the Scientific Study of Music. In Proc. IEEE Int. Symp. Multimedia. 477–482.
  • World Sport Dance Federation (2014) World Sport Dance Federation 2014. World Sport Dance Federation Competition Rules. World Sport Dance Federation, Bucharest, Romania.