Automatic chord estimation (ACE) has been one of the major challenges in music informatics. It can be a significant subproblem in tasks such as cover song identification Bello , Lee , Serra et al. , music structural segmentation Bello and Pickens , and genre classification Cheng et al. , Pérez-Sancho et al. . It also has a critical role in problems such as audio key detection Papadopoulos and Tzanetakis , Pauwels and Martens  and downbeat estimation Papadopoulos and Peeters , Mauch and Dixon [2010a].
Human chord estimation experts have developed websites such as UltimateGuitar (ultimate-guitar.com), E-chords (e-chords.com) and many others (polygonguitar.blogspot.hk; chords-haven.blogspot.hk; azchords.com), where the chords of millions of songs can be found. To be useful for practical purposes (e.g., song covering, busking, rehearsal, performance), those chords are often captured in great detail, with a large chord vocabulary including suspensions, extensions, inversions and even alterations. As the music production rate grows, it is foreseeable that these chord services will increasingly rely on ACE technologies.
It is not uncommon to associate ACE closely with automatic speech recognition (ASR). We have witnessed advancements in ASR such as the hybrid approaches based on “Gaussian mixture model (GMM) + hidden Markov model (HMM)” (i.e., GMM-HMM)Huang et al. , Deng , and more recently with deep learning techniques Deng and Yu , Yu and Deng . Naturally, one would consider that ASR solutions can also be applied to ACE with similar success Sheh and Ellis . However, while ASR requires the algorithm to output a sequence of words, ACE requires the algorithm to output a time-aligned sequence of chords that matches the chord onsets in the input sequence. Therefore, unlike ASR, ACE requires both segmentation and classification techniques.
Most ACE systems have designs similar to those of ASR systems. However, by following the ASR tradition, they perform segmentation and classification in a single pass, overlooking a possible “divide and conquer” strategy that treats the two tasks separately. The problem scenario in ACE also differs from ASR in that chords are usually segmented rhythmically.
1.1 Brief Overview of ACE Systems
The ACE problem has been studied for around two decades. The very first approach Fujishima takes a sequence of pitch class profiles (PCP), or chromagram (a sequence of salience vectors for each of the twelve pitch classes) Wakefield , as the audio feature, and decodes the chord sequence using a template based mean-smoothing method. Several subsequent approaches Sheh and Ellis , Bello and Pickens replace this method by GMM-HMM based probabilistic models. Motivated by the success of GMM-HMM, many machine learning based ACE methods that derive the GMM-HMM parameters from data have emerged Weller et al. , Khadkevich and Omologo , Ni et al. , Cho and Bello .
Differences in implementation details notwithstanding, most traditional ACE systems are based on a similar architecture, in which the chromagram is extracted as the feature, and then a GMM-HMM-like model (or a conditional random field Burgoyne et al. , Weller et al. , or a Markov logic network Papadopoulos and Tzanetakis ) is used to decode the chromagram into a chord sequence. Amongst the most prominent of those systems are the dynamic-Bayesian-network based musical probabilistic system Mauch and Dixon [2010b], and its HMM version, Chordino Mauch [2010a].
Recently, deep learning based approaches have started to emerge in ACE. To very briefly recap, there are: 1. a convolutional neural network (CNN) based system Humphrey and Bello , that locally processes each input frame using a CNN and then globally post-processes the CNN outputs with median-filtering for a chord sequence; 2. a hybrid fully-connected neural network (FCNN) + recurrent neural network (RNN) system Boulanger-Lewandowski et al. 2015], and a hybrid deep belief network (DBN) + HMM system Zhou and Lerch , both of which are variants of the previous FCNN+RNN system. All of them have shown results comparable to or better than the state-of-the-art in metrics with major and minor triads.
1.2 ACE Evaluations
According to the reports (http://www.music-ir.org/mirex/wiki/MIREX_HOME) of the annual “music information retrieval evaluation exchange” Downie  (MIREX), ACE evaluations before 2013 had mainly focused on the “MajMin” vocabulary, which contains 12 major chords, 12 minor chords and an “NC” chord (representing anything that cannot be described by “chords”). Unfortunately, this vocabulary is far from covering all chords in practice.
To form a more complete evaluation, a necessary first step is to incorporate chord inversions and seventh chords into the vocabulary. Since 2013, there have been 4 evaluation vocabularies in MIREX ACE, including: “MajMinBass”, which contains “MajMin” and their first and second inversions; “Sevenths”, which contains 7, min7 and maj7 in addition to “MajMin”; and “SeventhsBass”, which contains “Sevenths” and all of their inversions. An ACE system can be designed to support any vocabulary, and the MIREX ACE evaluation tool Pauwels and Peeters , Raffel et al.  will try to perform the necessary chord mappings based on the evaluation vocabulary being used.
1.3 Chord Inversions
Of all the systems submitted to MIREX ACE since the new evaluation standard was introduced, only one supports chord inversions Burgoyne et al. . Because root position chords (mainly root position triads) dominate the data, systems that do not support inversions can achieve relatively higher scores under the SeventhsBass evaluation than those that do Deng and Kwok . This is because an inversion is easily confused with the root position of the same chord. Since the vast majority of chords are in root position (the evaluation datasets contain mainly pop and rock music), a system without inversion support can only err in one direction (inversions misclassified as root positions), and such errors are much rarer than those in the other direction (root positions misclassified as inversions), which only a system with inversion support can make.
Musically speaking, an ACE system should distinguish root positions from inversions because their sound qualities are different in many musical contexts. For example, referring to the chord progressions in Figure 1, if a system does not support inversions, it breaks bass line continuations and thus alters the harmonies. As the ultimate goal of ACE is to implement music intelligence that matches human experts’ performance on chord recognition, support for a more sophisticated vocabulary with chord inversions should be considered. Consequently, in this paper, all proposed systems support exactly the “SeventhsBass” vocabulary and they all undergo the “SeventhsBass” evaluation. In this context, the large vocabulary refers to the set of “SeventhsBass” chords: maj, maj/3, maj/5, min, min/b3, min/5, maj7, maj7/3, maj7/5, maj7/7, 7, 7/3, 7/5, 7/b7, min7, min7/b3, min7/5, min7/b7, and N, i.e., 18 + 1 = 19 chord types in total.
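As a minimal sketch of this vocabulary, the chord types listed above can be enumerated and combined with the 12 roots. The root spellings and the "root:type" label syntax are assumptions made for illustration only and are not prescribed by the text.

```python
# Chord types of the SeventhsBass vocabulary as listed in the text.
CHORD_TYPES = [
    "maj", "maj/3", "maj/5",
    "min", "min/b3", "min/5",
    "maj7", "maj7/3", "maj7/5", "maj7/7",
    "7", "7/3", "7/5", "7/b7",
    "min7", "min7/b3", "min7/5", "min7/b7",
]
# Root spelling and the "root:type" syntax are assumptions of this sketch.
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
NO_CHORD = "N"


def seventhsbass_labels():
    """Return every root/type combination plus the no-chord label."""
    labels = [f"{root}:{ctype}" for root in ROOTS for ctype in CHORD_TYPES]
    labels.append(NO_CHORD)
    return labels


labels = seventhsbass_labels()
print(len(CHORD_TYPES) + 1)  # number of chord types including "N"
print(len(labels))           # number of distinct chord labels
```

With 18 chord types, 12 roots and one no-chord label, the enumeration yields 18 × 12 + 1 = 217 distinct labels.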
1.4 Contributions and Findings
Thus far, we have identified several research gaps. Firstly, the existing works have not considered a fundamental difference between ASR and ACE in regards to segmentation, which may lead to a possible design that considers segmentation and classification as two separate tasks. Secondly, the support for large vocabulary has been largely overlooked, particularly the support for chord inversions. Note that chord inversion is a crucial ingredient for pop and rock music, which is the primary focus of ACE research.
As a prelude to this work, recently we proposed a hybrid “chromagram extraction + deep neural network” system that classifies chords based on a pre-segmented sequence [reference made anonymous on purpose]. In this paper, we generalize this system as a deep learning based large vocabulary ACE (LVACE) design framework. The main contributions and findings are as follows:
we propose an LVACE system design framework using a combination of pre-segmentation techniques and deep neural nets;
we find that amongst all the investigated deep neural nets, the recurrent neural network performs the best in overall chord symbol recall, and significantly better than other systems in average chord quality accuracy;
we find that there is a glass ceiling that potentially hinders the progress of the current LVACE research.
The rest of the paper is organized as follows: Section 2 overviews the LVACE system framework; Sections 3 and 4 describe the system implementation under the proposed framework and explore the variations under a wide range of parameters; finally, Section 5 concludes the paper and discusses possible future LVACE research directions.
2 System Framework
Figure 2 depicts the LVACE system framework in our study. The workflow is as follows:
Segmentation: 1. for training, the feature sequence is segmented by the ground truth annotations; 2. for validation, the feature sequence is segmented using a GMM-HMM process (to be discussed in Section 2.2).
Segment tiling: each feature segment is tiled into a fixed number of sub-segments (see Section 2.3).
Deep neural nets: 1. for training, the segments and their chord labels are used to train the deep neural nets (will be described in Section 2.4); 2. for validation, the trained neural network is used to predict chord labels.
2.1 Feature Extraction
Feature extraction starts by resampling the raw audio input at 11025 Hz, which is followed by a short-time-Fourier-transform (STFT, 4096-point Hamming window, 512-point hop size). It then proceeds to transform the linear-frequency spectrogram (2049-bin) to a log-frequency spectrogram (252-bin, three bins per semitone ranging from MIDI note 21 to 104) using two cosine interpolation kernels Mauch [2010b]. The output at this step is a log-frequency-spectrogram, or log-spectrogram, $S_{k,m}$, where $k$ is the index of frequency bins, and $m$ is the index of time frames. We denote the total number of frames as $M$, and the total number of bins in each spectrum as $K$ (in this context $K$ = 252).
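The amount of deviation from standard tuning is estimated using the algorithm in Mauch and Dixon [2010a], where the amount of detuning $\delta$ is estimated as:
$$\delta = \frac{\mathrm{wrap}(-\varphi - 2\pi/3)}{2\pi},$$
where $\mathrm{wrap}$ is a function wrapping its input to $[-\pi, \pi)$. $\varphi$ is the phase angle at $2\pi/3$ of the discrete-Fourier-transform (DFT) of the time-averaged log-spectrogram $\frac{1}{M}\sum_{m} S_{k,m}$. The tuning frequency $\tau$ is then computed as:
$$\tau = 440 \cdot 2^{\delta/12},$$
and the original tuning is updated by interpolating the original spectrogram at $S_{k+p,m}$, where:
$$p = \log_2(\tau/440) \times 36.$$
The “36” indicates that there are 36 bins per octave (3 bins per semitone) in $S_{k,m}$. The updated log-spectrogram will be referred to as “notegram” in the following.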
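The tuning estimation can be illustrated with a short sketch. It assumes a time-averaged log-spectrum with 3 bins per semitone in which indices 1, 4, 7, ... (0-based) are the semitone centre bins, and it assumes the DFT sign convention $e^{-j\omega k}$; the function and variable names are hypothetical and this is not the reference implementation.

```python
import cmath
import math


def wrap(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def estimate_tuning(mean_spectrum, bins_per_semitone=3, ref_hz=440.0):
    """Estimate detuning (semitones), tuning frequency (Hz) and bin shift.

    mean_spectrum: time-averaged log-frequency spectrum; indices 1, 4, 7, ...
    are assumed to be the semitone centre bins (an assumption of this sketch).
    """
    omega = 2 * math.pi / bins_per_semitone
    # Single DFT coefficient at angular frequency 2*pi/3 (period of 3 bins).
    coeff = sum(v * cmath.exp(-1j * omega * k)
                for k, v in enumerate(mean_spectrum))
    phi = cmath.phase(coeff)
    delta = wrap(-phi - omega) / (2 * math.pi)
    tau = ref_hz * 2 ** (delta / 12)
    # Shift in bins: 36 bins per octave when there are 3 bins per semitone.
    shift_bins = math.log2(tau / ref_hz) * 12 * bins_per_semitone
    return delta, tau, shift_bins


# Energy only in the centre bins corresponds to standard tuning.
print(estimate_tuning([0.0, 1.0, 0.0] * 84))
```

Under these assumptions, moving all energy one bin upwards corresponds to a detuning of one third of a semitone, i.e., a shift of one bin.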
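To enhance harmonic content and attenuate background noise, a standardization process is performed along the frequency axis:
$$S'_{k,m} = \begin{cases} (S_{k,m} - \mu_{k,m}) / \sigma_{k,m}, & \text{if } S_{k,m} > \mu_{k,m} \\ 0, & \text{otherwise,} \end{cases}$$
where $\mu_{k,m}$ and $\sigma_{k,m}$ are the mean and standard deviation of a half-octave window centered at $S_{k,m}$, respectively. $S_{k,m}$ is then updated by $S'_{k,m}$.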
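A minimal sketch of such a running standardization on a single frame is given below. The window of 9 bins on each side of the centre bin (approximating half an octave at 36 bins per octave), the truncation of windows at the edges, and the clipping of below-mean values to zero are assumptions of this sketch.

```python
import statistics


def standardize_frame(spectrum, half_window=9):
    """Standardize one log-spectrum frame along the frequency axis.

    half_window=9 bins on each side approximates a half-octave window at
    36 bins per octave; windows are truncated at the edges (assumption).
    """
    out = []
    for k, value in enumerate(spectrum):
        lo = max(0, k - half_window)
        hi = min(len(spectrum), k + half_window + 1)
        window = spectrum[lo:hi]
        mu = statistics.fmean(window)
        sigma = statistics.pstdev(window)
        if sigma > 0 and value > mu:
            out.append((value - mu) / sigma)
        else:
            # Values at or below the local mean are set to zero (assumption).
            out.append(0.0)
    return out


frame = [0.0] * 40
frame[20] = 1.0  # a single spectral peak
print(standardize_frame(frame)[20])
```

An isolated peak is emphasized relative to its neighbourhood, while flat regions map to zero.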
This is followed by a non-negative least square (NNLS) method to extract a sequence of note activation patterns Mauch and Dixon [2010c] from the log-spectrogram. Concretely, assume each note activation pattern has $Q$ bins (in this context $Q$ = 84), the log-spectrum $S_{\cdot,m}$ of frame $m$ can be expressed as:
$$S_{\cdot,m} \approx E x_m,$$
where $E$ is a $K \times Q$ matrix, which is a dictionary of note harmonic series profiles, and $x_m$ is the note activation pattern to be fitted by the algorithm. The entries of $E$ follow a geometrically declining overtone series Gómez  of length $H$:
$$a_h = s^{h-1}, \quad h = 1, \dots, H,$$
where $a_h$ is the amplitude of the $h$-th partial of a tone on the original frequency axis, and $s$ is a declining factor controlling the steepness of the partials’ envelope. A large $s$ means a slower decline. Normally $s$ is chosen from a fixed sub-interval of $(0, 1)$. The NNLS algorithm Lawson and Hanson  is used to find an $x_m$ that minimizes the difference between $S_{\cdot,m}$ and $E x_m$. The output of this process is usually called NNLS chromagram, or NNLS matrix (84-bin, 1 bin per semitone).
The feature dimension is further reduced before the segmentation process. Particularly, each NNLS chroma is weighted by the bass and treble profiles depicted in Figure 3. After that the saliences of each pitch class are added together, resulting in a 24-bin bass-treble chromagram. Each column of the bass-treble chromagram is then normalized, so that each bin of each chroma is within the range of [0,1].
Table 1 provides a summary of the different levels of representations generated by this feature extraction process. In this paper, we mainly make use of two features from the above process: the notegram and the bass-treble chromagram (simply referred to as “chromagram” in the following). In the experiment section, we sometimes also refer to notegram as “-ns”, and chromagram as “-ch”. Figure 4 summarizes the full information flow of the above feature extraction and segmentation process using the first line of “Let It Be” as input.
|Bass-treble Profiling||(bass-treble) chromagram||24|
2.2 Segmentation
The segmentation process is implemented using a GMM-HMM, which is characterized as follows:
The hidden node models the categorical states of chords. In the SeventhsBass implementation, there are 217 states in total (1 state per chord), obtained by multiplying the number of chord types (18) by the number of chord roots (12) and adding the single “N.C.” chord type (18 × 12 + 1 = 217).
The observable node represents a chroma. It is a 24-dimensional Gaussian node connecting to the bass-treble chromagram.
The transition matrix has heavy uniform self-transition weights, which are 99.99 times the uniform non-self-transition weight.
|Bass - chord bass||1||0.1|
|Bass - not chord bass and is chord note||1||0.5|
|Bass - neither chord bass nor chord note||0||0.1|
|Treble - chord note||1||0.2|
|Treble - not chord note||0||0.2|
|No Chord (for all notes)||1||0.2|
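The transition structure described above can be sketched as follows. Normalizing each row to sum to one is an assumption of this sketch (the text only specifies the ratio between self-transition and non-self-transition weights), and the function name is hypothetical.

```python
def build_transition_matrix(n_states=217, self_ratio=99.99):
    """Build a row-stochastic transition matrix in which the self-transition
    weight is `self_ratio` times the uniform weight of any other transition."""
    row_sum = self_ratio + (n_states - 1)
    off_diagonal = 1.0 / row_sum
    diagonal = self_ratio / row_sum
    return [[diagonal if i == j else off_diagonal for j in range(n_states)]
            for i in range(n_states)]


A = build_transition_matrix()
print(A[0][0], A[0][1])
```

A strong preference for staying in the same state discourages spurious chord changes between adjacent frames, which is what allows the decoded state sequence to serve as a segmentation.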
2.3 Segment Tiling
The segment tiling process is introduced to equalize the length of every segment, so as to enable neural networks with fixed-length input. This process divides a segment into $N$ equal-sized sub-segments, and takes a frame-average within each sub-segment, resulting in an $N$-frame segment (referred to as $N$seg in the following, where $N$ is a variable). If the original number of frames is not divisible by $N$, the last frame is extended to make it divisible. In short, this process turns a segment with a variable number of frames into a segment with a fixed number of frames.
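A minimal sketch of segment tiling is shown below, assuming that a segment is given as a list of equal-length feature frames and that "extending the last frame" means repeating it until the frame count is divisible by the number of tiles. The function name is hypothetical.

```python
def tile_segment(frames, n_tiles):
    """Turn a variable-length list of feature frames into `n_tiles` frames.

    frames: list of equal-length feature vectors (lists of floats).
    """
    if not frames:
        raise ValueError("segment must contain at least one frame")
    padded = list(frames)
    # Repeat the last frame until the length is divisible by n_tiles.
    while len(padded) % n_tiles != 0:
        padded.append(frames[-1])
    size = len(padded) // n_tiles
    tiled = []
    for t in range(n_tiles):
        chunk = padded[t * size:(t + 1) * size]
        # Frame-average within the sub-segment, one value per feature bin.
        tiled.append([sum(column) / size for column in zip(*chunk)])
    return tiled


segment = [[float(v)] for v in range(1, 8)]  # 7 one-dimensional frames
print(tile_segment(segment, 3))
```

In this example the 7 frames are padded to 9 frames and averaged in groups of 3, so the result always has exactly 3 frames regardless of the original segment length.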
2.4 Deep Neural Nets
Each $N$seg is classified as a chord label by a deep neural net. Three types of deep neural nets are considered here: the fully-connected neural network (FCNN), the deep belief network (DBN), and the recurrent neural network (RNN).
2.4.1 Fully-connected Neural Network
2.4.2 Deep Belief Network
The DBN is implemented based on the FCNN. Its multiple hidden layers have sigmoid activations instead of ReLUs.
In the pre-training phase, every pair of adjacent layers (except for the output layer) is trained one pair at a time as a restricted Boltzmann machine (RBM) Hinton et al. . In our implementation, the RBM formed by the input layer and the first hidden layer is a Gaussian-Bernoulli RBM, because the input $N$seg feature contains real numbers. The RBMs formed by the hidden layer pairs are Bernoulli-Bernoulli RBMs, because each neuron is stochastic binary Hinton and Salakhutdinov .
In the fine-tuning phase, the network is regarded as a feedforward neural network and trained via stochastic gradient descent with back-propagation.
2.4.3 Recurrent Neural Network
The RNN (Figure 5) is configured with bidirectional long-short-term-memory (LSTM) hidden units Hochreiter and Schmidhuber  (BLSTM-RNN). Figure 6 shows the structure of a basic LSTM unit Graves  in the context of a simple logistic regression. The input data path has 4 identical copies, feeding the input gate, the output gate, the forget gate and the input port. Each gate or port will activate an output between 0 and 1 according to its input and activation function. The input gate activation is multiplied with the input port activation to become an input value to the LSTM cell. The forget gate activation is multiplied with the cell value from the previous time step to become another input to the cell. The current cell value is determined by the sum of these two cell inputs. The output of the unit is given by the multiplication of the output gate activation and the current cell value. Note that in some configurations the cell value can be fed back into the gates.
LSTM can relieve the vanishing gradient problem in long sequence training Bengio . In our LSTM implementation, all gates employ sigmoid activations, while both the cell and output neuron use hyperbolic tangent activations. For a fixed-length $N$seg input, the RNN is unrolled into $N$ slices, each handling one input frame. A mean pooling operation is added before the output layer to summarize the LSTM outputs.
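To make the data flow concrete, the following sketch implements a single-unit (scalar) LSTM step, a forward and a backward pass over a sequence, and mean pooling of the outputs. It follows a common LSTM formulation (sigmoid gates, tanh input port, tanh applied to the cell value at the output) without feeding the cell value back into the gates; the parameter layout and all names are assumptions of this sketch rather than the configuration used in the experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM.

    p maps "i", "f", "o", "g" to (input weight, recurrent weight, bias).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate
    f = sigmoid(pre("f"))      # forget gate
    o = sigmoid(pre("o"))      # output gate
    g = math.tanh(pre("g"))    # input port
    c = i * g + f * c_prev     # current cell value: sum of the two cell inputs
    h = o * math.tanh(c)       # unit output
    return h, c


def run_direction(xs, p):
    """Run the unit over a sequence and collect the output of every frame."""
    h, c, outputs = 0.0, 0.0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs


def blstm_mean_pool(xs, p_fwd, p_bwd):
    """Mean-pool the forward and backward outputs over all frames."""
    fwd = run_direction(xs, p_fwd)
    bwd = run_direction(xs[::-1], p_bwd)[::-1]
    n = len(xs)
    return sum(fwd) / n, sum(bwd) / n


zero = {name: (0.0, 0.0, 0.0) for name in "ifog"}
print(lstm_step(0.3, 0.0, 1.0, zero))
```

In a full network each direction has many such units operating on vector-valued frames, and the pooled outputs are passed to the output layer.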
3 Experiments
In this section, we describe a systematic approach to explore and evaluate different system variants of the LVACE framework. We first introduce the datasets, then elaborate the experimental setup, and finally discuss the training and cross-validation (CV) process.
3.1 Datasets
For training/cross-validation, we use six datasets of 546 tracks in total. They contain both eastern and western pop/rock songs. They are:
29 tracks from the JayChou dataset (JayChou29, or J) Deng and Kwok ;
20 tracks from a Chinese pop song dataset (CNPop20, or C) (http://www.tangkk.net/label/);
26 tracks from the Carole King + Queen dataset (K) (http://isophonics.net/datasets);
191 tracks from the USPop dataset (U) (https://github.com/tmc323/Chord-Annotations);
100 tracks from the RWC dataset (R) (https://staff.aist.go.jp/m.goto/RWC-MDB/);
180 tracks from the TheBeatles180 dataset (B) Harte .
The datasets are notated by their letter codes. For example, the combination of all datasets is denoted as “CJKURB”.
Both chromagram (-ch) and notegram (-ns) are extracted from each track. Both of them can be transposed to all 12 different keys by circular pitch shifting (for -ch) or pitch shifting with zero padding (for -ns). For example, a piece of treble chromagram in a given key can be represented as:
$$(s_0, s_1, \dots, s_{11}),$$
where $s_p$ stands for the salience of pitch class $p$. It can be circularly shifted to represent an equivalent PCP in other keys, such as the key $t$ semitones higher:
$$(s_{12-t}, \dots, s_{11}, s_0, \dots, s_{11-t}).$$
As for the notegram, although we have pitch saliences instead of pitch class saliences, the same “pitch shifting” idea can still be applied, given that the bins vacated by the shift are filled with zeros.
In practice, the original key is considered as a pivot, and the features are shifted to all 12 keys (the amount of transposition ranging from -6 to 5 semitones). With the ground truth chord labels adjusted accordingly, this results in a 12-fold data augmentation, which helps in reducing over-fitting Cho , Humphrey .
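The two kinds of pitch shifting can be sketched as follows. The layout of the 24-bin chroma (12 bass bins followed by 12 treble bins), the representation of chord roots as pitch class numbers, and all function names are assumptions of this sketch.

```python
def roll(values, steps):
    """Circularly shift a list to the right by `steps` positions."""
    steps %= len(values)
    return values[-steps:] + values[:-steps] if steps else list(values)


def transpose_chroma(chroma24, semitones):
    """Shift a 24-bin bass-treble chroma; the bass and treble halves are
    rolled separately (layout of 12 + 12 bins is an assumption)."""
    return roll(chroma24[:12], semitones) + roll(chroma24[12:], semitones)


def transpose_notegram(frame, semitones, bins_per_semitone=3):
    """Shift a notegram frame and fill the vacated bins with zeros."""
    shift = semitones * bins_per_semitone
    if shift > 0:
        return [0.0] * shift + frame[:-shift]
    if shift < 0:
        return frame[-shift:] + [0.0] * (-shift)
    return list(frame)


def transpose_root(root_pc, semitones):
    """Adjust a chord root given as a pitch class number (0-11)."""
    return (root_pc + semitones) % 12


SHIFTS = range(-6, 6)  # -6 ... 5 semitones: 12 versions including the original

chroma = [float(i) for i in range(24)]
augmented = [transpose_chroma(chroma, t) for t in SHIFTS]
print(len(augmented))
```

Each transposed feature must be paired with the correspondingly transposed chord label, while chord types and the no-chord label remain unchanged.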
3.2 Experimental Setup
Under the proposed LVACE framework, possible design choices are:
type of deep neural nets
depth and width of hidden layers (network configurations)
number of frames in segment tiling
input feature representations
amount of training data
Our study is based on the settings depicted in Table 3. For naming conventions: a combination of layer width and depth is denoted as [width*depth], such as [800*2]; a segment tiling scheme is denoted as $N$seg, such as 6seg; a point in this six-dimensional hyper-parameter space is denoted by concatenating each parameter with “-”, such as FCNN-6seg-[800*2]-ch-JKU. The space can be explored by parameter sweeping along a given dimension. Particularly, we will first explore along the layer width and layer depth. We then explore the segment tiling scheme with fixed layer width and layer depth. Following the same strategy, we explore all factors in Table 3. This process does not search the whole hyper-parameter space. However, it can give us some insights into the proposed LVACE framework and produce some good system variants as well.
|neural net||FCNN; DBN; BLSTM-RNN|
|segment tiling||1; 2; 3; 6; 9; 12 (seg)|
|layer depth||2; 3; 4|
|layer width||500; 800; 1000|
|input feature||notegram (-ns); chromagram (-ch)|
|amount of training data||JK; JKU; JKUR; JKURB|
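The naming convention and the one-dimension-at-a-time sweep can be sketched as follows, using the values of Table 3. The helper names and the default values held fixed during the sweep are assumptions made for illustration.

```python
from itertools import product


def model_name(net, n_seg, width, depth, feature, data):
    """Build a model identifier following the naming convention."""
    return f"{net}-{n_seg}seg-[{width}*{depth}]-{feature}-{data}"


def sweep_network_config(net="FCNN", n_seg=6, feature="ch", data="JKU"):
    """Sweep layer width and depth while holding the other factors fixed."""
    widths, depths = (500, 800, 1000), (2, 3, 4)
    return [model_name(net, n_seg, w, d, feature, data)
            for w, d in product(widths, depths)]


for name in sweep_network_config():
    print(name)
```

Sweeping one or two dimensions at a time keeps the number of trained models manageable, at the cost of not covering the full grid.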
In this context we regard a “model” as a crossing point of all dimensions in Table 3, including training data size. We regard a “system” as a full implementation of the LVACE framework, including the feature extraction, segmentation and deep neural nets. However, since all models share the same feature extraction and segmentation processes, we sometimes use the terms “model” and “system” interchangeably.
3.3 Training and Cross-validation
The following training procedures are applied throughout the experiments:
Each BLSTM-RNN is trained using an Adadelta optimizer Zeiler , regularized with dropout and early-stopping.
All mini-batch stochastic gradient descents use a learning rate of 0.01 and a batch size of 100.
All early-stopping criteria are monitored using the validation error on the CNPop20 dataset, which is not in any training set. The model with the lowest validation loss is saved, and if the current validation loss is smaller than 0.996 times the lowest one, the early-stopping patience is increased by the current number of iterations. Training stops when the patience is less than the current number of iterations.
All dropout rates are set to 0.5.
Five-fold cross-validation (CV) is performed throughout all experiments. Each fold is a combination of approximately 1/5 of the tracks of each dataset. Every model is trained on four folds and cross-validated on the remaining fold, resulting in five training/cross-validation scores, the average of which is the final score to be reported.
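The patience-based early-stopping rule described above can be sketched as follows, operating on a pre-computed list of validation losses. The initial patience value is a hypothetical choice, since the text does not state it, and the function name is an assumption.

```python
def early_stopping_run(val_losses, initial_patience=5,
                       improvement_threshold=0.996):
    """Simulate patience-based early stopping over a list of validation losses.

    Returns (iteration of the saved model, last iteration run, final patience).
    initial_patience is a hypothetical value chosen for this sketch.
    """
    best_loss = float("inf")
    best_iter = None
    patience = initial_patience
    last_iter = 0
    for it, loss in enumerate(val_losses, start=1):
        last_iter = it
        if loss < best_loss:
            # A sufficiently large improvement extends the patience
            # by the current number of iterations.
            if loss < improvement_threshold * best_loss:
                patience += it
            best_loss = loss   # the model with the lowest loss is saved
            best_iter = it
        if patience < it:
            break              # patience exhausted: stop training
    return best_iter, last_iter, patience


print(early_stopping_run([1.0, 0.9] + [0.95] * 20))
```

In this example the loss stops improving after the second iteration, so training halts once the iteration count exceeds the accumulated patience.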
We provide all implementation details including the training and cross-validation scripts online (https://github.com/tangkk/tangkk-mirex-ace), so that interested readers can repeat the experiments when they have access to the raw audio datasets.
4 Results and Discussions
Throughout this section, we use the MIREX ACE standard evaluation metric, weighted chord symbol recall (WCSR), to report system performances, where the “chord symbol recall” (CSR) is defined as follows:
$$\mathrm{CSR} = \frac{|S \cap S^*|}{|S^*|},$$
where $S$ and $S^*$ represent the automatically estimated segments and the ground truth annotated segments, respectively, and the intersection of $S$ and $S^*$ consists of the parts where they overlap and have equal chord annotations. WCSR is the weighted average of all tracks’ CSRs by the lengths of these tracks:
$$\mathrm{WCSR} = \frac{\sum_i \mathrm{Length}(\mathrm{Track}_i) \cdot \mathrm{CSR}_i}{\sum_i \mathrm{Length}(\mathrm{Track}_i)},$$
where the subscript $i$ denotes the track number. Unless otherwise specified, we report all WCSR scores under the SeventhsBass evaluation, where a correct classification does not involve any chord mapping scheme beyond the SeventhsBass vocabulary Pauwels and Peeters . We use the MusOOEvaluator tool (https://github.com/jpauwels/MusOOEvaluator) to generate these scores from all the ground truth and predicted chord sequences.
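As a rough illustration of the metric (not a replacement for the MusOOEvaluator tool), CSR and WCSR can be computed from labelled segments as follows. Segments are assumed to be (start, end, label) tuples that do not overlap within a sequence, labels are compared as plain strings without any chord mapping, and the function names are hypothetical.

```python
def overlap_correct(est, ref):
    """Total duration where estimated and reference labels agree."""
    total = 0.0
    for e_start, e_end, e_label in est:
        for r_start, r_end, r_label in ref:
            if e_label == r_label:
                total += max(0.0, min(e_end, r_end) - max(e_start, r_start))
    return total


def csr(est, ref):
    """Chord symbol recall of one track."""
    ref_len = sum(end - start for start, end, _ in ref)
    return overlap_correct(est, ref) / ref_len


def wcsr(tracks):
    """Length-weighted CSR; tracks is a list of (est, ref) pairs."""
    num = den = 0.0
    for est, ref in tracks:
        ref_len = sum(end - start for start, end, _ in ref)
        num += ref_len * csr(est, ref)
        den += ref_len
    return num / den


ref_a = [(0, 2, "C:maj"), (2, 4, "G:maj")]
est_a = [(0, 3, "C:maj"), (3, 4, "G:maj")]
ref_b = [(0, 4, "A:min/b3")]
est_b = [(0, 4, "A:min")]
print(csr(est_a, ref_a), wcsr([(est_a, ref_a), (est_b, ref_b)]))
```

In the second track the estimate misses the inversion, so under an exact-match comparison it contributes no correct duration.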
The WCSR is upper-bounded by the segmentation quality, which is computed from the directional Hamming distance (DHD). The DHD from $S$ to $S^*$ is:
$$h(S \| S^*) = \sum_{i} \left( |S_i| - \max_j |S_i \cap S^*_j| \right),$$
where the subscript $i$ indicates the segment. Note that the distance is not commutative, which means $h(S \| S^*)$ and $h(S^* \| S)$ represent two different distances. Conventionally, $h(S \| S^*)$ measures under-segmentation and $h(S^* \| S)$ measures over-segmentation. In either case, a good segmentation is indicated by a small value. When reported as scores, they are usually normalized by the lengths of the tracks and subtracted from 1, so that they share the range of the WCSR score Harte :
$$1 - \frac{1}{T} \max\{ h(S \| S^*),\, h(S^* \| S) \},$$
where $T$ is the length of the track.
Note that we do not report the segmentation score for every system, because they all share the same GMM-HMM process described in Section 2.2. This process has a segmentation score of 83% on the JKURB dataset.
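The segmentation score can be sketched as below, assuming that each segmentation is a list of (start, end) intervals covering the same track and that the score combines both directions by taking the larger distance. The function names are hypothetical.

```python
def dhd(a, b):
    """Directional Hamming distance from segmentation a to segmentation b:
    for every segment in a, the duration not covered by its best-matching
    segment in b."""
    total = 0.0
    for a_start, a_end in a:
        best = max(max(0.0, min(a_end, b_end) - max(a_start, b_start))
                   for b_start, b_end in b)
        total += (a_end - a_start) - best
    return total


def segmentation_score(est, ref):
    """1 minus the larger of the two normalized directional distances."""
    duration = sum(end - start for start, end in ref)
    return 1.0 - max(dhd(est, ref), dhd(ref, est)) / duration


ref = [(0, 2), (2, 4)]
est = [(0, 4)]  # under-segmented estimate: one boundary is missed
print(dhd(est, ref), dhd(ref, est), segmentation_score(est, ref))
```

Here the distance taken over the estimated segments is non-zero while the opposite direction is zero, which reflects that the estimate is under-segmented rather than over-segmented.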
In the following discussion, we will analyze the experiment results from a bias-variance perspective Geman et al. . Assuming that a model is trained multiple times over different samples of the same population, with other settings remaining unchanged, the model’s prediction for a given input will become a random variable. The model’s prediction error can thus be expressed as Friedman et al. :
$$\mathrm{Error} = \mathrm{Bias}^2 + \mathrm{Variance} + \mathrm{Irreducible\ Error}.$$
Concretely, a model’s bias is defined as the expected difference between its prediction and the ground truth. It measures how much a model’s predictions consistently deviate from the true value, and it indicates whether a model contains fundamentally incorrect assumptions. A model’s variance, on the other hand, measures the variance (i.e. the statistical variance, which equals the square of the standard deviation) of the model’s prediction. It indicates how much a model’s predictions will vary across its different realizations with different training samples, or equivalently, how inconsistent the predictions are. Finally, the irreducible error term can be seen as the collection of everything that is not bias or variance, such as the noise or inconsistencies in the data itself.
Bias and variance are highly correlated with under-fitting and over-fitting. A high bias model tends to under-fit the data, while a high variance model tends to over-fit the data. A model that neither over-fits nor under-fits will have low bias and low variance. The amount of bias and variance can be approximated from a model’s training and validation (or cross-validation) scores:
A model with high bias (under-fitting) has similar training and validation scores, but none of them is high.
A model with high variance (over-fitting) has a high training score and a low validation score.
In other words, a model’s bias can be approximated as the value of its training or validation error if the two errors are close to each other, and a model’s variance can be approximated as the difference between its training and validation errors. Note that the irreducible error could either appear as bias or variance.
In the following, we will examine how each design choice in Table 3 actually affects the systems’ biases and variances. Moreover, we also compare among different types of deep neural nets and a baseline model, and see which one performs the best in the LVACE task.
4.1 Network configurations
Figure 7 shows the WCSRs of a set of JKU-6seg models with different neural nets, network configurations and input features.
-ch models — The FCNN has local maximal validation scores when the network has two layers, and it performs worse as network becomes deeper. The DBN’s validation scores are stabilized around 50. In this group of experiments, the training and validation scores are close to each other.
-ns models — The FCNN’s validation scores are all focused around 50. As for the DBN, there is a trend of performance downgrade along the depth dimension. In both cases the differences between training and validation scores are very small.
Remarks — Firstly, all the FCNN-ch models outperform the FCNN-ns models. This could be largely due to the prior imposed by the chromagram feature. It embeds the knowledge about “pitch classes”, and it is originally designed for chord recognition tasks. On the other hand, because the notegram is several feature transformations away from the chromagram, it contains no such prior information. This could explain why the FCNN-ch models perform worse as they become deeper. Every extra layer will tend to weaken the prior at the input, and at the same time, these layers try to learn some other regularities that map the chromagrams to the chord labels. The results show that, unfortunately, the deeper networks are unable to learn more useful regularities than the prior knowledge already contained in the chromagram.
Secondly, all the DBN-ns models outperform the DBN-ch models (except for the one at [1000*4]). Note that the only difference between the DBN and the FCNN is the generative pre-training process, which in effect is a strong regularization process that prevents over-fitting. This is sometimes equivalent to increasing the model’s bias or decreasing the model’s variance. As shown in Figure 7, since the variances of the FCNN-ch models are already small, the DBN-ch models perform worse than the FCNN-ch models due to the higher biases. However, since the FCNN-ns models have high variances, the DBN-ns models perform better than the FCNN-ns because of the lower variances.
Thirdly, the performance downgrade of the DBN-ns models starting from [800*3] and [1000*2] could be caused by the well-known gradient vanishing problem in deep networks (with sigmoid activations). This could be verified by monitoring the weight updating process. When the gradient vanishing happens, the average amount of weight updates closer to the input will be much less than those closer to the output, resulting in more errors in the earlier layers, which will be aggregated through the feedforward path to the output.
4.2 Segment tiling
Figure 8 shows the WCSRs of a set of JKU-[800*2] models with different neural nets, segment tiling schemes ($N$seg) and input features. Note that [800*2] in BLSTM-RNN means there is one forward and one backward hidden layer, each having 800 LSTM units.
-ch models: The FCNN tends to perform worse with larger $N$ (the number of tiled frames per segment); the DBN tends to perform better with larger $N$; the BLSTM-RNN’s performance grows gently when $N$ is less than 3, and remains relatively constant thereafter. They all have very small variances.
-ns models: The FCNN still tends to perform worse with larger $N$; both the DBN and the BLSTM-RNN perform worst when $N$ is 1, and they have higher and stable performances when $N$ is greater than 1.
Remarks: With a large $N$, a model becomes more complex because of a less blurry input, so that one could expect either less bias, or more variance. In the FCNN models we could observe a tendency of slightly increasing variance. This tendency has possibly offset the trend of decreasing bias, which we could not see from Figure 8.
For the DBN, as discussed in Section 4.1, the generative pre-training process could reduce the variances or increase the biases. This is clearly reflected in Figure 8 if we compare the DBN’s WCSRs with the FCNN’s. On one hand, this could explain why the DBN-ch models have an increasing trend of performances (as well as a decreasing trend of biases) with a larger $N$, because the DBN-ch model has a much higher bias than the FCNN-ch one when $N$ equals 1. On the other hand, it could also explain why there is a performance boost from $N = 1$ to $N = 2$ in the DBN-ns models, and why the DBN-ns models have consistently lower variances and higher performances than the FCNN-ns models.
For the BLSTM-RNN models, the training and CV curves are much more spread out than those of the FCNN and the DBN. On one hand, the RNN imposes a weight sharing mechanism across the segment tiling frames. This has an effect of regularization by limiting the number of parameters connecting the input layer to the hidden layer, thus limiting the network’s ability to recognize arbitrary dependencies across frames. On the other hand, the RNN also introduces a set of recurrent weights that connect each frame to its next frame. This makes the network more flexible in capturing sequential dependencies between frames. This explains why the training and CV curves are so separated. Particularly, the average training scores of the RNN models are much higher than those of the DBN’s and the FCNN’s, because the RNN is essentially biased towards problems with sequential natures, and the ACE is one of these problems. Still, we have relatively low CV scores in these RNN models, which lead to high variances. As we will see in the following subsection, this can be remedied by more training data.
4.3 Amount of training data
Figure 9 shows the WCSRs of a set of 6seg-[800*2] models with different neural nets, training data sizes and input features.
For both -ch and -ns models: In all three plots, there are clear trends that increasing the amount of data boosts the models’ performances. While the variances of the FCNN and DBN models remain small, the variances of the BLSTM-RNN models tend to decrease as the amount of data increases. Interestingly for the FCNN, the increase of data from JKUR to JKURB leads to larger variances and worse CV scores.
Remarks — Similar to Figure 8, the FCNN’s and DBN’s training and CV curves as shown in Figure 9 are still very close to each other. It seems that as the amount of data increases, their models have saturated at some point and there is little room for further improvement. On the contrary, the BLSTM-RNN’s training and CV curves are much wider apart, and the models tend to generalize better as the data size grows. The results in Figure 9 seem to suggest that we have almost touched a performance “ceiling” of the BLSTM-RNN-ch models, but we have yet to reach that of the BLSTM-RNN-ns models.
4.4 Input feature
From Figures 7 to 9, we see that the training and CV curves of the chromagram models are on average much closer than those of the notegram models. This suggests that the prior knowledge contained in the chromagram feature actually introduces bias in the model. On one hand, this may lead to better models if the amount of training data is limited (this discussion is not valid for the DBN because of the generative pre-training process). On the other hand, it can also limit the models’ improvement when a sufficient amount of training data is available. For example, Figure 9 shows a potential trend: if more training data is added, the BLSTM-RNN-ns model will eventually outperform the BLSTM-RNN-ch model, because the BLSTM-RNN-ns has a higher ceiling (the training score) than the BLSTM-RNN-ch.
4.5 Balanced performance
The above discussions focus on the overall WCSRs of different models. Here we examine the models’ performances on specific chords. Note that in our datasets (which we believe are good representatives of pop and rock music in general), the chord distributions are highly skewed (as shown in Table 4): the maj and min triads make up almost 70% of the whole sample population, the maj7, min7 and 7 chords constitute more than 20%, and the remaining chords account for less than 10%. In the following discussion, we refer to “common chords” as the maj and min chords, “uncommon chords” as the seventh chords, including the maj7, min7 and 7 chords, and “long-tail chords” as all the other chords in the SeventhsBass vocabulary. Moreover, we use “chord” and “chord type” interchangeably to refer to a certain type of chord. We report system performance on chords using the per chord WCSR:
where the subscript denotes the instance of the chord within the data set.
Figure 10 shows how different deep neural net models perform on different chords. It is surprising to see that the FCNN and the DBN outperform the BLSTM-RNN only in the maj chord category, while the BLSTM-RNN outscores the other two by large margins in most long-tail chords and uncommon chords categories.
Furthermore, we examine the versatilities of different deep neural net models. We measure them using the “Average Chord Quality Accuracy” (ACQA) Cho , which averages the WCSRs of all chords with equal weights:
Models that over-fit a few chord types tend to give lower ACQAs, while those well-balanced ones will have higher ACQAs. As shown in Figure 11, the average ACQA of the BLSTM-RNN models outscores the average ACQAs of the other two types of models by around 10 points.
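To make the relation between the per chord WCSR and the ACQA concrete, the following is a minimal sketch rather than the evaluation code used in our experiments. It assumes that each evaluated segment is given as a hypothetical (reference chord type, estimated chord type, duration) tuple, and that the per chord WCSR is the correctly labeled duration of a chord type divided by its total duration. The function names and the toy data are illustrative only.

```python
from collections import defaultdict


def per_chord_wcsr(segments):
    """Return {chord_type: WCSR in percent}.

    `segments` is an iterable of (ref_type, est_type, duration) tuples;
    this input layout is an assumption made for illustration.
    """
    correct = defaultdict(float)
    total = defaultdict(float)
    for ref, est, dur in segments:
        total[ref] += dur          # duration annotated as this chord type
        if est == ref:
            correct[ref] += dur    # duration that was also estimated correctly
    return {c: 100.0 * correct[c] / total[c] for c in total}


def acqa(wcsr_by_chord):
    """Average the per chord WCSRs with equal weights."""
    return sum(wcsr_by_chord.values()) / len(wcsr_by_chord)


def overall_wcsr(segments):
    """Duration-weighted recall over all segments, regardless of chord type."""
    total = sum(dur for _, _, dur in segments)
    correct = sum(dur for ref, est, dur in segments if ref == est)
    return 100.0 * correct / total


# Toy data (hypothetical): a system that is good at maj and min
# but never recognizes the inversion maj/3.
segments = [
    ("maj", "maj", 8.0),
    ("maj", "min", 2.0),
    ("min", "min", 3.0),
    ("min", "maj", 1.0),
    ("maj/3", "maj", 2.0),
]

scores = per_chord_wcsr(segments)
print(scores)                   # per chord WCSR for maj, min and maj/3
print(overall_wcsr(segments))   # dominated by the common chords
print(acqa(scores))             # penalized by the missed long-tail chord
```

In this toy case the overall WCSR stays well above the ACQA, because the missed long-tail chord carries little duration but counts as a full class in the equal-weight average.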
We perform a Friedman test Friedman  on the track-wise ACQA results. After that we use the Tukey HSD (honest significant difference) Tukey  to perform a multiple comparison test on the Friedman test’s statistics with a significance level of 0.05. As shown in Figure 12, both BLSTM-JKURB-ch-6seg-800 and BLSTM-JKURB-ns-6seg-800 are significantly better (no overlap of confidence intervals) than the other systems, and BLSTM-JKURB-ch-6seg-800 is significantly better than BLSTM-JKURB-ns-6seg-800 as well. We therefore conclude that the BLSTM-RNN models are significantly better than the FCNN and the DBN models in terms of ACQAs, or balanced performances.
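The ranking step behind the Friedman test can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the statistical tooling used for Figure 12: it ranks the systems within each track, sums the ranks per system, and computes the standard Friedman statistic without tie correction. The Tukey HSD multiple comparison step is not reproduced here, and the score matrix is made up for illustration.

```python
def average_ranks(row):
    """Rank the scores of one track (1 = lowest), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda j: row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def friedman_statistic(scores):
    """Return (chi-square statistic, mean rank per system).

    scores[t][s] is the score of system s on track t.
    No correction for ties is applied in this sketch.
    """
    n = len(scores)       # number of tracks (blocks)
    k = len(scores[0])    # number of systems (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        for s, r in enumerate(average_ranks(row)):
            rank_sums[s] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) \
        - 3.0 * n * (k + 1)
    return chi2, [r / n for r in rank_sums]


# Hypothetical track-wise ACQA scores for three systems on four tracks.
scores = [
    [50.0, 52.0, 60.0],
    [48.0, 47.0, 58.0],
    [55.0, 56.0, 66.0],
    [40.0, 45.0, 52.0],
]
chi2, mean_ranks = friedman_statistic(scores)
print(chi2, mean_ranks)
```

The statistic is then compared against a chi-square distribution with k - 1 degrees of freedom, and the mean ranks are what a subsequent multiple comparison procedure operates on.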
Now we have concrete evidence that the BLSTM-RNN is a better neural network for solving the LVACE problem than the other two models. It is reasonable to think that the BLSTM-RNN regards its input as a sequence of frames, while fully-connected networks (in this context the FCNN and DBN) regard their inputs as flat vectors. Therefore, while the BLSTM-RNN tries to look for regularities within each pair of consecutive frames along the time direction, the FCNN or DBN would search for regularities within every point of the flat vector as if they were not time related at all. Another perspective is that, between the input layer and the first hidden layer, the BLSTM-RNN has only a fraction of the weights of the FCNN and the DBN. Thus the weight sharing over multiple frames prevents the BLSTM-RNN from over-fitting, and allows the model to process higher resolution inputs without an increase in parameters.
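The weight sharing argument can be illustrated with a simple count of weights. The sketch below is a simplification under explicit assumptions: it compares a fully-connected first layer over a flattened input of n tiled frames with a plain recurrent layer whose input weights are shared across frames, and it ignores biases, LSTM gates and the second direction of a bidirectional layer. The feature dimension of 24 is a hypothetical value chosen for illustration; 6 frames and 800 hidden units follow the 6seg-800 naming.

```python
def flat_input_weights(n_frames, feat_dim, hidden):
    """Input-to-hidden weights of a fully-connected layer on a flat vector."""
    return n_frames * feat_dim * hidden


def shared_input_weights(feat_dim, hidden):
    """Input-to-hidden weights of a recurrent layer, shared by all frames."""
    return feat_dim * hidden


def recurrent_weights(hidden):
    """Hidden-to-hidden weights connecting each frame to the next."""
    return hidden * hidden


n_frames, feat_dim, hidden = 6, 24, 800   # feat_dim is an assumed value

flat = flat_input_weights(n_frames, feat_dim, hidden)
shared = shared_input_weights(feat_dim, hidden)
recur = recurrent_weights(hidden)

print(flat, shared, recur)
print(flat // shared)   # equals n_frames: the input weights shrink n-fold
```

Doubling the number of tiled frames doubles `flat` but leaves `shared` and `recur` unchanged, which is the sense in which a recurrent layer can take higher resolution inputs without more parameters. The recurrent weights themselves are additional parameters, which is consistent with the greater flexibility of the RNN noted in Section 4.2.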
In some cases, the fully-connected network is more efficient given that the input feature has already encoded certain prior information about music (e.g. chromagram contains the information about pitch classes). Nevertheless, it overlooks the “sequential order of frames”, which probably causes the over-fitting of root position chords and the under-fitting of chord inversions.
4.6 Baseline comparison
Finally, we compare our LVACE framework with Chordino. It should be emphasized that Chordino is the only suitable baseline because: (1) our framework resembles Chordino in terms of the segmentation and feature extraction processes; (2) Chordino is the only other system that supports seventh chords and chord inversions.
We choose one representative for each type of deep neural net, all trained and cross-validated with JKURB-6seg-ns-[800*2], and compare them with Chordino using the standard MIREX ACE categories as described in Section 1.2. As shown in Figure 13, the BLSTM-RNN representative outperforms Chordino by a large margin in Sevenths and SeventhsBass, and it scores fairly close to Chordino in MajMin and MajMinBass. The other two representatives do not perform as well as Chordino in most categories.
We perform a Friedman test on the track-wise SeventhsBass WCSR results. After that we use the Tukey HSD to perform a multiple comparison test on the Friedman test’s statistics with a significance level of 0.05. As shown in Figure 14, BLSTM-JKURB-ns-6seg-[800*2] is significantly better than the other systems as well as Chordino.
In terms of ACQA, as shown in Figure 15, Chordino outperforms both the FCNN’s and DBN’s representatives, but the most balanced system is the BLSTM-RNN’s representative. We again perform a Friedman test with Tukey HSD (using the track-wise results) to test whether the differences in ACQAs are significant. As shown in Figure 16, the BLSTM-JKURB-ns-6seg-[800*2] system is again significantly better than the other systems as well as Chordino.
In this paper we present an in-depth discussion of a hybrid “GMM-HMM + deep neural net” LVACE approach. Preluded with an argument for the necessity of recognizing chord inversions in practical ACE systems, our work is motivated by a current research gap in ACE, which is the overlooking of large vocabulary and chord inversions. This is the rationale behind the SeventhsBass LVACE implementation. We then put forward the LVACE system framework, which has handcrafted feature extraction and segmentation processes, and uses deep neural nets to classify chords from the features. We conduct several groups of experiments on different system variants of the LVACE framework, from which we report the following major findings:
- The chromagram feature contains prior knowledge about musical pitch class that increases the bias and limits the potential improvement of the models.
- The BLSTM-RNN can learn regularities from the notegram feature, with the potential to outperform models based on the chromagram feature.
- The BLSTM-RNN’s representative system (with all available training data) has significantly better WCSR and ACQA than the FCNN’s one, the DBN’s one, and Chordino.
Although the best system variant significantly outperforms the baseline system, all training and CV scores presented in this paper are still far below 100%. This indicates that either there is a large bias in the LVACE framework itself, or there is irreducible error in the underlying data. We speculate on three potential causes, explained below.
Firstly, the performance of the proposed framework is upper-bounded by the segmentation performance of the GMM-HMM process introduced in Section 2.2, and the performance of this process on the JKURB set is 83%.
Secondly, the segment tiling process introduces bias to the system, since it assumes a chord can be correctly recognized after we tile its original features into several frames of averaged features. This process could help prevent over-fitting by restricting the degrees of freedom of the input, but at times it sacrifices important information conveyed in the original variable-length features.
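The information loss caused by segment tiling can be seen in a minimal sketch of the operation. The version below assumes that a segment is split into a fixed number of nearly equal sub-segments and that the frames within each sub-segment are averaged; the function name, the handling of very short segments and the toy features are assumptions for illustration and may differ from the actual implementation.

```python
def tile_segment(frames, n_tiles=6):
    """Reduce a variable-length segment to `n_tiles` averaged frames.

    `frames` is a list of equal-length feature vectors (lists of floats).
    Segments shorter than `n_tiles` reuse frames so that every tile
    contains at least one frame (an assumed convention).
    """
    t = len(frames)
    if t == 0:
        raise ValueError("segment must contain at least one frame")
    dim = len(frames[0])
    tiles = []
    for i in range(n_tiles):
        start = i * t // n_tiles
        end = max(start + 1, (i + 1) * t // n_tiles)
        chunk = frames[start:end]
        # Average each feature dimension over the frames in this tile.
        tiles.append([sum(f[d] for f in chunk) / len(chunk)
                      for d in range(dim)])
    return tiles


# A 12-frame segment with 2-dimensional toy features becomes 6 frames.
segment = [[float(i), 2.0 * i] for i in range(12)]
tiled = tile_segment(segment, n_tiles=6)
print(len(tiled), tiled[0], tiled[5])

# A 2-frame segment is stretched to 6 frames by repeating its frames.
short = tile_segment([[1.0], [3.0]], n_tiles=6)
print(short)
```

Whatever the original length, the classifier sees exactly `n_tiles` frames, so any variation faster than one tile is averaged away.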
The above two points set a hard performance limit of the proposed LVACE framework: unless the chord segmentation technique is perfect and the segment tiling process is completely excluded, one could not expect a system with very low bias.
Thirdly, there is a non-negligible amount of noise in the ground truth annotations themselves. Inevitably, due to differences in musical training, human annotators sometimes disagree, particularly on long-tail chords Humphrey and Bello . This results in a glass ceiling for LVACE: unless there are more data for uncommon and long-tail chords and they are more consistently labeled, all efforts to improve LVACE will be hindered by the lack of training data for the skewed classes and the lack of data consistency.
In a very strict sense, there is not any “gold standard” if human annotators themselves might disagree with each other. But in a loose sense, there could be a “gold standard” if:
- all annotations are done by only one annotator, or
- all annotations are done by multiple annotators (many more than two).
In the former case, the only annotator “dictates” a local “gold standard”, so that whenever a machine tries to learn from the data, it actually targets this annotator’s “style”. In the latter case, multiple annotators decide a “gold standard” in a way such as majority vote or data fusion Koops et al. , Klein , so that a trained model actually aims at the optimal “style” that minimizes the objections among these annotators. Therefore, although the “gold standard” is indeed an important issue, we still have to design a system that “learns well”.
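As a rough illustration of the multiple-annotator case, the sketch below derives a consensus label per segment by simple majority vote and reports the share of annotators who agree with it. It assumes the annotations are already aligned to common segments, which is a simplification, and it is not the data fusion method of Koops et al. The labels are made up for illustration.

```python
from collections import Counter


def majority_vote(labels_per_segment):
    """Return a list of (consensus label, agreement ratio) per segment.

    `labels_per_segment` holds, for each segment, one label per annotator.
    Ties are broken in favor of the label encountered first.
    """
    consensus = []
    for labels in labels_per_segment:
        label, count = Counter(labels).most_common(1)[0]
        consensus.append((label, count / len(labels)))
    return consensus


# Hypothetical labels from five annotators for two segments.
annotations = [
    ["C:maj", "C:maj", "C:maj7", "C:maj", "C:maj"],
    ["A:min/b3", "A:min", "C:maj6", "A:min", "A:min7"],
]
consensus = majority_vote(annotations)
print(consensus)
```

A low agreement ratio, as in the second segment, marks exactly the kind of label on which a trained model cannot be expected to match every annotator.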
We believe that the next step of LVACE research should focus more on improving the recognition accuracies on uncommon and long-tail chords. That is, instead of considering only the overall WCSR of a large vocabulary, attention should also be given to balanced metrics such as ACQA. Although we have pointed out that the BLSTM-RNN is very promising in handling a large vocabulary with inversions, we have yet to explore possible ways to train the network under such an “imbalanced class population” scenario Chawla et al. . More importantly, in the future development of LVACE we should spend greater effort on data collection, particularly of long-tail chords, and at the same time ensure data integrity and consistency.
- Bello  Juan Pablo Bello. Audio-based cover song retrieval using approximate chord sequences: Testing shifts, gaps, swaps and beats. In Proceedings of the 8th International Society for Music Information Retrieval Conference, ISMIR 2007, volume 7, pages 239–244, 2007.
- Lee  Kyogu Lee. Identifying cover songs from audio using harmonic representation. extended abstract, Music Information Retrieval eXchange task, Victoria, BC, Canada, 2006.
- Serra et al.  Joan Serra, Emilia Gómez, and Perfecto Herrera. Audio cover song identification and similarity: background, approaches, evaluation, and beyond. pages 307–332, 2010.
- Bello and Pickens  Juan Pablo Bello and Jeremy Pickens. A robust mid-level representation for harmonic content in music signals. In Proceedings of the 6th International Society for Music Information Retrieval Conference, ISMIR, volume 5, pages 304–311, 2005.
- Cheng et al.  Heng-Tze Cheng, Yi-Hsuan Yang, Yu-Ching Lin, I-Bin Liao, and Homer H Chen. Automatic chord recognition for music classification and retrieval. In IEEE International Conference on Multimedia and Expo, pages 1505–1508. IEEE, 2008.
- Pérez-Sancho et al.  Carlos Pérez-Sancho, David Rizo, and José M Iñesta. Genre classification using chords and stochastic language models. Connection Science, 21(2-3):145–159, 2009.
- Papadopoulos and Tzanetakis  Hélène Papadopoulos and George Tzanetakis. Modeling chord and key structure with Markov logic. In Proceedings of the 13th International Society for Music Information Retrieval Conference, ISMIR, pages 127–132. Citeseer, 2012.
- Pauwels and Martens  Johan Pauwels and Jean-Pierre Martens. Integrating musicological knowledge into a probabilistic framework for chord and key extraction. In Audio Engineering Society Convention 128. Audio Engineering Society, 2010.
- Papadopoulos and Peeters  Hélène Papadopoulos and Geoffroy Peeters. Simultaneous estimation of chord progression and downbeats from an audio file. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 121–124. IEEE, 2008.
- Mauch and Dixon [2010a] Matthias Mauch and Simon Dixon. Simultaneous estimation of chords and musical context from audio. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1280–1289, 2010a.
- Huang et al.  Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon, with a foreword by Raj Reddy. Spoken language processing: A guide to theory, algorithm, and system development. 1, 2001.
- Deng  Li Deng. Dynamic speech models: theory, algorithms, and applications. Synthesis Lectures on Speech and Audio Processing, 2(1):1–118, 2006.
- Deng and Yu  Li Deng and Dong Yu. Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3–4):197–387, 2014.
- Yu and Deng  Dong Yu and Li Deng. Deep learning and its applications to signal and information processing [Exploratory DSP]. IEEE Signal Processing Magazine, 28(1):145–154, 2011.
- Sheh and Ellis  Alexander Sheh and Daniel PW Ellis. Chord segmentation and recognition using EM-trained hidden Markov models. In Proceedings of the 4th International Society for Music Information Retrieval Conference, ISMIR, pages 185–191. International Symposium on Music Information Retrieval, 2003.
- Fujishima  Takuya Fujishima. Realtime chord recognition of musical sound: A system using common lisp music. In Proceedings of the 25th International Computer Music Conference, volume 1999, pages 464–467, 1999.
- Wakefield  Gregory H Wakefield. Mathematical representation of joint time-chroma distributions. In SPIE’s International Symposium on Optical Science, Engineering, and Instrumentation, pages 637–645. International Society for Optics and Photonics, 1999.
- Weller et al.  Adrian Weller, Daniel Ellis, and Tony Jebara. Structured prediction models for chord transcription of music audio. In International Conference on Machine Learning and Applications, ICMLA, pages 590–595. IEEE, 2009.
- Khadkevich and Omologo  Maksim Khadkevich and Maurizio Omologo. Time-frequency reassigned features for automatic chord recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 181–184. IEEE, 2011.
- Ni et al.  Yizhao Ni, Matt McVicar, Raul Santos-Rodriguez, and Tijl De Bie. An end-to-end machine learning system for harmonic analysis of music. IEEE Transactions on Audio, Speech, and Language Processing, 20(6):1771–1783, 2012.
- Cho and Bello  Taemin Cho and Juan P Bello. MIREX 2013: Large vocabulary chord recognition system using multi-band features and a multi-stream HMM. Music Information Retrieval Evaluation eXchange (MIREX), 2013.
- Burgoyne et al.  John Ashley Burgoyne, Laurent Pugin, Corey Kereliuk, and Ichiro Fujinaga. A cross-validated study of modelling strategies for automatic chord recognition in audio. In Proceedings of the 8th International Society for Music Information Retrieval Conference, ISMIR, pages 251–254, 2007.
- Mauch and Dixon [2010b] Matthias Mauch and Simon Dixon. MIREX 2010: Chord detection using a dynamic Bayesian network. Music Information Retrieval Evaluation Exchange (MIREX), 2010b.
- Mauch [2010a] Matthias Mauch. Simple chord estimate: Submission to the MIREX chord estimation task. Music Information Retrieval Evaluation Exchange (MIREX), 2010a.
- Humphrey and Bello  Eric J Humphrey and Juan P Bello. Rethinking automatic chord recognition with convolutional neural networks. In Proceedings of the 11th International Conference on Machine Learning and Applications (ICMLA), volume 2, pages 357–362. IEEE, 2012.
- Boulanger-Lewandowski et al.  Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Audio chord recognition with recurrent neural networks. In Proceedings of the 14th International Society for Music Information Retrieval Conference, ISMIR, pages 335–340, 2013.
- Sigtia et al.  Siddharth Sigtia, Nicolas Boulanger-Lewandowski, and Simon Dixon. Audio chord recognition with a hybrid recurrent neural network. In Proceedings of the 16th International Society for Music Information Retrieval Conference (ISMIR), 2015.
- Zhou and Lerch  Xinquan Zhou and Alexander Lerch. Chord detection using deep learning. In Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR, volume 53, 2015.
- Downie  J Stephen Downie. The music information retrieval evaluation exchange (2005-2007): A window into music information retrieval research. Acoustical Science and Technology, 29(4):247–255, 2008.
- Pauwels and Peeters  Johan Pauwels and Geoffroy Peeters. Evaluating automatically estimated chord sequences. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 749–753. IEEE, 2013.
- Raffel et al.  Colin Raffel, Brian McFee, Eric J Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel PW Ellis. mir_eval: A transparent implementation of common MIR metrics. In Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR. Citeseer, 2014.
- Burgoyne et al.  J Ashley Burgoyne, W Bas de Haas, and Johan Pauwels. On comparative statistics for labelling tasks: What can we learn from MIREX ACE 2013. In Proceedings of the 15th Conference of the International Society for Music Information Retrieval (ISMIR), pages 525–530, 2014.
- Deng and Kwok  Junqi Deng and Yu-Kwong Kwok. Automatic chord estimation on SeventhsBass chord vocabulary using deep neural network. In Proceedings of the 41st International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
- Mauch [2010b] Matthias Mauch. Automatic chord transcription from audio using computational models of musical context. PhD thesis, School of Electronic Engineering and Computer Science Queen Mary, University of London, 2010b.
- Mauch and Dixon [2010c] Matthias Mauch and Simon Dixon. Approximate note transcription for the improved identification of difficult chords. In Proceedings of the 11th International Society for Music Information Retrieval Conference, ISMIR, pages 135–140, 2010c.
- Gómez  Emilia Gómez. Tonal description of music audio signals. Department of Information and Communication Technologies, 2006.
- Lawson and Hanson  Charles L Lawson and Richard J Hanson. Solving least squares problems. 15, 1995.
- Hinton et al.  Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
- Hinton and Salakhutdinov  Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Graves  Alex Graves. Supervised sequence labelling. 2012.
- Bengio  Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
- Deng and Kwok  Junqi Deng and Yu-Kwong Kwok. MIREX 2015 submission: Automatic chord estimation with chord correction using neural network, 2015.
- Harte  Christopher Harte. Towards automatic extraction of harmony information from music signals. PhD thesis, Department of Electronic Engineering, Queen Mary, University of London, 2010.
- Cho  Taemin Cho. Improved techniques for automatic chord recognition from music audio signals. PhD thesis, New York University, 2014.
- Humphrey  Eric Humphrey. An Exploration of Deep Learning in Music Informatics. PhD thesis, New York University, 2015.
- Srivastava et al.  Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Prechelt  Lutz Prechelt. Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761–767, 1998.
- Zeiler  Matthew D Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
- Geman et al.  Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the bias/variance dilemma. Neural computation, 4(1):1–58, 1992.
- Friedman et al.  Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. 1, 2001.
- Friedman  Milton Friedman. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32(200):675–701, 1937.
- Tukey  John W Tukey. Comparing individual means in the analysis of variance. Biometrics, pages 99–114, 1949.
- Humphrey and Bello  Eric J Humphrey and Juan P Bello. Four timely insights on automatic chord estimation. In Proceedings of the 16th Conference of the International Society for Music Information Retrieval (ISMIR), 2015.
- Koops et al.  Hendrik Vincent Koops, W Bas de Haas, and Anja Volk. Integration of crowd-sourced chord sequences using data fusion. In Proceedings of the 16th International Society for Music Information Retrieval Conference, ISMIR, 2015.
- Klein  Lawrence A Klein. Sensor and data fusion: a tool for information assessment and decision making. 324, 2004.
- Chawla et al.  Nitesh V Chawla, Nathalie Japkowicz, and Aleksander Kotcz. Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter, 6(1):1–6, 2004.