Matching of a jingju (Beijing opera) singing phrase (.wav) with a score corpus (.xml)
We approach the singing phrase audio to score matching problem by using phonetic and duration information - with a focus on studying the jingju a cappella singing case. We argue that, due to the existence of a basic melodic contour for each mode in jingju music, only using melodic information (such as pitch contour) will result in an ambiguous matching. This leads us to propose a matching approach based on the use of phonetic and duration information. Phonetic information is extracted with an acoustic model shaped with our data, and duration information is considered with the Hidden Markov Models (HMMs) variants we investigate. We build a model for each lyric path in our scores and we achieve the matching by ranking the posterior probabilities of the decoded most likely state sequences. Three acoustic models are investigated: (i) convolutional neural networks (CNNs), (ii) deep neural networks (DNNs) and (iii) Gaussian mixture models (GMMs). Also, two duration models are compared: (i) hidden semi-Markov model (HSMM) and (ii) post-processor duration model. Results show that CNNs perform better in our (small) audio dataset and also that HSMM outperforms the post-processor duration model.
The ultimate goal of our research project is to automatically evaluate the jingju a cappella singing of a student in the scenario of jingju singing education – see Figure 1. Jingju, a traditional Chinese performing art also known as Peking or Beijing opera, is extremely demanding in the clear pronunciation and accurate intonation of each syllabic or phonetic singing unit. To this end, during the initial learning stages, students are required to imitate the tutor’s singing completely. Therefore, the automatic jingju singing evaluation system we envision is based on this training principle and measures the intonation and pronunciation similarities between the student’s and the tutor’s singing. Before measuring similarities, the singing phrase should be automatically segmented into syllabic or phonetic units in order to capture the temporal details. Jingju music scores, which contain the phonetic and duration information for each singing syllable, will be beneficial for this segmentation. In the application scenario, the score of a query audio could be selected from the database by the user. However, to avoid manual intervention and improve the user experience, we tackle the problem of automatically finding the corresponding music score for a given query audio (shown in bold in fig:design_framework). Note that successful methods for audio to score matching might be beneficial for several music informatics research (MIR) tasks, such as score-informed automatic syllable/phoneme segmentation or score-informed source separation. The objective of this research task is to find the corresponding score for a given singing audio query. We restrict this research to the “matching” scope by pre-segmenting both the singing audios and the music scores into phrase units.
Xipi and erhuang are the main modes in jingju music. Each has two basic melodic contours – an opening phrase and a closing phrase. Each basic melodic contour is constructed upon characteristic pitch progressions for each mode . Therefore, singing phrases from different arias sharing the same mode are likely to have a similar melodic contour. fig:melodic_contour_similar shows an example of this fact.
However, melodic information tends to be intuitively used for such matching tasks. For example in Query-by-Singing/Humming (QBSH) , melodic similarities can be obtained by comparing the distance between the F0 contour of the query audio and those synthesized from the candidate scores. Then, the best matched music score can be retrieved by selecting the most similar melody. But note that using this approach for jingju music would bring matching errors since the melodic contours of the same mode are similar in this sense. In this case, it is more appropriate to use another notion of similarity. We propose using the lyrics since the stories narrated in different jingju arias are distinctive and lyrics tend to change through different jingju arias – even when they share the same mode. Therefore, phonetic information might be useful to identify a similar score given a query audio.
QBSH, which retrieves a song from a sung portion of it, is the research task most closely related to our study. Most of the studies use melody information as the only cue. The typical process of such systems was introduced by Molina et al.: firstly, the F0 contour and/or a note-level transcription for a given singing query are extracted; and then, a set of candidate songs are retrieved from a large database using a melodic matcher module. The most successful QBSH system, which obtained the best results in the MIREX 2016 contest, is based on the fusion of multiple similarity measurements. This system proposed a melodic matcher which combines several note-based and frame-based similarities. The authors claim that the fusion mechanism improves the query performance because no similarity measurement is perfect. Therefore, information sources that are complementary to each other might be beneficial for this approach. Very few studies have explored the capability of the phonetic information for QBSH. Guo et al. and Wang et al. both used a lyric recognizer based on Hidden Markov Models (HMMs). Their recognition networks (the recognition network defines the topology of the HMM) were constructed with the phonetic information from the query candidates database. They used frame-based MFCCs to create the acoustic models with GMMs. Then, the Viterbi algorithm was executed over the recognition networks to either obtain the most likely phonetic state sequence (for Wang et al.) or the posterior probability of each possible decoding path (for Guo et al.). The final score of a query candidate is either based on semantic similarity or based on the posterior probability of its corresponding lyrics.
Another research task related to our study is singing keyword spotting. The main goal of this task is to search for one or more keywords in a singing query. The system proposed by Kruspe searches for a specific singing keyword on the resulting phoneme observations. A keyword-filler HMM is employed for this purpose. She used two phoneme duration models: the HSMM and the post-processor duration model.
Finally, both phonetic and duration information extracted from the score have been extensively used in alignment-related tasks, such as audio-to-score alignment and audio-to-lyrics alignment. For example, Gong et al. construct a left-to-right HSMM using phonetic and duration information, and Dzhambazov et al. use a similar approach for aligning polyphonic audio. Analogously, the proposed approach explores the use of both phonetic and duration information (available in scores) to tackle the matching ambiguity problem existing in jingju music.
The remainder of this paper is organized as follows: section 2 introduces the dataset, and section 3 explains the modules of the proposed approach, detailing how to incorporate phonetic and duration information. Experiments and results are reported in section 4, and section 5 concludes and points out future work.
The jingju a cappella singing dataset is composed of two overlapping parts: (i) audio and (ii) score datasets.
The audio dataset used for this study consists of two role-types singing: dan (young woman) and laosheng (old man). The dan part of this dataset has 42 recordings sung by 7 singers and the laosheng part has 23 recordings sung by 7 laosheng singers. The boundary annotations of the audio dataset have been done in Praat format (textgrid) considering a hierarchy of three levels: phrase, syllable and phoneme – using Chinese characters, pinyin notations and X-SAMPA notations, respectively. 32 phoneme classes are used in the phoneme-level annotation. Two Mandarin native speakers and a jingju musicologist carried out the annotation. Annotations and more detailed information can be found online (http://doi.org/10.5281/zenodo.344932). Some statistics about the dataset are reported in table:detailInfoDataset. The average phrase, syllable and voiced phoneme length of dan singing are ostensibly greater than those of laosheng singing (bold numbers in table:detailInfoDataset), which might indicate that dan singing tends to have more pitch variation and ornamentation – as we could observe empirically by listening to the data.
|Num.||Avg. len (s)||Std. len (s)|
The audio dataset, along with its boundary annotations, is split into three parts: training set, development (dev) set and test set. We define the training set to be the part that does not overlap with the score dataset, see fig:dataset_intersection. The training set will be used for calculating the phonetic duration (duration information) and training acoustic models (phonetic information). After taking the training set out, we define the development set to be a randomly selected half of the remaining phrases in the audio dataset – it will be used for parameter optimization. The test set consists of the remaining phrases of the audio dataset – it will be used for testing the acoustic models performance and the matching performance.
On the other hand, the score dataset contains 435 dan phrases and 481 laosheng phrases. The scores have been typed in stave notation (including lyrics in Chinese characters) using MuseScore from different printed sources in jianpu notation. Since tempo is usually not clearly noted in the printed score, we do not include this information in the dataset. The relative syllabic durations are indicated by the note durations corresponding to the lyrics, which will be used to calculate the phonetic duration (duration information) and the matching network. The whole score dataset will be used as candidates for testing the matching performance and for parameter optimization.
The proposed approach aims to match the query audio to its score by using phonetic and duration information. During the training process (red boxes in fig:main_diagram): the acoustic models of each phoneme are shaped by using the audio training set and its phonetic boundary annotations; the score dataset is used to construct a matching network; and phoneme duration distributions are estimated by using both audio training set and scores. During the matching process (green boxes in fig:main_diagram): two duration models –HSMM and post-processor– are explored for the Viterbi decoding step. Finally, the best matched phrase is found by ranking the decoded state sequence probabilities.
The acoustic models presented here aim to represent the relationship between an audio signal and the 32 phoneme classes present in our dataset. The output of these models yields probability scores for each phoneme class.
As a first baseline, we use a 40-component GMM with the following input vector: 13 MFCCs, their deltas and delta-deltas. Moreover, DNNs have been found very useful for acoustic modeling [9, 13]. Therefore, we propose an additional baseline: a DNN with 2 hidden layers followed by the 32-way softmax output layer – the input is set to be a log-mel spectrogram.
However, DNNs are very prone to over-fitting and the available dataset is relatively small. For that reason we propose using CNNs, which are more robust against over-fitting because they share parameters. Additionally, Pons et al. have successfully used spectrogram-based CNNs for learning music timbre representations from small datasets. Given that timbre is an important feature for acoustic modeling, we propose using the same architecture: a single convolutional layer with filters of various sizes [17, 16]. The input is set to be a log-mel spectrogram. The layer uses three groups of filters, each group with two filter sizes; a filter size is given as its frequential extent followed by its temporal extent. A max-pool layer spanning the whole temporal dimension of the resulting feature map is followed by a 32-way softmax output layer with 30% dropout. This max-pool shape was chosen to achieve time-invariant representations while keeping the frequency resolution. Same-padding is used to preserve the dimensions of the feature maps so that they can be concatenated. Filter shapes are designed so that filters can capture the relevant time-frequency contexts for learning timbre representations, according to the design strategy proposed by Pons et al.
Log-mel spectrograms are of size 80 × 21 (mel bands × frames): the network takes a decision for a frame given its context, 21 frames in total at a hop of 10ms. Activation functions are ELUs [11]. Training uses early stopping: it halts when the validation loss (categorical cross-entropy) has not decreased after 10 epochs.
Spectrograms are computed from audio recordings sampled at 44.1 kHz. The STFT is performed using a window length of 25ms (zero-padded to 2048 samples) with a hop size of 10ms. The energies of the 80 log-mel bands are calculated on frequencies between 0Hz and 11000Hz and are standardized to have zero mean and unit variance.
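The framing and standardization settings above can be checked with a few lines of arithmetic. The following is a minimal sketch using only the values stated in the text; the function names are illustrative, and standardizing over a flat list of values is an assumption, since the text does not say whether the statistics are computed per band or globally.

```python
import math

SAMPLE_RATE = 44100   # Hz, as stated in the text
WINDOW_MS = 25        # analysis window length in milliseconds
HOP_MS = 10           # hop size in milliseconds
N_FFT = 2048          # FFT size reached by zero-padding the window


def frame_params(sample_rate=SAMPLE_RATE, window_ms=WINDOW_MS, hop_ms=HOP_MS):
    """Return (window_samples, hop_samples) for the given settings."""
    window = int(round(sample_rate * window_ms / 1000))
    hop = int(round(sample_rate * hop_ms / 1000))
    return window, hop


def standardize(values):
    """Scale a list of numbers to zero mean and unit variance.

    Assumption: statistics are taken over the whole list passed in.
    """
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # guard against a constant input
    return [(v - mean) / std for v in values]


window_samples, hop_samples = frame_params()
```

The window is shorter than the FFT size, which is why the text mentions zero-padding.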
The acoustic models are trained separately for each role-type and their performance is reported in section 4.2.
The matching network defines the topology of the hidden Markov model. By using each candidate phrase in the score dataset as an isolated unit, isolated-phrase matching networks can be constructed. fig:matching_network shows the structure of this matching network, which has one lyric path per candidate phrase.
The matching network uses HMMs or HSMMs, depending on how the internal duration is modeled. Each path is a left-to-right state chain which represents the phoneme transcription of its lyrics. In order to construct the lyric path, pinyin lyrics are segmented into phonetic units and transcribed into X-SAMPA notations by using a predefined dictionary. For example, a path which has the lyrics yan jian de hong ri in pinyin is a chain consisting of 12 states: j, En, c, j, En, c, 7, x, UN, N, r\’, 1 in X-SAMPA notation. When the decoding process has finished, each lyric path can get a posterior probability which will be used as the similarity measure between the query phrase and the candidate phrase.
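The construction of a lyric path can be sketched as a dictionary lookup. This is a minimal illustration, not the authors' implementation: `PINYIN_TO_XSAMPA` is a hypothetical five-entry excerpt, and the way the 12 states are grouped into syllables is inferred from the example in the text rather than stated by it.

```python
# Hypothetical excerpt of a pinyin -> X-SAMPA dictionary. The grouping of
# phonemes per syllable is an assumption inferred from the 12-state example.
PINYIN_TO_XSAMPA = {
    "yan": ["j", "En"],
    "jian": ["c", "j", "En"],
    "de": ["c", "7"],
    "hong": ["x", "UN", "N"],
    "ri": ["r\\’", "1"],
}


def lyric_path(pinyin_lyrics, dictionary=PINYIN_TO_XSAMPA):
    """Return the left-to-right chain of phoneme states for a lyric line."""
    states = []
    for syllable in pinyin_lyrics.split():
        # A KeyError here means the syllable is missing from the dictionary.
        states.extend(dictionary[syllable])
    return states


path = lyric_path("yan jian de hong ri")
```

Because the same phoneme can occur several times in one path, states are best identified by their position in the chain rather than by their phoneme label.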
Phonetic duration information comes from two sources: the boundary annotations of audio training dataset and the score dataset. The phonetic duration is not directly indicated in the score. However, it is indispensable for modeling the phonetic duration distribution for each state in the matching network. The syllable, of which duration can be deduced by the corresponding note(s), is used to restrict the durations of the phonemes.
In the following, we propose a method for estimating the absolute phonetic duration given: (i) the score, and (ii) the phoneme duration histograms computed from the audio dataset annotations. First, we omit silence parts in the query audio (with a simple voice activity detection method) and also in the score by removing the rest notes. Second, we compute the duration histogram and its duration centroid for each phoneme class – by aggregating the phonetic durations indicated in the boundary annotations of the audio training dataset. Then, we segment each syllabic duration in the score dataset into phonetic durations according to the proportion of their duration centroids. Finally, as the scores do not contain tempo, we normalize the phonetic durations of each phrase such that their summation is equal to the duration of the query audio. See fig:phoneme_duration for an equivalent graphic explanation. In fig:phoneme_duration, the centroid durations of these three phonemes are 0.46s, 0.9s and 0.1s, summing to 1.46s – alternatively, these can be expressed as proportions of 1.46s: 0.32, 0.62 and 0.06. With these proportions and the absolute syllable duration (2s), we can compute the absolute phoneme durations: 0.32 × 2s = 0.64s, 0.62 × 2s = 1.24s and 0.06 × 2s = 0.12s.
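The two proportional steps of this procedure (splitting a syllable by centroid proportions, then rescaling a phrase to the query length) can be sketched as follows. The function names are illustrative. The sketch works with unrounded proportions, so its values differ slightly from the rounded figures quoted above.

```python
def split_syllable(syllable_duration, centroids):
    """Split a syllable duration among its phonemes.

    Each phoneme receives a share proportional to the duration centroid
    of its phoneme class.
    """
    total = sum(centroids)
    return [syllable_duration * c / total for c in centroids]


def normalize_to_query(phoneme_durations, query_duration):
    """Rescale durations so that they sum to the query audio duration."""
    scale = query_duration / sum(phoneme_durations)
    return [d * scale for d in phoneme_durations]


# Centroids and syllable duration taken from the worked example in the text.
durations = split_syllable(2.0, [0.46, 0.9, 0.1])
```

The same rescaling applies at phrase level: once every syllable has been split, the whole list of phoneme durations is passed through `normalize_to_query`.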
The phonetic duration distribution needs to be calculated for each state in the matching network in order to incorporate the a priori phonetic duration information. We model it by Gaussian distributions:

$$\mathcal{N}(d;\, \mu_x, \sigma_x^2) = \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left(-\frac{(d - \mu_x)^2}{2\sigma_x^2}\right)$$

where $\mu_x$ is the duration of the phoneme $x$ deduced by the above method and the standard deviation $\sigma_x$ is proportional to $\mu_x$: $\sigma_x = \gamma\,\mu_x$. The proportionality constant $\gamma$ will be optimized in section 4.3 for each role-type.
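A Gaussian duration density whose standard deviation is a fixed multiple of its mean can be written in a few lines. This is a minimal sketch; the names `duration_pdf`, `mu` and `gamma` are chosen here for illustration, with `gamma` standing for the proportionality constant.

```python
import math


def duration_pdf(d, mu, gamma):
    """Gaussian density of a duration d (seconds).

    mu is the expected phoneme duration; the standard deviation is
    gamma * mu, so longer phonemes tolerate larger absolute deviations.
    """
    sigma = gamma * mu
    z = (d - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# The density peaks at the expected duration and decays away from it.
peak = duration_pdf(0.64, mu=0.64, gamma=0.5)
off_peak = duration_pdf(1.00, mu=0.64, gamma=0.5)
```

A small constant gives a narrow distribution that penalizes any deviation from the score-derived duration, while a large one makes the duration prior nearly flat.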
A standard Markovian state does not impose an explicit duration distribution. Instead, it imposes an implicit state occupancy distribution, which corresponds to a “1-shifted” geometric distribution:

$$d_j(u) = (1 - \tilde{p}_{jj})\,\tilde{p}_{jj}^{\,u-1}, \quad u = 1, 2, \ldots$$

where $u$ denotes the occupancy or sojourn time in a Markovian state and $\tilde{p}_{jj}$ denotes the self-transition probability of the state $j$. Because the Markovian state occupancy is implicit, the phonetic duration distribution introduced in section 3.3 cannot be imposed. Kruspe presents two duration modeling techniques for HMMs: hidden semi-Markov model (HSMM) and post-processor duration model.
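The implicit occupancy distribution of a Markovian state is easy to inspect numerically. In this minimal sketch (function name illustrative), the probability of staying exactly `u` frames is the probability of `u - 1` self-transitions followed by one exit.

```python
def geometric_occupancy(u, self_transition):
    """Probability of occupying a Markovian state for exactly u frames."""
    if u < 1:
        raise ValueError("occupancy must be at least one frame")
    return (1.0 - self_transition) * self_transition ** (u - 1)


# Whatever the self-transition probability, the most likely occupancy is a
# single frame, and the probability decays monotonically with u.
probs = [geometric_occupancy(u, 0.9) for u in range(1, 6)]
```

This monotone decay is the limitation noted above: a plain HMM state cannot express that a sung vowel is most likely to last, say, one second.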
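A semi-Markov chain $\{S_t\}$ with finite state space is defined by the following parameters:

initial probabilities $\pi_j = P(S_0 = j)$ with $\sum_j \pi_j = 1$;

transition probability of semi-Markovian state $j$: for each $k \neq j$, $p_{jk} = P(S_{t+1} = k \mid S_{t+1} \neq j,\, S_t = j)$ with $\sum_{k \neq j} p_{jk} = 1$ and $p_{jj} = 0$.

An explicit occupancy distribution is attached to each semi-Markovian state:

$$d_j(u) = P(S_{t+u+1} \neq j,\ S_{t+u-v} = j,\ v = 0, \ldots, u-2 \mid S_{t+1} = j,\ S_t \neq j), \quad u = 1, \ldots, M_j$$

where $M_j$ denotes the upper bound to the time spent in state $j$. $d_j(u)$ defines the conditional probability of leaving state $j$ at time $t+u+1$ and entering state $j$ at time $t+1$.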
To apply HSMMs to the matching network, we first use the matching network as the HSMMs topology, thus the state occupancy distribution is set to its corresponding phonetic duration distribution. Then the probabilities of each left-to-right state transition are set to 1 because all self-transition probabilities in HSMMs are 0. The goal is to find the most likely sequence of hidden states for each lyric path and collect its posterior probability. The Viterbi algorithm meets this exact goal and its complete implementation is provided in .
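For a single left-to-right lyric path in which every forward transition has probability 1, explicit-duration Viterbi decoding reduces to choosing how many frames each state occupies. The following is a minimal sketch of that dynamic program, not the implementation referenced above. `log_obs` and `log_dur` are hypothetical inputs, and the returned value is the log score of the best segmentation rather than a normalized posterior.

```python
NEG_INF = float("-inf")


def hsmm_path_score(log_obs, log_dur, max_dur):
    """Best segmentation of the frames along one left-to-right path.

    log_obs[j][t]: log observation probability of frame t under state j.
    log_dur(j, u): log occupancy probability of u frames in state j.
    max_dur: upper bound on the number of frames spent in any state.
    Returns (best log score, list of frames spent in each state).
    """
    n_states, n_frames = len(log_obs), len(log_obs[0])
    # prefix[j][t] holds the sum of log_obs[j][0:t] for fast segment sums.
    prefix = [[0.0] * (n_frames + 1) for _ in range(n_states)]
    for j in range(n_states):
        for t in range(n_frames):
            prefix[j][t + 1] = prefix[j][t] + log_obs[j][t]
    # best[j][t]: best score with states 0..j covering exactly frames 0..t-1.
    best = [[NEG_INF] * (n_frames + 1) for _ in range(n_states)]
    back = [[0] * (n_frames + 1) for _ in range(n_states)]
    for j in range(n_states):
        for t in range(1, n_frames + 1):
            for u in range(1, min(max_dur, t) + 1):
                if j == 0:
                    prev = 0.0 if t == u else NEG_INF
                else:
                    prev = best[j - 1][t - u]
                if prev == NEG_INF:
                    continue
                segment = prefix[j][t] - prefix[j][t - u]
                score = prev + log_dur(j, u) + segment
                if score > best[j][t]:
                    best[j][t], back[j][t] = score, u
    if best[-1][n_frames] == NEG_INF:
        return NEG_INF, []
    # Backtrack the chosen occupancy of each state, last state first.
    durations, t = [], n_frames
    for j in range(n_states - 1, -1, -1):
        durations.append(back[j][t])
        t -= back[j][t]
    durations.reverse()
    return best[-1][n_frames], durations
```

Running this once per lyric path and ranking the returned scores gives the kind of candidate ordering the matching step relies on.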
The post-processor duration model was first introduced by Juang et al. . It was then experimentally proved in Kruspe’s paper  that this duration model works better than HSMMs for the keyword spotting task in English pop singing voice. The post-processor duration model uses the original HMMs Viterbi algorithm – therefore, during the decoding process no explicit occupancy duration distribution is imposed.
The log posterior probability of the decoded most likely state sequence is augmented by the log duration probabilities:

$$\log \hat{f} = \log f + \alpha \sum_{j=1}^{N} \log d_j(u_j)$$

where $f$ is the HMM posterior probability, $\alpha$ is a weighting factor which will be optimized in section 4.3, $N$ is the number of decoded states in the most likely state sequence, and $d_j(u_j)$ is the occupancy probability of being in state $j$ for the occupancy $u_j$.
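The post-processing step can be sketched as a run-length pass over the decoded state sequence followed by a weighted sum. This is a minimal illustration with hypothetical names; `state_sequence` is assumed to hold one state index per frame (position in the lyric path, not phoneme label), and `log_dur` is any function returning a log occupancy probability.

```python
from itertools import groupby


def state_occupancies(state_sequence):
    """Collapse a per-frame state sequence into (state, frames) runs."""
    return [(state, len(list(run))) for state, run in groupby(state_sequence)]


def augmented_log_score(log_posterior, state_sequence, log_dur, alpha):
    """Add the weighted log duration probabilities to an HMM log score.

    The state sequence itself is left untouched: durations only rescore
    the path that standard Viterbi decoding already chose.
    """
    bonus = sum(log_dur(state, frames)
                for state, frames in state_occupancies(state_sequence))
    return log_posterior + alpha * bonus


runs = state_occupancies([0, 0, 0, 1, 2, 2])
```

Because the path is fixed before durations are considered, a badly decoded sequence can only be penalized here, never corrected.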
Two experiments (code: https://github.com/ronggong/jingjuSingingPhraseMatching/tree/v0.1.0) are performed: the first evaluates the performance of the acoustic models, and the second evaluates the proposed matching approaches. For the first task, we use one simple evaluation metric: the overall classification accuracy, which is defined as the fraction of instances that are correctly classified. For the second task, our goal is to evaluate the ability of matching the ground-truth phrase in the score dataset to the query one, which is almost identical to the goal of a QBSH system: “finding the ground-truth song in a song database from a given singing/humming query”. Therefore, we borrow the standard performance metrics used in the QBSH task to evaluate our approaches: Top-M hit and Mean Reciprocal Rank (MRR). The Top-M hit rate is the proportion of queries for which $r \leq M$, where $r$ denotes the rank of the ground-truth score phrase. MRR is the average of the reciprocal ranks across all queries:

$$\mathrm{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{r_i}$$

where $Q$ is the number of queries, and $r_i$ is the posterior probability rank of the ground-truth phrase corresponding to the i-th query.
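Both matching metrics depend only on the rank of the ground-truth phrase for each query. A minimal sketch follows; the function names and the dictionary-of-scores input format are assumptions made for illustration.

```python
def rank_of(scores, truth):
    """1-based rank of the ground-truth candidate.

    scores maps each candidate phrase id to its score (higher is better).
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(truth) + 1


def top_m_hit_rate(ranks, m):
    """Proportion of queries whose ground truth is ranked within the top m."""
    return sum(1 for r in ranks if r <= m) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


ranks = [1, 2, 4, 1]
mrr = mean_reciprocal_rank(ranks)    # (1 + 0.5 + 0.25 + 1) / 4 = 0.6875
top1 = top_m_hit_rate(ranks, 1)      # 2 of 4 queries ranked first = 0.5
```

MRR rewards near misses (a rank of 2 still contributes 0.5), whereas the Top-M hit rate treats everything outside the top M as a failure.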
CNN, DNN and GMM acoustic models yield probability scores for each phoneme class. In order to evaluate the classification accuracy, we choose the phoneme class with the maximum probability score as the prediction. table:performance_am reports the performance of CNN, DNN and GMM acoustic models evaluated on the test set.
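The evaluation rule described here (take the highest-scoring class, then count matches) can be sketched as follows. The names and the list-of-lists input format are illustrative assumptions.

```python
def predict(frame_scores):
    """Index of the phoneme class with the maximum probability score."""
    return max(range(len(frame_scores)), key=frame_scores.__getitem__)


def accuracy(all_scores, labels):
    """Fraction of instances whose predicted class equals the label."""
    correct = sum(1 for scores, label in zip(all_scores, labels)
                  if predict(scores) == label)
    return correct / len(labels)


# Three frames scored over three classes; the last frame is misclassified.
scores = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]]
acc = accuracy(scores, [0, 1, 2])
```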
The relatively low classification accuracies for all three models show that modeling the phonetic characteristics of the jingju singing voice is a challenging problem. Our best results are achieved with CNNs – and GMMs perform better than DNNs. Interestingly, these results contrast with the literature, where Hinton et al. report that DNN acoustic models largely outperform GMMs for automatic speech recognition, and Maas et al. showed that CNNs perform worse than DNNs for building speech acoustic models. First, we argue that in our case DNNs perform worse than GMMs and CNNs because only a small amount of training data is available. DNNs require a lot of training data to achieve good performance, and large amounts of training data are typically not available for most MIR tasks. Second, the CNNs used here are specifically designed to efficiently learn timbre representations, while Maas et al. used small square filters, which proved successful in computer vision tasks. These results show that using CNN architectures designed for the task at hand is especially beneficial in small data scenarios. A CNN model is used in the following experiments.
The parameters which need to be optimized for dan and laosheng role-types are: the weighting factor for the post-processor duration model, and the proportionality constant for both models: HSMMs and post-processor duration model. Table 3 reports the optimal values we obtained by doing grid search on the development set – MRR metric was maximized.
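| Parameter | Search grid | Optimal values |
|---|---|---|
| weighting factor (post-processor) | with step 0.25 | 1.0 / 1.0 |
| proportionality constant (HSMMs) | with step 0.1 | 0.1 / 0.1 |
| proportionality constant (post-processor) | with step 0.1 | 0.7 / 1.5 |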
To highlight the advantage of using duration modeling methods for audio to score matching, a standard HMM without an explicitly imposed occupancy distribution is used as a baseline. Results in fig:evaluation show that its performance is inferior to that of the HSMM duration model.
One can also observe in fig:evaluation that HSMM performs the best, improving the baseline MRR metric performance by 13.2% for dan role-type and 15.1% for laosheng role-type. This means that HSMMs explicit duration modeling can help achieve a better audio to score matching by using phonetic information.
The post-processor duration model does not significantly improve the baseline performance. This result contrasts with the literature, where the post-processor duration model worked better than HSMMs for singing voice keyword spotting. This inconsistency might result from (i) the length difference of the matching unit (singing-words vs. singing-phrases), and (ii) the large standard deviation of the jingju singing phoneme lengths. First, in Kruspe’s work, the matching unit is the singing keyword – which usually contains fewer phonemes than a singing phrase (as in our case). Second, the vowel length standard deviation of the a cappella dataset used by Kruspe (around 0.3s) is much smaller than in our dataset (dan: 0.97s, laosheng: 0.78s) – denoting less vowel duration variance than in our study case. Moreover, a significant deficiency of the post-processor duration model is that it does not provide the most likely state sequence by internally considering the durations; instead, it computes a new weighted likelihood given the obtained sequence. If the most likely state sequence is decoded badly, it cannot be restored by the post-processor duration model.
In this paper we presented an audio to score matching approach that uses phonetic and duration information.
We explored two duration models: HSMM and post-processor duration model. HSMMs achieved better results than the post-processor duration model – probably due to (i) the matching units length, (ii) the large standard deviation of the considered phonemes, and (iii) because for the post-processor duration model it is hard to recover a decoding mistake. Moreover HSMMs achieved a better matching performance than the baseline-HMMs approach, which only took into account phonetic information, denoting the utility of using duration information.
We also compared CNN, DNN and GMM acoustic models, and CNNs have shown to be superior in our small singing voice audio dataset. The used CNN architecture was specifically designed to learn timbral representations efficiently  – this being the key factor for enabling CNNs (a deep learning method requiring large amounts of data) to perform so well in such a small dataset.
There are many possibilities to improve our approach. It has been shown in the speech research field that LSTM RNNs achieved the best acoustic modeling performance. However, this method requires a large training dataset in order to prevent over-fitting. Another possibility to improve our acoustic model is to go deeper with the current single-layer CNN architecture, but this will also require more training data. We plan to collect more jingju a cappella singing recordings and perform data augmentation to leverage the capability of the acoustic models. Furthermore, in order to take advantage of the melodic information existing in both audio and score datasets, we also plan to investigate methods which can fuse melodic, phonetic and duration information.
We are grateful for the GPUs donated by NVidia. This work is partially supported by the Maria de Maeztu Programme (MDM-2015-0502) and by the European Research Council under the European Union’s Seventh Framework Program, as part of the CompMusic project (ERC grant agreement 267583).