Learning Frame Similarity using Siamese networks for Audio-to-Score Alignment

Audio-to-score alignment aims at generating an accurate mapping between a performance audio and the score of a given piece. Standard alignment methods are based on Dynamic Time Warping (DTW) and employ handcrafted features, which cannot be adapted to different acoustic conditions. We propose a method to overcome this limitation using learned frame similarity for audio-to-score alignment. We focus on offline audio-to-score alignment of piano music. Experiments on music data from different acoustic conditions demonstrate that our method achieves higher alignment accuracy than a standard DTW-based method that uses handcrafted features, and generates robust alignments whilst being adaptable to different domains at the same time.



I Introduction

The significance of neural networks for signal processing was pointed out early by [21, 27], and their efficacy for Music Information Retrieval (MIR) has been demonstrated for a variety of tasks like music generation [13], music transcription [19] as well as music alignment [11]. Audio-to-score alignment is the task of finding the optimal mapping between a performance and the score for a given piece of music. Dynamic Time Warping (DTW) [33] has been the de facto standard for this task, typically incorporating handcrafted features [10, 15, 5]. The primary limitation of handcrafted features lies in their inability to adapt to different acoustic settings and thereby model real world data in a robust manner, in addition to not being optimized for the task at hand.

This paper presents a novel method for DTW-based audio-to-score alignment, which does not depend on handcrafted features, but learns them directly from the music data at the frame level. We propose learning a frame similarity matrix using neural networks, which is then passed on to a DTW algorithm that computes the optimal warping path through the matrix, yielding the alignment. We propose the use of twin Siamese networks [8], each containing a Convolutional Neural Network (CNN) [26] architecture, for learning frame similarity. The advantage of our method is that it can efficiently learn meaningful representations for DTW directly from data and is thereby adaptable to different acoustic settings.

We conduct experiments on piano music using our approach and test its performance on the Mazurka dataset [34], which contains recordings from different eras spanning various acoustic conditions; and demonstrate improvements over MATCH [9], a standard DTW-based method that uses handcrafted features. We additionally explore two methods to improve the performance of our baseline models, namely salience representations [7] and data augmentation.

To the authors’ knowledge, this is the first method to employ learned frame similarities using Siamese CNNs for audio-to-score alignment. Additionally, this is the first method to incorporate pitch salience for audio-to-score alignment to the authors’ knowledge. The rest of the paper is organized as follows: We describe prior work and our relation to it in Section II. Section III details our proposed method and model pipeline. The experimentation conducted and results obtained using our method are described in Section IV. We present the conclusions of the present research and highlight possible directions for future work in Section V.

II Related Work

Early works on feature learning for Music Information Retrieval (MIR) employ algorithms like Conditional Random Fields [25] or deep belief networks [35], whereas recent work in this direction is moving towards the usage of deep neural networks [36].

Fig. 1: Model pipeline, with a legend distinguishing convolution, pooling, flatten and fully connected layers

Work specifically on learning features for audio-to-score alignment has focused on the evaluation of current feature representations [23], learning features for alignment using a Multi-Layer Perceptron [22], and learning a mapping for several common audio representations based on a best-fit criterion [24]. Recently, transposition-invariant features were proposed for music alignment [4]; however, while these features are robust to transposition, they are sensitive to large tempo variations and underperform in such situations. A recent work on score following, a task related to audio-to-score alignment, is [12]. While they employ reinforcement learning to train a score follower in real time, we focus on robust offline alignment across various acoustic conditions using frame similarity learning.

Another direction which sets the context for our work is sound similarity; approaches to which include capturing music segment similarity using two-dimensional Fourier-magnitude coefficients [31], similarity network fusion to combine different frame-level features for hierarchy identification of repeated sections in music [37], and application of Siamese Neural Networks for content-based audio retrieval [28]. The closest work to ours which employs the notion of learned sound similarity for music alignment is [22], to the authors’ knowledge. While they use a Multi-Layer Perceptron to compute if two frames are the same or not, we compute frame similarity using Siamese CNNs. In addition to using an enhanced framework which is suitable for the similarity detection task, our work differs from them in that we also compute the extent of similarity in the form of non-binary distances and use this distance (or dissimilarity) matrix further for alignment. We additionally employ deep salience representations, which prove to be an effective method to improve alignment accuracy over our baseline models.

III Proposed Methodology

We propose a novel method for DTW-based audio-to-score alignment that uses Siamese neural networks. We additionally employ deep salience representations [7] to improve model performance in data-scarce conditions. We describe the method in detail in the subsequent subsections.

III-A Siamese Convolutional Neural Networks

The standard feature representation choice for music alignment is a time-chroma representation [6] generated from the log-frequency spectrogram, which is not trainable on real data, and thereby not adaptable to different acoustic settings. We override the feature engineering step and focus on learning frame similarity using Convolutional Neural Networks (CNNs), since they can jointly optimize the representation of input data conditioned on the similarity measure being used. We employ a Siamese Convolutional Neural Network, a class of neural network architectures that contains two or more identical subnetworks [8] for this task.

We train a Siamese CNN, akin to that prototyped in [2], to compute a frame similarity matrix that is fed to DTW to generate the alignment. Figure 1 gives an overview of our model pipeline. In order to keep the modality constant, we first convert the MIDI files to audio through FluidSynth [20] using piano soundfonts. The two audio inputs are converted to a low-level spectral representation using a Short-Time Fourier Transform, with a hop size of 23 ms and a Hamming window of size 46 ms. Our training data contains synchronized audio and MIDI files, so it is straightforward to extract matching frame pairs. For each matching pair, we randomly select a non-matching pair (using MIDI information) in order to have a balanced training set. The inputs to the Siamese network are labelled frame pairs from the performance audio and the synthesized MIDI respectively. We employ the contrastive loss function [18] while training our models. We choose this formulation over a standard classification loss function like cross entropy since our objective is to differentiate between two audio frames. Let $X = (X_1, X_2)$ be the pair of inputs $X_1$ and $X_2$, $W$ be the set of parameters to be learnt and $Y$ be the target binary label ($Y = 0$ if they match and 1 otherwise). Task-specific loss functions have shown promising results in the fields of image processing and natural language processing [32, 3]. The contrastive loss function for each tuple $(Y, X_1, X_2)$ is computed as follows:

$$L(W, Y, X_1, X_2) = (1 - Y)\,\frac{1}{2}\,D_W^2 + Y\,\frac{1}{2}\,\{\max(0,\, m - D_W)\}^2$$

where $m$ is the margin for dissimilarity and $D_W$ is the Euclidean distance between the outputs of the subnetworks. Pairs with dissimilarity greater than $m$ do not contribute to the loss function. More formally, $D_W$ can be expressed as follows:

$$D_W(X_1, X_2) = \lVert G_W(X_1) - G_W(X_2) \rVert_2$$

where $G_W$ is the output of each twin subnetwork for the inputs $X_1$ and $X_2$. Since it is a distance-based loss, it tries to ensure that semantically similar examples are embedded close to each other, which is a desirable trait for extracting alignments.
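As a rough illustration of the contrastive loss described above, the following sketch evaluates it for a single pair of embedding vectors in plain Python. The function names, the example inputs and the margin value of 1.0 are assumptions made for illustration; the text does not state the margin that was used.

```python
import math


def euclidean_distance(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """Contrastive loss for one pair; label 0 = matching, 1 = non-matching.

    margin=1.0 is an assumed value, not taken from the paper.
    """
    d = euclidean_distance(emb_a, emb_b)
    # Matching pairs are penalised by their squared distance.
    similar_term = (1 - label) * 0.5 * d ** 2
    # Non-matching pairs are penalised only while they lie inside the margin.
    dissimilar_term = label * 0.5 * max(0.0, margin - d) ** 2
    return similar_term + dissimilar_term
```

A non-matching pair whose embeddings are already further apart than the margin yields a loss of zero, which mirrors the statement that such pairs do not contribute to the loss.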

The Siamese network thus learns to classify the sample pairs as similar or dissimilar. This is done for each audio frame pair, and the similarity matrix thus generated is then passed on to a DTW-based algorithm to generate the alignment path. DTW generates an alignment between two sequences $A = (a_1, \dots, a_M)$ and $B = (b_1, \dots, b_N)$ by comparing them using a local cost function at each point, with the goal of minimizing the overall cost. The path which yields this minimum overall cost is then the optimal alignment between the two sequences. Formally, it can be represented as follows:

$$D(i, j) = d(i, j) + \min\{D(i-1, j),\; D(i, j-1),\; D(i-1, j-1)\}$$

where $d(i, j)$ is the distance measure (local cost) between points $a_i$ and $b_j$, and $D(i, j)$ is the total cost of the path which generates the optimal alignment between the sequences $a_1, \dots, a_i$ and $b_1, \dots, b_j$. We employ Euclidean distance as our distance measure and the DTW framework from [17] to compute the warping paths.
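The dynamic programming step behind DTW can be sketched in a few lines of plain Python. This is a minimal textbook version with unit step weights, written for illustration; it is not the DTW package of [17], and the function name `dtw` and the list-of-rows input format are assumptions.

```python
def dtw(cost):
    """Return (total_cost, path) for a local cost matrix given as a list of rows."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    # Forward pass: accumulate the cheapest cost of reaching every cell.
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = cost[i][j] + best_prev
    # Backward pass: follow the cheapest predecessors from the end to the start.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path
```

The returned path is a list of (row, column) index pairs, that is, pairs of frame indices from the two sequences. Both indices never decrease along the path, which is the monotonicity property of DTW.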

III-B Deep Salience Representations

Fig. 2: Salience representations to address data sparsity

We employ deep salience representations [7] for effective training of our models. These are time-frequency representations aimed at estimating the likelihood of a pitch being present in the audio. Figure 2 shows an example of a salience representation. The primary motivation behind using such a representation is that it de-emphasizes non-pitched content and emphasizes harmonic content, thereby aiding training in data-scarce conditions. We employ the model proposed by [7], trained to learn a series of convolutional filters, constraining the target salience representation to have values between 0 and 1, with larger values corresponding to time-frequency bins where fundamental frequencies are present. The model is trained to minimize the cross entropy loss as follows:

$$L(y, \hat{y}) = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$$

where both the target $y$ and the prediction $\hat{y}$ are continuous values between 0 and 1.
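To make the cross entropy with continuous targets concrete, the sketch below evaluates it per time-frequency bin and averages over a small map. The function names, the clipping constant `eps` and the use of a mean over bins are assumptions for illustration and are not taken from [7].

```python
import math


def soft_cross_entropy(target, predicted, eps=1e-7):
    """Binary cross entropy for one bin, with target and prediction in [0, 1]."""
    # Clip the prediction away from 0 and 1 so that the logarithms stay finite.
    p = min(max(predicted, eps), 1.0 - eps)
    return -target * math.log(p) - (1.0 - target) * math.log(1.0 - p)


def salience_loss(target_map, predicted_map):
    """Mean cross entropy over all bins of two equally shaped 2-D lists."""
    losses = [
        soft_cross_entropy(t, p)
        for t_row, p_row in zip(target_map, predicted_map)
        for t, p in zip(t_row, p_row)
    ]
    return sum(losses) / len(losses)
```

Unlike the usual classification setting, the target here may take any value between 0 and 1, so the loss is minimised when the prediction equals the target rather than only at 0 or 1.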

We compare the performance obtained using salience representations with that obtained using the Short-Time Fourier Transform (STFT) and the Constant-Q Transform (CQT) of the raw audio, which serve as input representations for comparison. For the STFT, we employ a hop size of 23 ms and a Hamming window of size 46 ms. We employ a CQT with 24 bins per octave, with the first bin corresponding to a frequency of 65.4 Hz (MIDI note C2).
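The analysis parameters above translate into sample counts and bin frequencies as in the following sketch. The sample rate of 22050 Hz is an assumption, since the text does not state it, and the helper names are hypothetical.

```python
SAMPLE_RATE = 22050  # assumed sample rate in Hz; not stated in the text


def ms_to_samples(duration_ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to the nearest number of samples."""
    return round(duration_ms * sample_rate / 1000.0)


def cqt_bin_frequency(k, f_min=65.4, bins_per_octave=24):
    """Centre frequency in Hz of CQT bin k, where k = 0 is the lowest bin."""
    return f_min * 2.0 ** (k / bins_per_octave)


hop_samples = ms_to_samples(23)     # hop size of 23 ms
window_samples = ms_to_samples(46)  # Hamming window of 46 ms
```

At this assumed sample rate, powers of two such as 512 and 1024 samples correspond to roughly 23 ms and 46 ms, which may be what the stated durations refer to. With 24 bins per octave, bin 24 lies one octave above the first bin, at 130.8 Hz.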

IV Experiments and Results

IV-A Experimental Setup

We employ the MAPS database [14], the Saarland database [30] and the Mazurka dataset [34] for our experiments. From the original MAPS database, which contains synthesized MIDI-aligned audio for a range of acoustic settings, we select the subset MUS containing complete pieces of piano music, and append it to the Saarland database. We split the resultant database comprising 288 recordings randomly into sets of 230 and 58 recordings. These sets form our training and validation sets respectively. We test the performance of our models on the Mazurka dataset [34], which contains recordings of Chopin’s Mazurkas dating from 1902 to the early 2000s, thereby spanning across various acoustic settings. This dataset contains annotations of beat times for five Mazurka pieces. The alignment error for these pieces has a standard deviation of 11 ms.
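A random split of this kind can be sketched as follows. The recording identifiers, the seed and the function name are hypothetical; only the set sizes of 230 and 58 come from the text.

```python
import random


def split_recordings(recordings, n_train=230, seed=0):
    """Shuffle a list of recording identifiers and split it into train/validation."""
    shuffled = list(recordings)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder identifiers standing in for the 288 recordings.
train, valid = split_recordings([f"rec_{i:03d}" for i in range(288)])
```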

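| Type of layer | Input size | Kernels | Kernel size |
|---|---|---|---|
| Convolution | | 64 | |
| Max-Pooling | | 1 | |
| Convolution | | 128 | |
| Max-Pooling | | 1 | |
| Convolution | | 256 | |
| Max-Pooling | | 1 | |
| Convolution | | 512 | |
| Flatten | | - | - |
| Fully Connected | | - | - |

TABLE I: Architecture of our model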
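| Model | Binary, 25 ms | Binary, 50 ms | Binary, 100 ms | Binary, 200 ms | Distance, 25 ms | Distance, 50 ms | Distance, 100 ms | Distance, 200 ms |
|---|---|---|---|---|---|---|---|---|
| MATCH [9] | - | - | - | - | 64.8 | 72.1 | 77.6 | 83.7 |
| | - | - | - | - | 62.9 | 70.5 | 76.3 | 82.4 |
| MLP [22] | 63.8 | 69.5 | 77.2 | 83.4 | - | - | - | - |
| | 65.6 | 71.9 | 78.1 | 84.8 | 67.2 | 73.4 | 78.7 | 85.6 |
| | 66.4 | 73.1 | 78.7 | 85.3 | 68.1 | 74.8 | 80.1 | 86.7 |
| | 67.1 | 74.6 | 79.2 | 86.1 | 69.4 | 75.1 | 80.7 | 87.2 |
| | 68.2 | 75.3 | 81.4 | 87.8 | 70.3 | 76.7 | 82.1 | 88.4 |
| | 67.9 | 74.4 | 80.8 | 86.7 | 69.6 | 75.4 | 81.6 | 87.9 |
| | 69.4 | 76.4 | 81.2 | 87.5 | 71.7 | 78.2 | 83.3 | 90.1 |

TABLE II: Results of our models (alignment accuracy in % using the binary matrix and the distance matrix)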

Our Siamese model has four convolutional layers of varying dimensionality followed by a fully connected layer to generate the similarity output. The outputs of each layer are passed through rectified linear units in order to add non-linearity, followed by batch normalization before being passed as inputs to the next layer. The detailed architecture of our model is given in Table I.

We conduct experiments using two different mechanisms for computing the similarity matrix:

  • Using binary labels: We directly employ the outputs of the Siamese CNN, whereby 0 and 1 correspond to similar and dissimilar pairs respectively.

  • Using distances: We employ the distance computed as part of the loss, which directly corresponds to the dissimilarity between the two inputs.
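The two variants listed above can be sketched from frame embeddings as follows. Deriving the binary labels by thresholding the distance, the threshold value of 0.5 and the function names are assumptions for illustration; in the paper the binary labels come directly from the network output.

```python
import math


def distance_matrix(audio_embs, score_embs):
    """Pairwise Euclidean distances; rows = audio frames, columns = score frames."""
    return [[math.dist(a, s) for s in score_embs] for a in audio_embs]


def binary_matrix(distances, threshold=0.5):
    """Threshold distances into labels: 0 = similar, 1 = dissimilar.

    threshold=0.5 is an assumed value for illustration.
    """
    return [[0 if d < threshold else 1 for d in row] for row in distances]
```

Either matrix can serve as the local cost for DTW, since in both cases smaller entries indicate a better match. The distance matrix keeps the graded information that the binary matrix discards.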

We generate an alignment path through this matrix using DTW, through a readily available implementation in Python [17]. We name our Siamese models trained without data augmentation after the feature representation used during training. We also report results obtained using data augmentation: we generate 20% additional training samples by applying a random pitch shift, bounded to a fixed number of cents, using librosa [29]. The augmented models are named analogously for the CQT and the salience representations.
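The augmentation step can be planned as in the sketch below, which selects 20% of the training samples and draws a random shift in cents for each. The bound `max_cents` is a hypothetical parameter because the value used in the paper is not recoverable from the text; the symmetric range, the seed and the function names are likewise assumptions.

```python
import random


def cents_to_ratio(cents):
    """Frequency ratio for a pitch shift in cents (1200 cents = one octave)."""
    return 2.0 ** (cents / 1200.0)


def plan_augmentation(n_samples, max_cents, fraction=0.2, seed=0):
    """Pick `fraction` of the sample indices and a random shift in cents for each.

    max_cents is a hypothetical parameter; the bound used in the paper is
    not recoverable from the text.
    """
    rng = random.Random(seed)
    n_extra = round(n_samples * fraction)
    # Sample without replacement so that no item is augmented twice.
    chosen = rng.sample(range(n_samples), n_extra)
    return [(idx, rng.uniform(-max_cents, max_cents)) for idx in chosen]
```

The shift itself could then be applied with a routine such as `librosa.effects.pitch_shift`, which takes the shift in (fractional) semitones, so a value in cents would be divided by 100 first.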

IV-B Results and Discussion

We compare the performance of our models with MATCH [9]; a DTW algorithm using Chroma features [6]; and the Multi-Layer Perceptron model proposed by [22]. For each score event $i$, we compute the error $e_i = t^{e}_i - t^{r}_i$, defined as the time difference between the estimated alignment time $t^{e}_i$ and the reference alignment time $t^{r}_i$ of that event. We report accuracy as the percentage of events which are aligned within an error of up to 25 ms, 50 ms, 100 ms and 200 ms respectively. The results obtained by our models are given in Table II.
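The evaluation measure described above can be sketched as follows. The function name and the convention of passing event times in seconds are assumptions; the thresholds are those reported in the text, and the error is compared in absolute value.

```python
def alignment_accuracy(estimated, reference, thresholds_ms=(25, 50, 100, 200)):
    """Percentage of events whose absolute error is within each threshold.

    estimated and reference are equally long lists of event times in seconds.
    """
    errors_ms = [abs(e - r) * 1000.0 for e, r in zip(estimated, reference)]
    return {
        t: 100.0 * sum(err <= t for err in errors_ms) / len(errors_ms)
        for t in thresholds_ms
    }
```

For example, four events with absolute errors of 0 ms, 31.25 ms, 62.5 ms and 125 ms give accuracies of 25%, 50%, 75% and 100% at the four thresholds.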

Our models outperform DTW-based algorithms that employ handcrafted features as well as an MLP framework which learns binary similarity labels (Table II, rows 1-5). The CQT representation yields better results than the STFT representation; we argue that this is due to the nature of the CQT, which is a more musically meaningful representation. Our Siamese model trained using the Chroma representation outperforms the DTW-based method using the same representation, suggesting that frame similarity learnt from real data is effective at generating robust alignment. Additionally, we observe the trend that the models trained using a non-binary distance matrix outperform those trained on binary matrices (Table II, columns 6-9). We speculate that thresholding the similarity into binary labels discards potentially useful information, and that the distances help the DTW algorithm make better long-term decisions. Both salience representations and data augmentation prove effective at improving the performance of our model over the baseline, with salience representations contributing greater improvements. We posit that using salience representations makes it easier for the model to learn meaningful features from the input representations, since they emphasize pitched content. Improvements using data augmentation can be attributed to the fact that pianos are not always tuned to standard concert pitch in the real world, and often the relative intervals are also not tuned perfectly; hence comparison with MIDI files in such cases might lead to false negatives. Data augmentation reduces the disparity between our training and test conditions by simulating more realistic conditions in our training data. A combination of distance matrix, salience representations and data augmentation yields the best results, as can be seen from Table II, row 8, columns 6-9.

Our results demonstrate that frame similarity learning using Siamese neural networks is a promising method for audio-to-score alignment. The principal advantage of this approach over traditional feature choices (like chroma features or MFCCs) is the ability to learn directly from data, which provides higher relevance and adaptability. Both the Siamese network and the pitch salience network are trainable, and thereby adaptable to real world conditions. We plan to explore domain adaptation of our models in the future. A limitation of our method is that it cannot handle structural changes, since DTW generates a monotonically increasing warping path. This could potentially be mitigated by employing an enhanced DTW framework like jump-DTW [16] alongside our Siamese model.

V Conclusion and Future Work

We presented a novel method for offline audio-to-score alignment using learned similarities via a Siamese convolutional network architecture. We demonstrated that our approach is capable of generating robust alignments for piano music across various acoustic conditions. Our models outperform traditional methods based on Dynamic Time Warping that rely on handcrafted features, as well as a Multi Layer Perceptron model which learns binary similarity between audio frames. We also demonstrated that salience representations and data augmentation are effective techniques to improve alignment accuracy. In the future we plan to incorporate attention into the convolutional models to aid training and improve performance. We would also like to explore other model architectures and work on learning the features as well as the alignments in a completely end-to-end manner.


  • [1] R. Agrawal and S. Dixon (2019) A hybrid approach to audio-to-score alignment. ML4MD at ICML.
  • [2] R. Agrawal and S. Dixon (2020) A hybrid approach to audio-to-score alignment. arXiv preprint arXiv:2007.14333.
  • [3] T. Amirhossein, R. R. Agrawal, R. Chatterjee, M. Negri, and M. Turchi (2018) Multi-source transformer with combined losses for automatic post editing. In Third Conference on Machine Translation (WMT), pp. 859–865.
  • [4] A. Arzt and S. Lattner (2018) Audio-to-score alignment using transposition-invariant features. In International Society for Music Information Retrieval.
  • [5] A. Arzt, G. Widmer, and S. Dixon (2012) Adaptive distance normalization for real-time music tracking. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 2689–2693.
  • [6] M. A. Bartsch and G. H. Wakefield (2005) Audio thumbnailing of popular music using chroma-based representations. IEEE Transactions on Multimedia 7 (1), pp. 96–104.
  • [7] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello (2017) Deep salience representations for f0 estimation in polyphonic music. In International Society for Music Information Retrieval (ISMIR), pp. 63–70.
  • [8] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a siamese time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744.
  • [9] S. Dixon and G. Widmer (2005) MATCH: a music alignment tool chest. In International Society for Music Information Retrieval, pp. 492–497.
  • [10] S. Dixon (2005) An on-line time warping algorithm for tracking musical performances. In IJCAI, pp. 1727–1728.
  • [11] M. Dorfer, J. Hajič Jr, A. Arzt, H. Frostel, and G. Widmer (2018) Learning audio–sheet music correspondences for cross-modal retrieval and piece identification. Transactions of the International Society for Music Information Retrieval 1 (1).
  • [12] M. Dorfer, F. Henkel, and G. Widmer (2018) Learning to listen, read, and follow: score following as a reinforcement learning game. In International Society for Music Information Retrieval.
  • [13] D. Eck and J. Schmidhuber (2002) A first look at music composition using LSTM recurrent neural networks. In IDSIA Technical Report IDSIA-07-02.
  • [14] V. Emiya, R. Badeau, and B. David (2009) Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing 18 (6), pp. 1643–1654.
  • [15] S. Ewert, M. Müller, and P. Grosche (2009) High resolution audio synchronization using chroma onset features. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1869–1872.
  • [16] C. Fremerey, M. Müller, and M. Clausen (2010) Handling repeats and jumps in score-performance synchronization. In International Society for Music Information Retrieval (ISMIR), pp. 243–248.
  • [17] T. Giorgino et al. (2009) Computing and visualizing dynamic time warping alignments in R: the dtw package. Journal of Statistical Software 31 (7), pp. 1–24.
  • [18] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 1735–1742.
  • [19] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck (2018) Onsets and frames: dual-objective piano transcription. In ISMIR.
  • [20] D. Henningsson and F. D. Team (2011) FluidSynth real-time and thread safety challenges. In Proceedings of the 9th International Linux Audio Conference, Maynooth University, Ireland, pp. 123–128.
  • [21] J. Hwang, S. Kung, M. Niranjan, and J. C. Principe (1997) The past, present, and future of neural networks for signal processing. IEEE Signal Processing Magazine 14 (6), pp. 28–48.
  • [22] Ö. İzmirli and R. B. Dannenberg (2010) Understanding features and distance functions for music sequence alignment. In International Society for Music Information Retrieval (ISMIR), pp. 411–416.
  • [23] C. Joder, S. Essid, and G. Richard (2010) A comparative study of tonal acoustic features for a symbolic level music-to-score alignment. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 409–412.
  • [24] C. Joder, S. Essid, and G. Richard (2011) Optimizing the mapping from a symbolic to an audio representation for music-to-score alignment. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 121–124.
  • [25] C. Joder, S. Essid, and G. Richard (2013) Learning optimal features for polyphonic audio-to-score alignment. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2118–2128.
  • [26] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio (1999) Object recognition with gradient-based learning. In Shape, Contour and Grouping in Computer Vision, pp. 319–345.
  • [27] F. Luo and R. Unbehauen (1999) Applied neural networks for signal processing.
  • [28] P. Manocha, R. Badlani, A. Kumar, A. Shah, B. Elizalde, and B. Raj (2018) Content-based representations of audio using siamese neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3136–3140.
  • [29] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto (2015) Librosa: audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, pp. 18–25.
  • [30] M. Müller, V. Konz, W. Bogler, and V. Arifi-Müller (2011) Saarland music data (SMD). In International Society for Music Information Retrieval: late breaking session.
  • [31] O. Nieto and J. P. Bello (2014) Music segment similarity using 2d-fourier magnitude coefficients. In ICASSP, pp. 664–668.
  • [32] C. Qi and F. Su (2017) Contrastive-center loss for deep neural networks. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 2851–2855.
  • [33] H. Sakoe and S. Chiba (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26 (1), pp. 43–49.
  • [34] C. S. Sapp (2007) Comparative analysis of multiple musical performances. In International Society for Music Information Retrieval (ISMIR), pp. 497–500.
  • [35] E. M. Schmidt, J. J. Scott, and Y. E. Kim (2012) Feature learning in dynamic environments: modeling the acoustic structure of musical emotion. In International Society for Music Information Retrieval (ISMIR), pp. 325–330.
  • [36] J. Thickstun, Z. Harchaoui, and S. Kakade (2016) Learning features of music from scratch. arXiv preprint arXiv:1611.09827.
  • [37] C. J. Tralie and B. McFee (2019) Enhanced hierarchical music structure annotations via feature level similarity fusion. In ICASSP, pp. 201–205.