A Hybrid Approach to Audio-to-Score Alignment

by   Ruchit Agrawal, et al.

Audio-to-score alignment aims at generating an accurate mapping between a performance audio and the score of a given piece. Standard alignment methods are based on Dynamic Time Warping (DTW) and employ handcrafted features. We explore the usage of neural networks as a preprocessing step for DTW-based automatic alignment methods. Experiments on music data from different acoustic conditions demonstrate that this method generates robust alignments whilst being adaptable at the same time.


1 Introduction and Motivation

Audio-to-score alignment is the task of finding the optimal mapping between a performance and the score for a given piece of music. Dynamic Time Warping (Sakoe and Chiba, 1978) has been the de facto standard for this task, typically incorporating handcrafted features (Dixon, 2005; Arzt, 2016). Recent advances in Music Information Retrieval have demonstrated the efficacy of Deep Neural Networks (DNNs) for a variety of tasks such as music generation (Eck and Schmidhuber, 2002), audio classification (Lee et al., 2009), onset detection (Marolt et al., 2002), music transcription (Marolt, 2001; Hawthorne et al., 2017), and music alignment (Dorfer et al., 2018a). The primary advantage of DNNs is that they can learn directly from data in an end-to-end manner, thereby removing the need for complex feature engineering. However, DNNs struggle to model long-term dependencies in temporal sequences (Bengio et al., 1994). End-to-end alignment is challenging because it involves multiple inputs of different modalities as well as very long-term dependencies. This paper is a step towards employing neural networks for music alignment. We present a hybrid approach to audio-to-score alignment, which uses a neural-network-based preprocessing step as a precursor to Dynamic Time Warping. This step computes a frame similarity matrix, which is then passed to a DTW algorithm that computes the optimal warping path through the matrix. The advantage of our method is that the preprocessing step is trainable, which makes the method adaptable to a particular acoustic setting, unlike traditional DTW-based methods that employ handcrafted features.

2 Related Work

Early works on feature learning for MIR tasks employ algorithms such as Hidden Markov Models (Joder et al., 2013) or deep belief networks (Schmidt et al., 2012). Recently, a number of works have explored feature learning for MIR using deep neural networks (Sigtia and Dixon, 2014; Oramas et al., 2017; Thickstun et al., 2016; Lattner et al., 2018; Arzt and Lattner, 2018; Korzeniowski and Widmer, 2016). Work specifically on learning features for audio-to-score alignment has mainly focused on an evaluation of current feature representations (Joder et al., 2010), learning the mapping for several common audio representations based on a best-fit criterion (Joder et al., 2011), and learning transposition-invariant features for alignment (Arzt and Lattner, 2018). Hamel et al. (2013) propose transfer learning for MIR tasks by learning a shared latent representation across the related tasks of classification and similarity detection. Weaknesses in standard approaches to choosing similarity thresholds have been explored by Kinnaird (2017). İzmirli and Dannenberg (2010) propose the idea of learning features for aligning two sequences of music, as opposed to employing a standard chroma-based feature representation. Nieto and Bello (2014) present a novel algorithm to capture music segment similarity using two-dimensional Fourier-magnitude coefficients. Korzeniowski and Widmer (2016) explore frame-level audio feature learning for chord recognition using artificial neural networks.

3 Experiments and Results

The standard feature representation choice for music alignment is a time-chroma representation generated from the log-frequency spectrogram. Since this representation relies only on pitch class information, it ignores variations in timbre and instrumentation, and is not adaptable to different acoustic settings. Using neural networks lets us bypass manual feature engineering whilst providing the capability to adapt to different settings. Rather than extracting a feature representation from the inputs, we focus on the task of constructing a frame-similarity matrix. This matrix is then passed on to a DTW-based algorithm to generate the alignments. We employ a “Siamese” Convolutional Neural Network (Bromley et al., 1994) for this task. This framework has shown promising results in the field of computer vision for computing image similarity (Zagoruyko and Komodakis, 2015), as well as in natural language processing for learning sentence similarity (Mueller and Thyagarajan, 2016) and in speaker identification (Lukic et al., 2016), amongst others.

We train our Siamese CNN model to determine whether two patches, one from the audio and one from the synthesized MIDI, “match” or not. A similar approach has been used by İzmirli and Dannenberg (2010); however, they use a Multi-Layer Perceptron (MLP) framework to decide whether two frames are the same or not. In addition to using a framework better suited to this task, our work differs from theirs in that we also compute non-binary similarity values and use the resulting distance matrix for alignment. We explain the preprocessing steps below:

In order to keep the modality constant, we first convert the MIDI files to audio using FluidSynth (Henningsson and Team, 2011). We then transform the frame-level audio patches into spectrogram images using librosa (McFee et al., 2015), a Python library for audio and music analysis. We conduct experiments using both the Short-Time Fourier Transform (STFT) and the Constant-Q Transform (CQT) of the raw audio. We briefly explain our choice of loss function below:
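
Before turning to the loss function, the spectrogram step can be illustrated with a minimal NumPy-only sketch of a log-magnitude STFT. This is not the librosa pipeline used in the experiments: the helper name `log_stft`, the frame length, the hop size, and the test tone are all assumptions chosen only for illustration.

```python
import numpy as np

def log_stft(signal, frame_len=1024, hop=512):
    """Log-magnitude STFT with shape (frame_len // 2 + 1, n_frames).

    frame_len and hop are illustrative assumptions, not the paper's settings.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Real FFT of each frame; transpose so that rows are frequency bins.
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    # log(1 + x) compresses the dynamic range and is safe at zero.
    return np.log1p(magnitude)

# Hypothetical test tone: one second of A4 (440 Hz) sampled at 22050 Hz.
sr = 22050
t = np.arange(sr) / sr
spec = log_stft(np.sin(2 * np.pi * 440.0 * t))

# The strongest frequency bin should sit within one bin width of 440 Hz.
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 1024
```

Each column of `spec` is one time frame. A patch fed to a similarity model would be a short run of adjacent columns, with the patch width being a further design choice that the text does not specify.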

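The objective of our Siamese CNN model is not to classify the inputs, but to differentiate between them. Hence, a contrastive loss function is much better suited to this task than a standard classification loss function like cross entropy. The contrastive loss function is computed as follows:

$$L(W, Y, X_1, X_2) = (1 - Y)\,\frac{1}{2}\,D_W^2 + Y\,\frac{1}{2}\,\{\max(0,\, m - D_W)\}^2$$

where $Y$ is the pair label (0 for a similar pair, 1 for a dissimilar pair), $m$ is the margin, and $D_W$ is the Euclidean distance between the outputs of the two Siamese twin networks. More formally, $D_W$ can be expressed as follows:

$$D_W(X_1, X_2) = \lVert G_W(X_1) - G_W(X_2) \rVert_2$$

where $G_W$ is the output of each of the twin networks and $X_1$ and $X_2$ are the two inputs.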
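
The loss described above can be sketched in a few lines of plain Python, assuming the common margin-based form of the contrastive loss with label 0 for similar pairs and 1 for dissimilar pairs. The margin value and the two-dimensional embeddings are hypothetical stand-ins for the twin-network outputs, not values from the experiments.

```python
import math

def euclidean_distance(g1, g2):
    """Euclidean distance between the two twin-network outputs."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g1, g2)))

def contrastive_loss(g1, g2, label, margin=1.0):
    """Contrastive loss for one pair; label 0 = similar, 1 = dissimilar.

    margin=1.0 is an assumed value, not taken from the paper.
    """
    d = euclidean_distance(g1, g2)
    similar_term = 0.5 * d ** 2                        # pulls similar pairs together
    dissimilar_term = 0.5 * max(0.0, margin - d) ** 2  # pushes dissimilar pairs apart
    return (1 - label) * similar_term + label * dissimilar_term

# Hypothetical 2-D embeddings standing in for twin-network outputs.
close_pair = ([0.0, 0.0], [0.3, 0.4])  # distance 0.5
far_pair = ([0.0, 0.0], [3.0, 4.0])    # distance 5.0

loss_similar_close = contrastive_loss(*close_pair, label=0)
loss_dissimilar_close = contrastive_loss(*close_pair, label=1)
loss_dissimilar_far = contrastive_loss(*far_pair, label=1)
```

A dissimilar pair that already lies beyond the margin contributes no loss, so training effort concentrates on dissimilar pairs that are still too close and on similar pairs that are still too far apart.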

We train the model on the MAPS database (Emiya et al., 2010), where we have MIDI-aligned audio for a range of acoustic settings. We only select the subset containing the recordings played using a real piano, and discard the ones which are software-synthesized. We compute the similarity matrix using two mechanisms:

  • Using binary labels: For this we employ the output labels of our Siamese CNN. 0’s imply similar pairs, 1’s imply dissimilar pairs.

  • Using distances: For this we calculate the Euclidean distance $D_W$ between the outputs of the two twin networks. The distance directly corresponds to the dissimilarity between the two inputs: the higher the value of $D_W$, the higher the dissimilarity.
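
The two mechanisms above can be sketched as follows, assuming each audio frame and each score frame has already been mapped to an embedding vector. The embeddings are hypothetical, and the fixed threshold is only an assumed stand-in for the binary output of the Siamese CNN.

```python
import math

def distance_matrix(audio_emb, score_emb):
    """Pairwise Euclidean distances; rows = audio frames, columns = score frames."""
    return [[math.dist(a, s) for s in score_emb] for a in audio_emb]

def binary_matrix(distances, threshold=0.5):
    """0 = similar, 1 = dissimilar.

    The threshold is an assumed stand-in for the network's binary output.
    """
    return [[0 if d < threshold else 1 for d in row] for row in distances]

# Hypothetical embeddings for 3 audio frames and 2 score frames.
audio_emb = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
score_emb = [[0.0, 0.0], [1.0, 1.0]]

dist = distance_matrix(audio_emb, score_emb)
binary = binary_matrix(dist)
```

The distance matrix keeps graded information about how dissimilar two frames are, whereas the binary matrix collapses it to a match or non-match decision.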

We then generate an alignment path through this matrix using fast-DTW (Salvador and Chan, 2007), through a readily available DTW implementation in Python. We test the performance of our model on a subset of the Mazurka dataset (Sapp, 2007), which contains recordings from various acoustic settings. The results obtained using both methods are given in Table 1.
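
The path-finding step can be illustrated with a minimal sketch of plain, exact DTW over a precomputed cost matrix. This is not the fast-DTW approximation used in the experiments, and the function name `dtw_path` and the toy matrix are hypothetical.

```python
def dtw_path(cost):
    """Exact DTW over a precomputed cost matrix (a list of rows).

    Returns (total_cost, path), where path is a list of (row, col) pairs
    running from (0, 0) to the bottom-right cell.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = cheapest accumulated cost of any path ending at cell (i-1, j-1).
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1],  # diagonal step
                acc[i - 1][j],      # advance in rows only
                acc[i][j - 1],      # advance in columns only
            )
    # Backtrack from the end, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path

# Toy binary matrix: 4 audio frames (rows) against 3 score frames (columns),
# with 0 = similar and 1 = dissimilar.
cost = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 0],
]
total, path = dtw_path(cost)
```

In this toy case the second score frame is held for two audio frames, which is the kind of local tempo variation that the warping path is meant to absorb.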

Type of matrix    STFT    CQT
Binary            76.3    78.6
Distance          78.1    81.4
Table 1: Alignment accuracy (in %)

Our results suggest that this method is a promising approach to alignment, especially in non-standard acoustic conditions, since the preprocessing Siamese network can be trained on such data, unlike the handcrafted features used by standard DTW-based algorithms.

4 Conclusion and Future Work

We demonstrated that our hybrid approach to audio-to-score alignment is capable of generating robust alignments across various acoustic conditions. The advantage of our method is that it is adaptable to a particular acoustic setting without requiring a large amount of labeled training data. In the future, we would like to conduct an exhaustive evaluation of this approach on musically relevant parameters and analyze its limitations. We would also like to work on learning the features as well as the alignments in a completely end-to-end manner.


  • A. Arzt and S. Lattner (2018) Audio-to-score alignment using transposition-invariant features. arXiv preprint arXiv:1807.07278. Cited by: §2.
  • A. Arzt, G. Widmer, and S. Dixon (2012) Adaptive distance normalization for real-time music tracking. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 2689–2693.
  • A. Arzt (2016) Flexible and robust music tracking. Ph.D. Thesis, Universität Linz, Linz. Cited by: §1.
  • S. Bell and K. Bala (2015) Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG) 34 (4), pp. 98.
  • Y. Bengio, P. Simard, P. Frasconi, et al. (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §1.
  • J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a “siamese” time delay neural network. In Advances in neural information processing systems, pp. 737–744. Cited by: §3.
  • J. J. Carabias-Orti, F. J. Rodríguez-Serrano, P. Vera-Candeas, N. Ruiz-Reyes, and F. J. Cañadas-Quesada (2015) An audio to score alignment framework using spectral factorization and dynamic time warping. In International Society for Music Information Retrieval, pp. 742–748.
  • S. Dixon (2005) An on-line time warping algorithm for tracking musical performances. In IJCAI, pp. 1727–1728. Cited by: §1.
  • M. Dorfer, A. Arzt, and G. Widmer (2016) Towards score following in sheet music images. arXiv preprint arXiv:1612.05050.
  • M. Dorfer, A. Arzt, and G. Widmer (2017) Learning audio-sheet music correspondences for score identification and offline alignment. arXiv preprint arXiv:1707.09887.
  • M. Dorfer, J. Hajič Jr, A. Arzt, H. Frostel, and G. Widmer (2018a) Learning audio–sheet music correspondences for cross-modal retrieval and piece identification. Transactions of the International Society for Music Information Retrieval 1 (1). Cited by: §1.
  • M. Dorfer, F. Henkel, and G. Widmer (2018b) Learning to listen, read, and follow: score following as a reinforcement learning game. arXiv preprint arXiv:1807.06391.
  • D. Eck and J. Schmidhuber (2002) A first look at music composition using lstm recurrent neural networks. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale 103. Cited by: §1.
  • V. Emiya, N. Bertin, B. David, and R. Badeau (2010) MAPS-a piano database for multipitch estimation and automatic transcription of music. Cited by: §3.
  • P. Hamel, M. Davies, K. Yoshii, and M. Goto (2013) Transfer learning in mir: sharing learned latent representations for music audio classification and similarity. Cited by: §2.
  • C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. Engel, S. Oore, and D. Eck (2017) Onsets and frames: dual-objective piano transcription. arXiv preprint arXiv:1710.11153. Cited by: §1.
  • D. Henningsson and F. D. Team (2011) FluidSynth real-time and thread safety challenges. In Proceedings of the 9th International Linux Audio Conference, Maynooth University, Ireland, pp. 123–128. Cited by: §3.
  • N. Hu, R. B. Dannenberg, and G. Tzanetakis (2003) Polyphonic audio matching and alignment for music retrieval. In Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on., pp. 185–188.
  • Ö. İzmirli and R. B. Dannenberg (2010) Understanding features and distance functions for music sequence alignment. In International Society for Music Information Retrieval, pp. 411–416. Cited by: §2, §3.
  • C. Joder, S. Essid, and G. Richard (2010) A comparative study of tonal acoustic features for a symbolic level music-to-score alignment. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 409–412. Cited by: §2.
  • C. Joder, S. Essid, and G. Richard (2011) Optimizing the mapping from a symbolic to an audio representation for music-to-score alignment. In 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 121–124. Cited by: §2.
  • C. Joder, S. Essid, and G. Richard (2013) Learning optimal features for polyphonic audio-to-score alignment. IEEE Transactions on Audio, Speech, and Language Processing 21 (10), pp. 2118–2128. Cited by: §2.
  • K. M. Kinnaird (2017) Examining musical meaning in similarity thresholds. In International Society for Music Information Retrieval, pp. 635–641. Cited by: §2.
  • F. Korzeniowski and G. Widmer (2016) Feature learning for chord recognition: the deep chroma extractor. arXiv preprint arXiv:1612.05065. Cited by: §2.
  • S. Lattner, M. Grachten, and G. Widmer (2018) Learning transposition-invariant interval features from symbolic music and audio. arXiv preprint arXiv:1806.08236. Cited by: §2.
  • H. Lee, P. Pham, Y. Largman, and A. Y. Ng (2009) Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems, pp. 1096–1104. Cited by: §1.
  • Y. Lukic, C. Vogt, O. Dürr, and T. Stadelmann (2016) Speaker identification and clustering using convolutional neural networks. In 2016 IEEE 26th international workshop on machine learning for signal processing (MLSP), pp. 1–6. Cited by: §3.
  • M. I. Mandel, G. E. Poliner, and D. P. Ellis (2006) Support vector machine active learning for music retrieval. Multimedia systems 12 (1), pp. 3–13.
  • M. Marolt, A. Kavcic, and M. Privosnik (2002) Neural networks for note onset detection in piano music. In Proceedings of the 2002 International Computer Music Conference, Cited by: §1.
  • M. Marolt (2001) SONIC: transcription of polyphonic piano music with neural networks. In Workshop on Current Research Directions in Computer Music, pp. 217–224. Cited by: §1.
  • B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto (2015) Librosa: audio and music signal analysis in python. In Proceedings of the 14th python in science conference, pp. 18–25. Cited by: §3.
  • J. Mueller and A. Thyagarajan (2016) Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence. Cited by: §3.
  • M. Muller, S. Ewert, and S. Kreuzer (2009) Making chroma features more robust to timbre changes. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1877–1880.
  • O. Nieto and J. P. Bello (2014) Music segment similarity using 2d-fourier magnitude coefficients. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 664–668. Cited by: §2.
  • S. Oramas, O. Nieto, F. Barbieri, and X. Serra (2017) Multi-label music genre classification from audio, text, and images using deep features. arXiv preprint arXiv:1707.04916. Cited by: §2.
  • E. Pampalk, S. Dixon, and G. Widmer (2003) On the evaluation of perceptual similarity measures for music. In Proceedings of the sixth international conference on digital audio effects (DAFx-03), pp. 7–12.
  • J. Pons, O. Nieto, M. Prockup, E. Schmidt, A. Ehmann, and X. Serra (2017) End-to-end learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520.
  • H. Sakoe and S. Chiba (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE transactions on acoustics, speech, and signal processing 26 (1), pp. 43–49. Cited by: §1.
  • S. Salvador and P. Chan (2007) Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis 11 (5), pp. 561–580. Cited by: §3.
  • C. S. Sapp (2007) Comparative analysis of multiple musical performances. In International Society for Music Information Retrieval, pp. 497–500. Cited by: §3.
  • E. M. Schmidt, J. J. Scott, and Y. E. Kim (2012) Feature learning in dynamic environments: modeling the acoustic structure of musical emotion. In International Society for Music Information Retrieval, pp. 325–330. Cited by: §2.
  • S. Sigtia and S. Dixon (2014) Improved music feature learning with deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6959–6963. Cited by: §2.
  • J. Thickstun, Z. Harchaoui, and S. Kakade (2016) Learning features of music from scratch. arXiv preprint arXiv:1611.09827. Cited by: §2.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638.
  • S. Zagoruyko and N. Komodakis (2015) Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4353–4361. Cited by: §3.