Automatic Lyrics Alignment and Transcription in Polyphonic Music: Does Background Music Help?

09/23/2019 ∙ by Chitralekha Gupta, et al.

Background music affects lyrics intelligibility of singing vocals in a music piece. Automatic lyrics alignment and transcription in polyphonic music are challenging tasks because the singing vocals are corrupted by the background music. In this work, we propose to learn music genre-specific characteristics to train polyphonic acoustic models. We first compare several automatic speech recognition pipelines for the application of lyrics transcription. We then present the lyrics alignment and transcription performance of music-informed acoustic models for the best-performing pipeline, and systematically study the impact of music genre and language model on the performance. With such genre-based approach, we explicitly model the music without removing it during acoustic modeling. The proposed approach outperforms all competing systems in the lyrics alignment and transcription tasks on several well-known polyphonic test datasets.






1 Introduction

Lyrics are an important component of music, and people often recognize a song by its lyrics. Lyrics contribute to the mood of the song [1], affect the opinion of a listener about the song [2], and even help in foreign language learning [3]. Automatic lyrics alignment is the task of finding the word boundaries of given lyrics in the polyphonic audio, while transcription is the task of recognizing the sung lyrics from the audio. These are useful for various music information retrieval applications such as generating karaoke scrolling lyrics, music video subtitling, query-by-singing [4], keyword spotting, and automatic indexing of music according to transcribed keywords [5].

Automatic lyrics transcription of singing vocals in the presence of background music remains an unsolved problem. One of the earliest studies [6] conducted frame-wise phoneme classification in polyphonic music, attempting to recognize three broad classes of phonemes in 37 popular songs using acoustic features such as MFCCs and PLPs. Mesaros et al. [7] adopted an automatic speech recognition (ASR) based approach for phoneme and word recognition of singing vocals in monophonic and polyphonic music.

Singing vocals are often highly correlated with the corresponding background music, resulting in overlapping frequency components [8]. To suppress the background accompaniment, many approaches have incorporated singing voice separation techniques as a pre-processing step [7, 9, 10]. However, this step makes the system dependent on the performance of the singing voice separation algorithm, as the separation artifacts may make the words unrecognizable. Moreover, it requires a separate training setup for the singing voice separation system. In our latest work [11], we trained acoustic models on a large amount of solo singing vocals and adapted them towards polyphonic music using a small amount of in-domain data: extracted singing vocals and polyphonic audio. We found that domain adaptation with polyphonic data outperforms adaptation with extracted singing vocals. This suggests that an acoustic model adapted with polyphonic data captures the spectro-temporal variations of vocals plus background music better than one adapted with extracted singing vocals, which carry distortions and artifacts.

Recently, Stoller et al. [12] presented a data-intensive end-to-end approach to lyrics transcription and alignment from raw polyphonic audio. However, end-to-end systems require a large amount of annotated polyphonic music training data to perform well, as seen in [12], which uses more than 44,000 songs with line-level lyrics annotations from Spotify’s proprietary music library, while publicly available resources for polyphonic music are limited.

Instead of treating the music as background noise, we hypothesize that acoustic models induced with music knowledge will help in lyrics alignment and transcription in polyphonic music. Condit-Schultz and Huron [13] found that genre-specific musical attributes such as instrumental accompaniment, singing vocal loudness, syllable rate, reverberation, and singing style influence human intelligibility of lyrics. In this study, we train genre-informed acoustic models for automatic lyrics transcription and alignment using an openly available polyphonic audio resource. We discuss several variations of the ASR components, such as the acoustic model and the language model (LM), and systematically study their impact on lyrics recognition and alignment accuracy on well-known polyphonic test datasets.

2 Lyrics alignment and transcription framework

Our goal is to build a dedicated ASR framework for automatic lyrics alignment and transcription. We explore and compare various approaches to understand the impact of the background music, and of the genre of the music, on acoustic modeling. We detail the design procedure followed to gauge the impact of these factors in the following subsections.

2.1 Singing vocal extraction vs. polyphonic audio

Earlier approaches to lyrics transcription have used acoustic models that were trained on solo-singing audio; singing vocal extraction was then applied to the test data [7, 9, 14]. Such acoustic models can be adapted with a small set of extracted vocals to reduce the mismatch between training and testing [11]. Now that a relatively large lyrics-annotated polyphonic dataset (DALI) [15] is available, we explore two approaches to acoustic modeling for the task of lyrics transcription and alignment: (1) apply singing vocal extraction to the polyphonic audio as a pre-processing step, and train acoustic models on the extracted singing vocals, or (2) train acoustic models on the lyrics-annotated polyphonic dataset directly. Approach (1) treats the background music as background noise and suppresses it. Approach (2), on the other hand, observes the combined effect of vocals and music on acoustic modeling. With these two approaches, we aim to answer the question of whether background music helps in acoustic modeling for lyrics transcription and alignment.

2.2 Standard ASR vs. end-to-end ASR

Given that the state-of-the-art lyrics alignment and transcription system is an end-to-end ASR trained on a large polyphonic audio dataset [12], we compare the performance of a standard ASR pipeline, comprising a separate acoustic model, language model, and pronunciation lexicon, with an end-to-end ASR on the lyrics transcription task, using the limited publicly available resources.

Hosoya et al. [4] described a lyrics recognition grammar using a finite state automaton (FSA) built from the lyrics in the queried database, so as to exploit the linguistic constraints in lyrics such as rhyming patterns, connecting words, and grammar [16]. However, these methods have been tested only on small solo-singing datasets, and their scalability to larger-vocabulary recognition of polyphonic songs remains to be tested. In this work, we investigate the performance of standard N-gram techniques, as also used for large-vocabulary ASR, for lyrics transcription in polyphonic songs. We train two N-gram models: (1) an in-domain LM (henceforth referred to as the lyrics LM) trained only on the lyrics from the training music data, and (2) a general LM trained on a large publicly available text corpus extracted from different resources.
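For illustration, an N-gram LM of this kind can be built from counts over the lyrics text; the minimal sketch below uses add-one smoothing and a naive whitespace tokenizer (both simplifying assumptions — production LMs would typically use Kneser-Ney smoothing via a dedicated toolkit):

```python
import math
from collections import Counter

def train_trigram_lm(lines):
    """Count unigrams, bigrams, and trigrams over lyric lines,
    with sentence-start/end markers."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for line in lines:
        toks = ["<s>", "<s>"] + line.lower().split() + ["</s>"]
        for i in range(2, len(toks)):
            uni[toks[i]] += 1
            bi[(toks[i - 1], toks[i])] += 1
            tri[(toks[i - 2], toks[i - 1], toks[i])] += 1
    return uni, bi, tri

def trigram_logprob(lm, w1, w2, w3, vocab_size):
    """Add-one smoothed log P(w3 | w1, w2)."""
    uni, bi, tri = lm
    return math.log((tri[(w1, w2, w3)] + 1) / (bi[(w1, w2)] + vocab_size))
```

The lyrics LM would be trained on the in-domain DALI lyrics text only, while the general LM uses a much larger out-of-domain corpus.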

The lyrics transcription quality of the standard ASR architecture is compared with that of an end-to-end system, both trained on the same polyphonic training data. The end-to-end ASR approach learns to map spectral audio features to characters without explicitly using a language model or pronunciation lexicon [17, 18, 19, 12]. The end-to-end ASR system is trained using a multiobjective learning framework with a connectionist temporal classification (CTC) objective function and an attention decoder appended to a shared encoder [19]. A joint decoding scheme combines the information provided by the hybrid model's CTC and attention decoder components to hypothesize the most likely recognition output.
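The joint decoding scheme amounts to a weighted interpolation of CTC and attention log-probabilities per hypothesis; the sketch below scores complete hypotheses only (in practice this happens inside beam search over prefixes), and the weight value is illustrative:

```python
def joint_score(log_p_ctc, log_p_att, ctc_weight=0.3):
    """Weighted interpolation of CTC and attention log-probabilities
    for one hypothesis; the 0.3 weight is illustrative, not the paper's value."""
    return ctc_weight * log_p_ctc + (1.0 - ctc_weight) * log_p_att

def rescore(hyps, ctc_weight=0.3):
    """hyps: list of (text, log_p_ctc, log_p_att) tuples.
    Returns the text with the highest joint score."""
    return max(hyps, key=lambda h: joint_score(h[1], h[2], ctc_weight))[0]
```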

2.3 Genre-informed acoustic modeling

The genre of a music piece is characterized by the background instrumentation, rhythmic structure, and harmonic content of the music [20]. Factors such as instrumental accompaniment, vocal harmonization, and reverberation are expected to interfere with lyric intelligibility, while predictable rhyme schemes and semantic context might improve it [13]. Condit-Schultz and Huron [13] found that across 12 different genres, the overall lyrics intelligibility for humans is 71.7% (i.e. the percentage of correctly identified words out of a total of 25,408 words), where “Death Metal” excerpts received intelligibility scores of zero, while “Pop” excerpts achieved scores close to 100%.

2.3.1 Genre-informed phone models

One main difference between genres that affects lyric intelligibility is the relative volume of the singing vocals compared to the background accompaniment. For example, as observed in [13], in metal songs the accompaniment is loud and interferes with the vocals, while it is relatively softer in jazz, country, and pop songs. Figure 1(a) is the spectrogram of a pop song excerpt showing loud singing vocals with visible singing voice harmonics. On the other hand, Figure 1(b) shows the dense spectrogram of a metal song that has amplified distortion on electric guitar and loud beats, with relatively soft singing vocals. Another difference between genres is the syllable rate. In [13], it was observed that rap songs, which have a higher syllable rate, show lower lyric intelligibility than other genres. The hip hop song in Figure 1(c) has clear and rapid vocalization corresponding to rhythmic speech in the presence of beats. We believe that genre-specific acoustic modeling of phones would capture the combined effect of background music and singing vocals, depending on the genre, and help in automatic lyrics transcription and alignment.

Figure 1: Spectrogram of 5 seconds audio clip of vocals with background music (sampling frequency:16kHz, spectrogram window size:64ms) for (a) Genre: Pop; Song: Like the Sun, by Explosive Ear Candy (timestamps: 00:16-00:21) (b) Genre: Metal; Song: Voices, by The Rinn (timestamps: 01:07-01:12), and (c) Genre: Hip Hop; Song: The Statement, by Wordsmith (timestamps: 00:16-00:21).
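Spectrograms with the settings of Figure 1 (16 kHz sampling rate, 64 ms window) can be computed as below; the half-window hop and Hann window are our assumptions, since the figure only specifies the window size:

```python
import numpy as np

def spectrogram(y, sr=16000, win_ms=64):
    """Magnitude spectrogram with the Figure 1 settings (16 kHz audio,
    64 ms window). Half-window hop and Hann window are our assumptions."""
    n = int(sr * win_ms / 1000)          # 1024 samples per window at 16 kHz
    hop = n // 2
    w = np.hanning(n)
    frames = [y[i:i + n] * w for i in range(0, len(y) - n + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)
```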

2.3.2 Genre-informed “silence” models

In speech, there are long-duration non-vocal segments that include silence, background noise, and breathing. In an ASR system, a silence acoustic model is trained separately for better alignment and recognition. Non-vocal segments, or musical interludes, also occur frequently in songs, especially between verses. However, in polyphonic songs, these non-vocal segments consist of different kinds of musical accompaniment that differ across genres. For example, a metal song typically consists of a mix of highly amplified distortion guitar and emphatic percussive instruments, a typical jazz song features saxophone and piano, and a pop song features guitar and drums. The spectro-temporal characteristics of the combination of instruments vary across genres, but are somewhat similar within a genre. Thus, we propose to train genre-specific non-vocal or “silence” models to characterize this variability of instrumentation across genres.
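Given line-level lyric boundaries such as those in DALI (Section 3.1), candidate non-vocal segments for training such genre-specific “silence” models can be taken as the gaps before, between, and after the annotated lines; a simple sketch:

```python
def nonvocal_segments(line_bounds, song_dur):
    """Gaps before the first annotated line, between lines, and after
    the last line. line_bounds: sorted (start, end) times in seconds."""
    gaps, prev_end = [], 0.0
    for start, end in line_bounds:
        if start > prev_end:
            gaps.append((prev_end, start))
        prev_end = end
    if song_dur > prev_end:
        gaps.append((prev_end, song_dur))
    return gaps
```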

2.3.3 Genre broadclasses

Broadclass  Characteristics                                        Genres
hiphop      rap, electronic music                                  Rap, Hip Hop, R&B
metal       loud and many background accompaniments, a mix of      Metal, Hard Rock, Electro,
            percussive instruments, amplified distortion, vocals   Alternative, Dance, Disco,
            not very loud, rock, psychedelic                       Rock, Indie
pop         vocals louder than the background accompaniments,      Country, Pop, Jazz, Soul,
            guitar, piano, saxophone, percussive instruments       Reggae, Blues
Table 1: Genre broadclasses grouping

Music has been divided into genres in many different and overlapping ways, based on shared sets of characteristics [20]. To build genre-informed acoustic models, we consider the shared characteristics between genres that affect lyrics intelligibility, such as the type of background accompaniment and the loudness of the vocals, and group all genres into three broad genre classes: pop, hiphop, and metal. Table 1 summarizes our genre broadclasses. We categorize songs containing some rap along with electronic music under the hiphop broadclass, which includes genres such as Rap, Hip Hop, and Rhythm & Blues. Songs with loud and dense background music are categorized as metal, which includes genres such as Metal and Hard Rock. Songs with clear and louder vocals under genres such as Pop, Country, Jazz, and Reggae are categorized as the pop broadclass.
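The grouping of Table 1 amounts to a many-to-one mapping from raw genre tags to the three broadclasses; a sketch (the exact tag spellings and the fallback to the majority class pop are our assumptions):

```python
# Many-to-one mapping implied by Table 1; tag spellings are assumptions.
GENRE_TO_BROADCLASS = {
    "rap": "hiphop", "hip hop": "hiphop", "r&b": "hiphop",
    "metal": "metal", "hard rock": "metal", "electro": "metal",
    "alternative": "metal", "dance": "metal", "disco": "metal",
    "rock": "metal", "indie": "metal",
    "country": "pop", "pop": "pop", "jazz": "pop",
    "soul": "pop", "reggae": "pop", "blues": "pop",
}

def broadclass(genre_tag, default="pop"):
    """Unseen tags fall back to the majority class (an assumption)."""
    return GENRE_TO_BROADCLASS.get(genre_tag.strip().lower(), default)
```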

3 Experimental Setup

We conduct three sets of experiments to demonstrate and compare different strategies for lyrics alignment and transcription: (1) train acoustic models using (a) extracted vocals and (b) polyphonic audio and compare their ASR performance, (2) compare a standard ASR system to an end-to-end ASR, both trained on polyphonic music audio, and (3) compare the performance of genre-informed acoustic models to genre-agnostic models, and also explore the impact of the lyrics LM and the general LM.

3.1 Datasets

All datasets used in the experiments are summarized in Table 2. The training data for acoustic modeling consists of English polyphonic songs from the DALI dataset [15]: 3,913 audio tracks (out of the 5,358 audio tracks in DALI, only 3,913 were in English with audio links accessible from Singapore), comprising 180,033 lyrics-transcribed lines with a total duration of 134.5 hours.

We evaluated the performance of lyrics alignment and transcription on three test datasets: Hansen’s polyphonic songs dataset (9 songs) [21] (the manual word boundaries of 2 songs in this dataset, “Clocks” and “I Kissed a Girl”, were not accurate, so we excluded them from the alignment study), Mauch’s dataset (20 songs) [22], and the Jamendo dataset (20 songs) [12]. Hansen’s and Mauch’s datasets were used in the MIREX lyrics alignment challenges of 2017 and 2018. These datasets consist mainly of Western pop songs with manually annotated word-level transcriptions and boundaries. The Jamendo dataset consists of English songs from diverse genres, along with their lyrics transcriptions and manual word boundary annotations.

Name          Content      Lyrics ground-truth                    Genre distribution
Training data
DALI [15]     3,913 songs  line-level boundaries, 180,033 lines   metal:1,576, pop:2,218
Test data
Hansen [21]   9 songs      word-level boundaries, 2,212 words     hiphop:1, metal:3, pop:5
Mauch [22]    20 songs     word-level boundaries, 5,052 words     hiphop:0, metal:8, pop:12
Jamendo [12]  20 songs     word-level boundaries, 5,677 words     hiphop:4, metal:7, pop:9
Table 2: Dataset description

The genre tags for most of the songs in the training dataset (DALI) are provided in its metadata, except for 840 songs. For these songs, we applied an automatic genre recognition implementation [23], which has 80% classification accuracy, to obtain their genre tags. We applied the genre groupings from Table 1 to assign a genre broadclass to every song. For the songs in the test datasets, we scanned the web to find their genre tags and categorized them into the three genre broadclasses. The distribution of the number of songs across the three genre broadclasses for all the datasets is shown in Table 2. This distribution in the training data is skewed towards pop, while hiphop is the most under-represented. However, we are limited by the amount of data available for training, with DALI being the only resource. Therefore, we assume this to be the naturally occurring distribution of songs across genres.

3.2 Vocal separated data vs. polyphonic data

As discussed in Section 2.1, we compare the strategies of training the acoustic models on vocal-extracted data vs. polyphonic data, as a way to find out whether the presence of background music helps. We use the reported best-performing model, M4, from the state-of-the-art Wave-U-Net based audio source separation algorithm [24, 25] for separating vocals from the polyphonic audio.

3.3 ASR framework: standard ASR vs. end-to-end ASR

The ASR system used in these experiments is trained using the Kaldi ASR toolkit [26]. A factorized time-delay neural network (TDNN-F) model [27] with additional convolutional layers (2 convolutional and 10 time-delay layers followed by a rank reduction layer) was trained according to the standard Kaldi recipe (version 5.4), using 40-dimensional MFCCs as acoustic features on an augmented version of the polyphonic training data (Section 3.1) [28]. The default hyperparameters provided in the standard recipe were used, and no hyperparameter tuning was performed during acoustic model training. A duration-based modified pronunciation lexicon is employed, as detailed in [29]. Two language models are trained: one on the transcriptions of the in-domain song lyrics of the DALI dataset (lyrics LM), and one on the open-source text corpus released as a part of the Librispeech corpus [30] (general LM).

The end-to-end system is trained using the ESPnet toolkit [31]. The shared encoder is a combination of two VGG [32] layers followed by a 5-layer BLSTM with subsampling [18] with 1024 units per layer. The attention-based decoder is a 2-layer decoder with 1024 units and coverage attention [33]. The batch size is set to 20 to avoid GPU memory overflow. The rest of the hyperparameters are consistent with the standard Librispeech recipe available in the toolkit (version 0.3.1). In pilot experiments, decoding with a language model at the default LM weight gave worse results than decoding without one; therefore, no LM is used during decoding, which also avoids parameter tuning on the test data.

3.4 Genre-informed acoustic modeling

We train three different sets of acoustic models corresponding to the three genre broadclasses, for (a) genre-informed “silence” (non-vocal) models and (b) genre-informed phone models. We extract the non-vocal segments at the start and the end of each line in the training data to train the “silence” models. For genre-informed phone modeling, we label the phone units in the phonetic lexicon with genre labels. For the alignment task, we use the same genre-informed phone models mapped to words without genre tags, i.e. the alignment system chooses the best-fitting phone models among all genres during forced alignment, which avoids requiring genre information for songs in the test sets.
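Genre-informed phone modeling can be realized by expanding each lexicon entry into one genre-tagged pronunciation variant per broadclass, so forced alignment is free to pick the best-fitting genre per word; a schematic sketch (the suffix convention is our own):

```python
def genre_tag_lexicon(lexicon, genres=("pop", "hiphop", "metal")):
    """lexicon: dict mapping word -> list of phones. Returns a lexicon
    with one genre-tagged pronunciation variant per broadclass, so that
    forced alignment can pick the best-fitting genre per word."""
    return {
        word: [[f"{p}_{g}" for p in phones] for g in genres]
        for word, phones in lexicon.items()
    }
```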

4 Results and Discussion

4.1 Singing vocal extraction vs. polyphonic audio

We compare the performance of a standard ASR trained on extracted singing vocals and on polyphonic audio for the tasks of lyrics alignment (Table 3) and transcription (Table 4). The alignment performance is measured as the mean absolute word boundary error (AE) for each song, averaged over all songs of a dataset, in seconds [12, 14], and the lyrics transcription performance is measured as the word error rate (WER), a standard performance measure for ASR systems. We see an improvement in both alignment and transcription performance with the ASR trained on polyphonic data over the one trained on vocal-extracted data, on all the test datasets. This indicates that there is value in modeling the combination of vocals and music, instead of treating the background music as noise and suppressing it. Although we have used a state-of-the-art vocal extraction algorithm, such techniques are still not perfect, and introduce artifacts and distortions in the extracted vocals, which explains the poor performance of the models trained with extracted vocals. With the polyphonic models (third column of Table 3), AE is reduced to below 350 ms on all the test datasets. We observe a large improvement in alignment accuracy on Mauch’s dataset, which contains many songs with long musical interludes; the extracted-vocals models fail to align the lyrics around the long non-vocal sections because of erratic music suppression, whereas the polyphonic models are able to capture the transitions from music to singing vocals. In the following experiments, we use polyphonic audio to train the acoustic models.
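The alignment metric used throughout — the per-song mean absolute word boundary error, averaged over the songs of a dataset [12, 14] — can be computed as follows:

```python
def mean_alignment_error(ref_starts, hyp_starts):
    """Mean absolute word-boundary error (seconds) for one song."""
    assert len(ref_starts) == len(hyp_starts)
    return sum(abs(r - h) for r, h in zip(ref_starts, hyp_starts)) / len(ref_starts)

def dataset_ae(songs):
    """songs: list of (ref_starts, hyp_starts) pairs; per-song AE
    averaged over all songs of a dataset."""
    return sum(mean_alignment_error(r, h) for r, h in songs) / len(songs)
```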

Test datasets  Vocal extracted  No Genre Info  Genre Silence  Genre Silence+Phone
Mauch          3.62             0.25           0.28           0.21
Hansen         0.67             0.16           0.25           0.18
Jamendo        0.39             0.34           0.42           0.22
Table 3: Mean absolute word alignment error (AE) (seconds); the last three columns use acoustic models trained on polyphonic audio.

Test datasets  Vocal extracted  Polyphonic
Mauch          76.31            54.08
Hansen         78.85            60.77
Jamendo        71.83            66.58
Table 4: Lyrics transcription WER (%) comparison of vocal-extracted vs. polyphonic-data-trained acoustic models

4.2 Standard ASR vs. end-to-end ASR

The end-to-end ASR’s lyrics transcription performance reported in Table 5 is comparable to that of Stoller et al.’s end-to-end system [12], which was, however, trained on a much larger dataset. The standard ASR performs considerably better than the end-to-end ASR, as can be seen in the second column of Table 5. This implies that characterizing the different components of polyphonic music with separate acoustic, pronunciation, and language models is valuable for the task of lyrics transcription. The following experiments use the standard ASR framework for exploring genre-informed acoustic modeling.
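For reference, the WER reported in these comparisons is the standard word-level Levenshtein distance normalized by the reference length:

```python
def wer(ref, hyp):
    """Word error rate (%): word-level Levenshtein distance divided by
    the number of reference words."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub/match
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return 100.0 * d[len(r)][len(h)] / len(r)
```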

Test datasets  Standard  End-to-end
Mauch          54.08     73.2
Hansen         60.77     80.1
Jamendo        66.58     87.9
Table 5: Comparison of lyrics transcription WER (%) of standard ASR vs. end-to-end ASR

4.3 Genre-informed acoustic modeling

Lyrics alignment shows an improvement in performance with the genre-informed silence+phone models compared to the genre-agnostic (no genre info) and genre-informed silence models, as seen in Table 3. AE is less than 220 ms across all test datasets. This indicates that for the task of lyrics alignment, where the target lyrics are known, the genre-informed phone models trained on limited data are able to capture the transitions between phones well. Figure 2(a) shows that the alignment error is highest for metal songs, which is intuitive given their loud, noisy background music.

The lyrics transcription performance for the genre-informed silence and silence+phone models using the two kinds of LM is presented in Table 6. The genre-informed silence models show a 2-4% absolute improvement in WER over the genre-agnostic models on all the test datasets. This indicates that creating genre-specific models for the non-vocal segments is a good strategy to capture the variability of music across genres. However, the genre-informed phone models do not show any improvement in WER. This could be due to an insufficient amount of data to train accurate phone models for the three genre types. The hiphop class has the least amount of data, while pop has the most. Figure 2(b) indicates that the performance degradation is greater for hiphop songs than for pop songs. The performance on metal songs improves with genre-informed silence models; however, the WER remains high despite this class having an amount of data comparable to that of pop. This suggests that the loud and dense background music in the metal genre hinders the learning of singing vocal characteristics for accurate lyrics transcription.

Additionally, we observe an improvement in lyrics transcription performance with the lyrics LM over the general LM. This shows that the linguistic constraints due to the rhyming structure of the lyrics are better captured by in-domain (song-lyrics) text than by general text collected from various textual resources.

Test datasets  No Genre Info  Genre Silence  Genre Silence+Phone
General LM
Mauch          54.08          52.45          53.74
Hansen         60.77          59.10          62.71
Jamendo        66.58          64.42          67.91
Lyrics LM
Mauch          45.78          44.02          45.70
Hansen         50.35          47.01          51.32
Jamendo        62.64          59.57          61.90
Table 6: Comparison of lyrics transcription WER (%)
Figure 2: Comparison of (a) lyrics alignment AE (seconds), and (b) lyrics transcription WER (%) across all the test datasets.

4.4 Comparison with existing literature

In Table 7, we compare our best results with the most recent prior work. Our strategy provides the best results for both lyrics alignment and transcription tasks on several datasets. The proposed strategies show a way to induce music knowledge in ASR to address the problem of lyrics alignment and transcription in polyphonic audio.

                MIREX 2017           MIREX 2018  ICASSP 2019       Interspeech 2019
                AK [34]  GD [35,14]  CW [36]     DS [12]  CG [9]   CG [11]           Ours
Lyrics alignment
Mauch           9.03     11.64       4.13        0.35     6.34     1.93              0.21
Hansen          7.34     10.57       2.07        -        1.39     0.93              0.18
Jamendo         -        -           -           0.82     -        -                 0.22
Lyrics transcription
Mauch           -        -           -           70.9     -        -                 44.0
Hansen          -        -           -           -        -        -                 47.0
Jamendo         -        -           -           77.8     -        -                 59.6
Table 7: Comparison of lyrics alignment (AE, seconds) and transcription (WER %) performance with existing literature.

5 Conclusions

In this work, we introduce a music-informed strategy to train polyphonic acoustic models for the tasks of lyrics alignment and transcription in polyphonic music. We model the genre-specific characteristics of music and vocals, and study their performance with different ASR frameworks and language models. We find that this music-informed strategy learns the background music characteristics that affect lyrics intelligibility, and improves lyrics alignment and transcription performance over approaches that suppress the music. We also show that, with the limited available data, our strategy of genre-informed acoustic modeling combined with lyrics-constrained language modeling in a standard ASR pipeline outperforms all existing systems on both lyrics alignment and transcription tasks.


  • [1] S. O. Ali and Z. F. Peynircioğlu, “Songs and emotions: are lyrics and melodies equal partners?,” Psychology of Music, vol. 34, no. 4, pp. 511–534, 2006.
  • [2] B. Anderson, D. Berger, R. Denisoff, K. Etzkorn, and P. Hesbacher, “Love negative lyrics: Some shifts in stature and alterations in song,” Communications, vol. 7, no. 1, pp. 3–20, 1981.
  • [3] A. J. Good, F. A. Russo, and J. Sullivan, “The efficacy of singing in foreign-language learning,” Psychology of Music, vol. 43, no. 5, pp. 627–640, 2015.
  • [4] T. Hosoya, M. Suzuki, A. Ito, S. Makino, L. A. Smith, D. Bainbridge, and I. H. Witten, “Lyrics recognition from a singing voice based on finite state automaton for music information retrieval.,” in ISMIR, 2005, pp. 532–535.
  • [5] H. Fujihara, M. Goto, and J. Ogata, “Hyperlinking lyrics: A method for creating hyperlinks between phrases in song lyrics.,” in ISMIR, 2008, pp. 281–286.
  • [6] M. Gruhne, C. Dittmar, and K. Schmidt, “Phoneme recognition in popular music.,” in ISMIR, 2007, pp. 369–370.
  • [7] A. Mesaros and T. Virtanen, “Automatic recognition of lyrics in singing,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, no. 1, pp. 546047, 2010.
  • [8] M. Ramona, G. Richard, and B. David, “Vocal detection in music with support vector machines,” in Proc. ICASSP. IEEE, 2008, pp. 1885–1888.
  • [9] C. Gupta, B. Sharma, H. Li, and Y. Wang, “Automatic lyrics-to-audio alignment on polyphonic music using singing-adapted acoustic models,” in Proc. ICASSP. IEEE, 2019, pp. 396–400.
  • [10] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, “LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1252–1261, 2011.
  • [11] C. Gupta, E. Yılmaz, and H. Li, “Acoustic modeling for automatic lyrics-to-audio alignment,” in Proc. INTERSPEECH, Sept. 2019, pp. 2040–2044.
  • [12] D. Stoller, S. Durand, and S. Ewert, “End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model,” in Proc. ICASSP. IEEE, 2019, pp. 181–185.
  • [13] N. Condit-Schultz and D. Huron, “Catching the lyrics: intelligibility in twelve song genres,” Music Perception: An Interdisciplinary Journal, vol. 32, no. 5, pp. 470–483, 2015.
  • [14] G. Dzhambazov, Knowledge-based Probabilistic Modeling for Tracking Lyrics in Music Audio Signals, Ph.D. thesis, Universitat Pompeu Fabra, 2017.
  • [15] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, “DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm,” in Proc. ISMIR, 2018.
  • [16] J. Fang, D. Grunberg, D. T. Litman, and Y. Wang, “Discourse analysis of lyric and lyric-based classification of music.,” in ISMIR, 2017, pp. 464–471.
  • [17] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. II–1764–II–1772.
  • [18] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. ICASSP, March 2016, pp. 4960–4964.
  • [19] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/Attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, Dec 2017.
  • [20] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.
  • [21] J. K. Hansen, “Recognition of phonemes in a-cappella recordings using temporal patterns and mel frequency cepstral coefficients,” in 9th Sound and Music Computing Conference (SMC), 2012, pp. 494–499.
  • [22] M. Mauch, H. Fujihara, and M. Goto, “Integrating additional chord information into hmm-based lyrics-to-audio alignment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 200–210, 2012.
  • [23] “Musical genre recognition using a cnn,”, [Online; accessed 5-July-2019].
  • [24] D. Stoller, S. Ewert, and S. Dixon, “Wave-u-net: A multi-scale neural network for end-to-end audio source separation,” in Proc. ISMIR, 2018.
  • [25] “Implementation of the wave-u-net for audio source separation,”, [Online; accessed 5-July-2019].
  • [26] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011.
  • [27] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, “Semi-orthogonal low-rank matrix factorization for deep neural networks,” in Proc. INTERSPEECH, 2018, pp. 3743–3747.
  • [28] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” in Proc. INTERSPEECH, 2015, pp. 3586–3589.
  • [29] C. Gupta, H. Li, and Y. Wang, “Automatic pronunciation evaluation of singing,” Proc. INTERSPEECH, pp. 1507–1511, 2018.
  • [30] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, April 2015, pp. 5206–5210.
  • [31] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “Espnet: End-to-end speech processing toolkit,” in Proc. INTERSPEECH, 2018, pp. 2207–2211.
  • [32] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” 2014.
  • [33] A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017.
  • [34] A. M. Kruspe, “Bootstrapping a system for phoneme recognition and keyword spotting in unaccompanied singing.,” in ISMIR, 2016, pp. 358–364.
  • [35] G. B. Dzhambazov and X. Serra, “Modeling of phoneme durations for alignment between polyphonic audio and lyrics,” in 12th Sound and Music Computing Conference, 2015, pp. 281–286.
  • [36] Chung-Che Wang, “Mirex2018: Lyrics-to-audio alignment for instrument accompanied singings,” in MIREX 2018, 2018.