Modeling syllables instead of words is a frequent choice for languages of syllabic nature. However, it can also improve tasks related to automatic speech recognition (ASR) such as keyword search or identifying out-of-vocabulary (OOV) words [20, 22]. Agglutinative languages such as Turkish, Finnish or Swahili, which form words by joining long successions of word units, can benefit from good syllable recognition as they regularly build new words from basic morphemes [5, p. 293]. Furthermore, speech therapy often relies on assessing different granularities of speech, from phrases and words down to syllables and phones. For example, fluency in speech is often measured in terms of (dis-)fluent syllables or their durations [2, 27, 11, 12].
Traditional word-based ASR systems typically depend on prior knowledge in the form of pronunciation dictionaries. While these can grow to millions of words, the resulting systems still regularly face OOV events. Phoneme recognition could help to avoid those, but its error rates are typically around 20-35% and thus not reliable enough. This forms the motivation for syllable recognition. A language such as German has about 40 phonemes and, due to its ability to form compound words, a near-infinite vocabulary. Although it is hard to come up with a definite number, datasets suggest that about 3000 syllables – most of them rare – are sufficient to model large vocabularies. (The vocabulary of the Verbmobil data used later consists of 2825 distinct syllables.) The average length of German words is 1.83 syllables [5, p. 87].
End-to-end models have become increasingly popular in speech recognition over the last couple of years. Their success suggests that modeling prior knowledge of natural language has become insignificant. Especially Baidu’s Deep Speech implementation has played a major part in establishing end-to-end speech recognition as a viable alternative to the traditional hybrid systems that combine hidden Markov models (HMM) and deep neural networks (DNN) [9, 3]. These systems work under the assumption that they are able to learn the acoustic model (AM) and the language model (LM) in a combined effort, without the need for specialized prior-infused systems for each task. Results on large datasets indicate that no explicit language model is needed and that prior knowledge about the language is in fact not required to successfully create ASR systems [6, 3, 21, 10]. Due to the complexity of the models and the vast amount of data necessary, end-to-end systems require substantial resources to train, which makes it relatively expensive to adapt them to new requirements such as a different vocabulary or domain. A huge advantage of the traditionally structured systems is that they can in part be trained very quickly and easily adapted to a new context or new requirements. Another drawback of most end-to-end approaches is that they lack accurate time alignment information, which is needed for applications such as keyword search or paralinguistic speech processing.
The contribution of this paper is a detailed comparison of a traditional hybrid HMM/DNN system and an end-to-end multi-layer long short-term memory (LSTM) network trained using connectionist temporal classification (CTC) for syllable-based ASR on a medium-sized corpus of spontaneous German speech.
For all experiments conducted for this paper, the Verbmobil (VM) corpus was used, which comprises the recordings of the VM1 and VM2 partitions. It contains recordings of spontaneous dialogs of people making appointments and doing general travel planning over the phone. All systems were trained and evaluated on the published training and evaluation partitions. (Dataset definitions can be obtained from bas.uni-muenchen.de.) The dataset consists of 24435 utterances in 962 dialogs spoken by 629 speakers; the provided lexicon contains 9036 German words. The training set amounts to about 46 hours of audio. All audio data was sampled at 16 kHz and is exclusively in German. We chose the Verbmobil data over the larger Voxforge corpus since written text, and thus the read speech of Voxforge, is much more structured and fluent than spontaneous speech – which is the typical data for the applications described above. Since the LSTMs jointly learn the language model, we preferred data that is more closely related to the applications in question.
The lexicon with hand-labeled syllable boundary information was provided by the Friedrich-Alexander University (FAU). It is the result of merging lexica that were created for the research projects EVAR, Verbmobil, SmartKom and SmartWeb. Guidelines used in its creation closely match the ones described in . The lexicon itself was incomplete since it contained only words from the VM1 partition of the dataset. The missing 2963 entries were generated using a Phonetisaurus grapheme-to-phoneme (g2p) model trained on the FAU syllable lexicon. (The g2p tool is available online at https://github.com/AdolfVonKleist/Phonetisaurus.) For about 40 words, the g2p model was unable to produce a proper syllable representation; those were manually generated. The syllable transliterations of the training text were obtained by translating the text using the previously generated German syllable lexicon for the VM corpus.
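This translation step amounts to a simple lexicon lookup per word. A minimal sketch (the lexicon entries and the `<unk>` handling shown here are invented for illustration; the real entries come from the FAU lexicon):

```python
# Hypothetical syllable lexicon entries, invented for illustration.
syllable_lexicon = {
    "hallo": ["ha", "lo"],
    "guten": ["gu", "ten"],
    "morgen": ["mor", "gen"],
}

def to_syllables(transcript, lexicon, unk="<unk>"):
    """Map a word-level transcript to a syllable transliteration.

    Each word is replaced by its syllable sequence from the lexicon;
    words missing from the lexicon become a single <unk> token.
    """
    syllables = []
    for word in transcript.lower().split():
        syllables.extend(lexicon.get(word, [unk]))
    return syllables

print(to_syllables("Guten Morgen Welt", syllable_lexicon))
# ['gu', 'ten', 'mor', 'gen', '<unk>']
```
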
The end-to-end systems were trained using the multi-purpose machine learning toolkit TensorFlow, using the same Kaldi-generated features. We use two “views” on the data: six experiments were conducted using all 2825 syllables and six experiments were conducted using a subset of 199 syllables. This was done to reduce the number of targets for the CTC training and to reduce the model size of the hybrid system. The split may seem arbitrary at first but was chosen because the 199 most common syllables cover 80% of the dataset; the remaining syllables amount to only 20% of the data. Even though structured models handle many syllables well, it is interesting to study how that number affects their performance and to allow a fully fair comparison of the two modeling approaches.
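The reduced target set can be derived directly from corpus statistics by keeping the most frequent syllables until a given token coverage is reached. A minimal sketch (the toy corpus is invented; on the real data this selection yields the 199 syllables covering 80% of tokens):

```python
from collections import Counter

def reduced_syllable_set(syllable_tokens, coverage=0.8):
    """Smallest set of most frequent syllables covering `coverage`
    of all syllable tokens in the corpus."""
    counts = Counter(syllable_tokens)
    total = sum(counts.values())
    kept, covered = [], 0
    for syllable, n in counts.most_common():
        if covered / total >= coverage:
            break
        kept.append(syllable)
        covered += n
    return kept

# Toy corpus: 20 syllable tokens, of which "de", "ge", "ta" cover 90%.
tokens = ["de"] * 8 + ["ge"] * 6 + ["ta"] * 4 + ["xy"] + ["zo"]
print(reduced_syllable_set(tokens))
# ['de', 'ge', 'ta']
```
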
The features used for both systems are 40-dimensional Mel frequency cepstral coefficients (MFCC “hi-res”) with cepstral mean and variance normalization (CMVN) applied. Although i-vectors result in slightly better performance, we did not include them since the point of our experiments is to compare traditional hybrid and end-to-end models.
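Per-utterance CMVN itself is a simple standardization of each coefficient track over time. A minimal numpy sketch (the random matrix stands in for a real 40-dimensional MFCC sequence; the actual features were produced by Kaldi):

```python
import numpy as np

def cmvn(features, eps=1e-10):
    """Per-utterance cepstral mean and variance normalization:
    every coefficient track is shifted to zero mean and scaled
    to unit variance over the frames of the utterance."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Stand-in "MFCC" matrix: 100 frames x 40 coefficients.
feats = np.random.default_rng(0).normal(5.0, 2.0, size=(100, 40))
norm = cmvn(feats)
```
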
3.1 Hybrid System
The setup of the Kaldi speech recognition system is based on the Kaldi Wall Street Journal (WSJ) recipe, which builds a hybrid HMM/DNN system. For the training of the AM, the “nnet3 chain” implementation was used, which uses a time-delay neural network (TDNN) architecture with a maximum mutual information (MMI) sequence-level objective function [16, 18]. The targets for the DNN training are the context-dependent states obtained by forced alignment with an HMM/GMM baseline system.
For the reduced syllable set, the less common syllables were removed from the lexicon and thus regarded as unknown (unk). Additionally, silence, laughter, noise, vocalized noise and unintelligible sounds were added to the lexicon. The basis for the experiments was a common word-based speech recognition system trained on the VM dataset, yielding a word error rate (WER) of 8.7% with a lexicon containing 9036 words and a 4-gram word-based LM with a perplexity of 59.53. Based on this system, the lexicon was replaced by the previously generated syllable lexicon, and the language model was replaced by a syllable-based language model.
Tab. 1: Perplexities of the syllable language models for the reduced and the full syllable set.
3.2 End-to-End System
Long short-term memory (LSTM) units are a special kind of unit in recurrent neural networks (RNN) that are able to learn long-term dependencies through a specific gating mechanism. The connectionist temporal classification (CTC) loss function uses a blank symbol and sums over all possible alignments. This makes training slower but independent of per-frame time alignments, and it allows training deep neural networks for tasks that require long context. For the end-to-end training, the original word-based transcripts were mapped to syllable transliterations.
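The sum over all alignments that CTC computes can be evaluated efficiently with the standard forward recursion over a blank-interleaved label sequence. A minimal numpy sketch in the probability domain, not the TensorFlow implementation used in the experiments:

```python
import numpy as np

def ctc_forward(probs, labels, blank=0):
    """Probability of `labels` under CTC given per-frame posteriors
    `probs` (shape T x V), summing over all valid alignments via the
    forward (alpha) recursion."""
    T, _ = probs.shape
    ext = [blank]
    for l in labels:                 # interleave blanks: ^ l1 ^ l2 ^
        ext += [l, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]   # start in blank or first label
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                      # stay
            if s > 0:
                a += alpha[t - 1, s - 1]             # advance by one
            # skip over a blank, unless the label would repeat
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    tail = alpha[T - 1, S - 2] if S > 1 else 0.0
    return alpha[T - 1, S - 1] + tail

# Tiny example: 4 frames, vocabulary {blank, a, b}, target "ab" -> [1, 2].
rng = np.random.default_rng(1)
probs = rng.random((4, 3))
probs /= probs.sum(axis=1, keepdims=True)   # per-frame posteriors
p_ab = ctc_forward(probs, [1, 2])
```

A real implementation works in the log domain to avoid underflow on long utterances; the probability-domain version above is only safe for toy sizes.
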
The basic network type used in this paper is the bidirectional LSTM (bLSTM). Bidirectionality is especially useful when future context helps to disambiguate the current frame. bLSTMs were chosen because they produce good results for ASR [7, 8]. The script used for training the models was based on Ford deepDSP’s deepSpeech implementation (source code available online at https://github.com/fordDeepDSP/deepSpeech), which in turn is based on Deep Speech 2 . All end-to-end models were trained using the Adam optimizer. Learning rate decay was used to reduce the learning rate from 0.001 down to 0.00003 as training progressed. The experiments were performed with three different network topologies. For the first two end-to-end experiments, a network with a single hidden layer of 256 nodes was used. The following two experiments were conducted with three fully connected hidden layers of 196 nodes each. For the final two experiments, we added regularization in the form of dropout (50%) at training time; otherwise, the network still consisted of three hidden layers with 196 nodes each. Fig. 1 shows the final topology. The output of the network was taken as is and no post-processing was applied. At the time of writing, no lattice generation or LM-based prefix beam search was performed on the output of the bLSTM networks.
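Taking the output “as is” presumably corresponds to best-path decoding: the arg-max symbol per frame, with repeats collapsed and blanks removed. A minimal sketch (the blank index and the toy posteriors are assumptions):

```python
import numpy as np

def best_path_decode(posteriors, blank=0):
    """Greedy CTC decoding: frame-wise argmax over the posteriors,
    collapse consecutive repeats, then drop blank symbols."""
    best = posteriors.argmax(axis=1)
    out, prev = [], None
    for s in best:
        if s != prev and s != blank:
            out.append(int(s))
        prev = s
    return out

# Frames whose argmax sequence is [1, 1, 0, 2, 2] decode to [1, 2].
post = np.array([[0.1, 0.8, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.9, 0.05, 0.05],
                 [0.1, 0.2, 0.7],
                 [0.1, 0.3, 0.6]])
print(best_path_decode(post))  # [1, 2]
```
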
4.1 Hybrid System
As baselines to compare against the end-to-end models, which are assumed to learn AM and LM in one training, 0-gram (no prior) and 1-gram (weak prior) LMs were trained. The average length of German words is about 1.83 syllables [5, p. 87]; thus a 4-gram syllable LM represents contextual knowledge about the composition of words plus some context. Perplexities for the language models evaluated on the test set can be found in Tab. 1. The LMs for all experiments were always trained on the complete training set, with either the complete or the reduced syllable set.
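For reference, perplexity on a test set is the inverse geometric mean of the per-token probabilities. A minimal sketch for the unigram (1-gram) case with add-one smoothing (the toy data and the smoothing choice are illustrative; the actual LMs were presumably trained with a standard toolkit):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram LM on a test sequence:
    exp of the negative mean log probability per token."""
    counts = Counter(train_tokens)
    vocab = set(train_tokens) | set(test_tokens)
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    log_prob = 0.0
    for tok in test_tokens:
        log_prob += math.log((counts[tok] + alpha) / denom)
    return math.exp(-log_prob / len(test_tokens))

# Toy syllable corpus and test sequence.
pp = unigram_perplexity(["ge", "de", "ge", "ta"], ["ge", "de"])
print(pp)
```
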
The results for the experiments performed with Kaldi can be found in the left part of Tab. 2. As expected, the experiments with the 0-gram language model show very weak performance since they foremost rely on the AM. This confirms the important role of the LM in hybrid systems. The WER (for the remainder of this article, WER is computed w.r.t. the syllable transcription) for models with more syllables is consistently better; likewise, higher-order LMs consistently perform better. With a WER of 10%, the model with the 4-gram LM and the complete lexicon yields the best result overall. There is a huge improvement in WER from the 0-gram to the 1-gram model: even a very simple LM with no contextual information leads to this large gain. With fewer syllables, the WER still improves, but this also exposes a weakness of LMs that rely on learned sequences and lose some of their power when facing many unknown words. The test dataset consists of about 37,000 syllable instances, of which about 6,800 were regarded as OOV (about 18%). This helps to put the results for the reduced syllable experiments into perspective, since decoding performance suffers significantly from a high OOV rate.
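Since WER is computed w.r.t. the syllable transcription here, the metric is a token error rate over syllable sequences, obtained via Levenshtein alignment. A minimal sketch with invented example sequences:

```python
def error_rate(reference, hypothesis):
    """Token error rate (here: over syllables) via Levenshtein distance:
    (substitutions + deletions + insertions) / len(reference)."""
    n, m = len(reference), len(hypothesis)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[n][m] / n

ref = ["gu", "ten", "mor", "gen"]
hyp = ["gu", "ten", "gen"]
print(error_rate(ref, hyp))  # 0.25 (one deleted syllable)
```
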
4.2 End-to-End System
Tab. 2 shows the WER results for the CTC-trained bLSTM experiments using three different network topologies. The results indicate that more layers lead to better performance, with a best WER result of 27.53% using the reduced number of syllables, dropout for regularization and three hidden layers. This confirms the assumption that deeper networks perform better and the network does better with fewer training targets. The difference in performance between reduced and full syllable sets almost vanishes for the networks where no dropout was performed. However, the reduction of targets does not appear to be a deciding factor: Relative improvement is only about 7.5% in WER and does not appear reasonable compared to the amount of information lost. On the other hand, adding dropout and more layers leads to a 24% relative improvement in performance for the experiments with the complete target set.
Tab. 2: Syllable error rates for the hybrid system (0-gram, 1-gram, 4-gram LM) and the end-to-end system (1 layer, 3 layers, 3 layers + dropout), each for the reduced and the full syllable set.
A surprising observation is that the number of training targets (i.e. syllables) does not seem to have a strong impact on the WER results, contrary to what was previously assumed for the bLSTM training. The difference between decoding results almost vanishes just by adding additional layers. Dropout improves performance by up to 18%, showing the importance of regularization and also confirming the assumption that depth is important [8, 3]. The bLSTM results are somewhat worse than expected, which could be due to the relatively small dataset.
The hybrid ASR system generally performs better with more targets as soon as it has some useful context information through the LM. This shows the huge impact of the LM on the decoding results. As expected, the results for decoding with more unknown syllables are much worse than with all targets and a 4-gram LM. The flexibility of the hybrid model is also worth noting: training times for the bLSTM networks were extensive, especially when compared to the swiftness of LM training and decoding graph generation, whereas the hybrid system can easily be adapted to fit a new lexicon and language model. Overall, the hybrid system's results on this medium-sized dataset were significantly better.
Even though syllables and words are units of different size and error rates are not directly comparable without post-processing the syllable recognition results, it is still interesting to see that the error rates for syllable recognition are only slightly worse than the results for word recognition on this dataset. For future work, it might be important to evaluate the accuracy of the time alignments, which matter for many downstream tasks such as keyword search or paralinguistic analyses.
6 Conclusion and Outlook
The direct training of an end-to-end system for syllable recognition is feasible but not as reliable as the hybrid system using an LM, which may change with more training data. For this specific task on a medium-sized corpus, the hybrid approach yields significantly better results. To achieve better performance with the bLSTMs, their output needs to be combined with an LM-based prefix beam search, or the syllable network needs to be trained along with an LM as proposed in .
For future experiments, we plan to investigate syllable-to-word transduction in order to build an end-to-end system that has high word accuracy while requiring little memory and CPU usage.
- (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- (2018) The speech efficiency score (SES): a time-domain measure of speech fluency. Journal of Fluency Disorders.
- (2016) Deep Speech 2: end-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182.
- (2001) Guidelines for ‘text-to-phone’ (TTP) conversion, i.e., the SAMPA inventory plus some other conventions for the SmartKom lexicon, version 1.0. Technical report, Friedrich-Alexander-Universität Erlangen-Nürnberg.
- (2004) Die Cambridge Enzyklopädie der Sprache. Zweitausendeins, Frankfurt a.M.
- (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
- (2013) Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pp. 273–278.
- (2013) Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649.
- (2014) First-pass large vocabulary continuous speech recognition using bi-directional recurrent DNNs. arXiv preprint arXiv:1408.2873.
- (2014) Deep Speech: scaling up end-to-end speech recognition. CoRR abs/1412.5567.
- (2013) Procedures used for assessment of stuttering frequency and stuttering duration. Clinical Linguistics & Phonetics 27 (12), pp. 853–861.
- (2006) Guidelines for statistical analysis of percentage of syllables stuttered data. Journal of Speech, Language, and Hearing Research 49, pp. 867–878.
- (2011) Phoneme recognition on the TIMIT database. In Speech Technologies.
- (1985) The speech understanding and dialog system EVAR. In New Systems and Architectures for Automatic Speech Recognition and Synthesis, R. De Mori and C. Y. Suen (Eds.), Berlin, Heidelberg, pp. 271–302.
- (2015) JHU ASpIRE system: robust LVCSR with TDNNs, i-vector adaptation and RNN-LMs. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, AZ, USA, pp. 539–546.
- (2015) A time delay neural network architecture for efficient modeling of long temporal contexts. In Sixteenth Annual Conference of the International Speech Communication Association.
- (2011) The Kaldi speech recognition toolkit. In IEEE 2011 workshop.
- (2016) Purely sequence-trained neural networks for ASR based on lattice-free MMI. Technical report, The Johns Hopkins University, Baltimore, United States.
- (2003) SmartKom: adaptive and flexible multimodal access to multiple applications. In Proceedings of the 5th International Conference on Multimodal Interfaces, ICMI ’03, New York, NY, USA, pp. 101–108.
- (2013) A study on LVCSR and keyword search for Tagalog. In INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association (ISCA), Lyon, France.
- (2015) Learning acoustic frame labeling for speech recognition with recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 4280–4284.
- (2017) Improved subword modeling for WFST-based speech recognition. In Proceedings of Interspeech 2017, pp. 2551–2555.
- (2017) Cold fusion: training seq2seq models together with language models. arXiv preprint arXiv:1708.06426.
- (2002) SRILM – an extensible language modeling toolkit. In Proceedings of the 7th International Conference on Spoken Language Processing (ICSLP 2002), pp. 901–904.
- (2004) SmartWeb: mobile applications of the semantic web. In KI 2004: Advances in Artificial Intelligence, S. Biundo, T. Frühwirth, and G. Palm (Eds.), Berlin, Heidelberg, pp. 50–51.
- (2013) Verbmobil: foundations of speech-to-speech translation. Springer.
- (1997) Clinical measurement of stuttering behaviors. Contemporary Issues in Communication Science and Disorders 24 (24), pp. 33–44.