
From Senones to Chenones: Tied Context-Dependent Graphemes for Hybrid Speech Recognition

10/02/2019
by   Duc Le, et al.
Facebook

There is an implicit assumption that traditional hybrid approaches for automatic speech recognition (ASR) cannot directly model graphemes and need to rely on phonetic lexicons to get competitive performance, especially on English, which has poor grapheme-phoneme correspondence. In this work, we show for the first time that, on English, hybrid ASR systems can in fact model graphemes effectively by leveraging tied context-dependent graphemes, i.e., chenones. Our chenone-based systems significantly outperform equivalent senone baselines by 4.5% to 11.1% relative on three different English datasets. Our results on Librispeech are state-of-the-art compared to other hybrid approaches and competitive with previously published end-to-end numbers. Further analysis shows that chenones can better utilize powerful acoustic models and large training data, and require context- and position-dependent modeling to work well. Chenone-based systems also outperform senone baselines on proper noun and rare word recognition, an area where the latter is traditionally thought to have an advantage. Our work provides an alternative for end-to-end ASR and establishes that hybrid systems can be improved by dropping the reliance on phonetic knowledge.


1 Introduction

In the past decade, neural network acoustic models have become a staple in automatic speech recognition (ASR). This movement began with the hybrid approach, where a Hidden Markov Model (HMM) models the temporal structure of speech and a Deep Neural Network (DNN) replaces the Gaussian Mixture Model (GMM) to estimate the emission probabilities of HMM states [9, 31, 10, 22]. The DNN was subsequently replaced with variants of Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN) [15, 41, 42, 38, 34, 35], whose primary advantage comes from their ability to model long temporal context. Output units for hybrid ASR are typically tied context-dependent (CD) states/phones, i.e., senones, which are automatically clustered using decision trees [47] and require expertly-produced phonetic lexicons. There have been multiple attempts to model graphemes directly within the hybrid ASR framework, motivated by their simplicity [24, 20, 11, 17, 46]. These efforts have been largely successful on low-resource languages, especially those whose written form encodes rich phonetic information [24, 11]. However, on English, where the grapheme-phoneme correspondence is poor and phonetic lexicons are highly optimized, grapheme-based approaches have consistently underperformed their phonetic counterparts in terms of Word Error Rate (WER) [17, 46].

With the rise of end-to-end techniques, starting with Connectionist Temporal Classification (CTC) [12, 13, 14, 40, 1], followed by sequence-to-sequence attention-based models [8, 5, 37, 7, 48] and sequence transducers [16, 37, 3, 21], the use of graphemic units has become more prevalent. These systems are able to directly model graphemes, word pieces, or whole words while achieving state-of-the-art performance on various ASR tasks. Not having to rely on expert phonetic knowledge significantly reduces the development cost of new ASR systems and is often cited as a major advantage of end-to-end methods over traditional hybrid/HMM-based approaches. On the other hand, one disadvantage of end-to-end techniques is that they typically require larger amounts of data than hybrid methods to achieve good performance. It is therefore appealing to combine the efficiency of hybrid ASR with the simplicity of graphemic modeling.

In this paper, we reassess the assumption that hybrid ASR cannot model graphemes effectively, specifically for English. We show that, contrary to popular belief, hybrid ASR systems utilizing tied CD graphemes, or chenones for short, significantly outperform equivalent senone baselines by 4.5% to 11.1% relative on three different English datasets: the publicly available Librispeech corpus [32] and two large-scale in-house datasets in the Video and Assistant domains. Our chenone-based system achieves one of the best reported numbers to date on Librispeech, with WERs of 3.2% on test-clean and 7.6% on test-other using only the provided 4-gram language model (LM) in decoding. We show that chenones can better exploit increases in model capacity and training data than senones, leading to better recognition accuracy. Chenone-based systems also perform better at proper noun and rare word recognition, an area where senones are traditionally thought to have an advantage due to the use of human-curated lexicons and grapheme-to-phoneme (g2p) models. Finally, our ablative analysis shows that the key to achieving good performance with chenones is context- and position-dependent modeling. Based on these results, we conclude that the ability to model graphemes directly is not unique to end-to-end methods, and that traditional hybrid ASR systems can achieve better results by dropping all reliance on phonetic information.

2 Related Work

Hybrid ASR systems have traditionally been built upon phonetic lexicons, which map words to sequences of phones that encode their pronunciations. Phonetic lexicons, such as the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict), are typically produced by linguists and undergo many careful reviews. We can further train a g2p model on these lexicons to predict pronunciations for previously unseen words [4]. The main disadvantage of phonetic-based approaches is that such lexicons are difficult to create and maintain, since they require specialized linguistic knowledge about the target language. The simplicity of graphemic modeling is therefore an appealing alternative.

Previous work has shown that, for several languages with a regular grapheme-phoneme relationship or complex segmental writing systems, graphemic modeling can perform on par with or better than phoneme-based approaches [24, 11]. In [11], the authors proposed to derive phonetic features from the grapheme representation by extracting Unicode character descriptors; this enabled graphemic lexicons with CD modeling to outperform phonetic lexicons. By contrast, for languages that have a simple writing system with no explicit phonetic descriptors and an irregular grapheme-phoneme relationship, such as English, graphemic units have underperformed within the traditional HMM-based framework [44, 23, 20, 17, 46]. In [17], the authors explored end-to-end lattice-free MMI (LF-MMI) training of acoustic models and showed that context-independent (CI) graphemes performed worse than CI phones on Wall Street Journal and Switchboard. A similar observation was made for both CI and CD modeling when benchmarking graphemes against phonemes on a multi-genre broadcast transcription task [46]. In [44], the authors obtained almost on-par performance using CD graphemes with letter-specific, coda, and onset modeling; however, this was done with HMM-GMM systems trained with Maximum Likelihood rather than in the HMM-DNN framework.

The recent emergence of end-to-end techniques has enabled ASR systems to model graphemes directly while achieving state-of-the-art results on multiple English datasets [40, 7, 33]. Within this paradigm, the modeling can be done at the grapheme level (e.g., “t h e _ c a t”) [12, 5, 39], word piece level (e.g., “_the _c at”) [48, 7, 21], whole word level (e.g., “the _ cat”) [40, 43, 2], or a mixture of words and graphemes [28, 45]. Specifically for English, it has been shown that CI graphemes performed better than CI phones in sequence-to-sequence attention-based models, though the former under-performed on proper nouns and rare words [39]. Being able to remove the reliance on phonetic lexicons greatly simplifies the process of building new ASR systems and is often cited as a major advantage of end-to-end methods over the traditional hybrid approach.

The contributions of this work are two-fold. Firstly, we present our approach to graphemic hybrid ASR which, for the first time, is able to consistently outperform equivalent phonetic baselines on a variety of English datasets. Our approach is based on well-known techniques in hybrid ASR with several modifications for graphemic units. This approach is an alternative for end-to-end ASR, combining the efficiency of traditional hybrid methods and the simplicity of grapheme-based modeling. Secondly, we provide detailed analysis to better understand the differences between phonetic and graphemic systems, including proper noun and rare word recognition accuracy, performance as a function of acoustic model (AM) capacity and amount of training data, as well as the importance of context and position dependency. Such in-depth studies have not been done on hybrid systems in previous work.

3 Data

3.1 Librispeech

Librispeech is a publicly available dataset consisting of audio book recordings (reading-style speech) [32]. The dataset contains 960 hours of acoustic training data, two development sets (dev-clean, dev-other), and two test sets (test-clean, test-other). The “other” sets are more acoustically challenging than the “clean” sets. All four development/test sets are between 5 and 5.5 hours in total duration. The official LM is a 4-gram model with a 200K vocabulary trained on audio book text (much more text than the acoustic training transcripts). The official lexicon is a 200K-word phonetic lexicon with the same vocabulary as the LM, containing both human-curated pronunciations from the CMU dictionary and g2p-generated pronunciations.

3.2 Video

This dataset is sampled from in-house English video data publicly shared by users. The data are completely anonymized; neither transcribers nor researchers have access to any user-identifiable information (UII). The training set consists of 941.6K videos, or 13.7K hours of audio. We use two test sets: vid-clean, with 1.4K videos (20.9 hours), and vid-noisy, with 1.3K more acoustically and phonetically challenging videos (20.1 hours). All hyperparameter tuning is done on a held-out development set of 633 videos, totaling 9.7 hours. The average video is 52.6 seconds long with a standard deviation of 45.9 seconds. This dataset contains a diverse array of speakers, content types, and acoustic conditions, and is the most challenging of the three datasets considered in this work.

3.3 Assistant

This in-house anonymized dataset is collected through crowd-sourcing. It consists of utterances recorded via mobile devices where crowd-sourced workers ask a smart assistant to carry out certain actions, such as calling a friend, playing music, or getting weather information. The training set comprises 15.7M utterances (12.5K hours) from 20K speakers. The development set for hyper-parameter tuning consists of 48 speakers disjoint from the training set, totaling 34.4K utterances (32.6 hours). The test set (ast-test) contains 58.8K utterances (50.4 hours) from 300 speakers that are unseen in both training and development. The average utterance length is 2.9 seconds with a standard deviation of 1.3 seconds.

4 Graphemic Lexicon for Hybrid ASR

hello          h_WB e l l o_WB
Michael’s      M_WB i c h a e l ’ s_WB
Ritz-Carlton   R_WB i t z - C a r l t o n_WB
DNN            D_WB N N_WB
D.N.N.         D_WB N N_WB
naïve          n_WB a i v e_WB
Table 1: Example entries in our proposed graphemic lexicon.

The primary challenge of graphemic ASR for English is that the mapping from graphemes to sounds is inherently ambiguous. The key is therefore to break down the output space with enough granularity for the AM to generalize effectively. Our proposed method is based on three main hypotheses. Firstly, we hypothesize that context dependency is especially important for graphemes. It is well known in the ASR literature that senones significantly outperform CI phones due to the former’s ability to account for co-articulation [47, 9]. Given the high degree of ambiguity between graphemes and their acoustic realization, we argue that chenones will outperform CI graphemes by an even larger margin. This hypothesis is supported by previous findings, where CI phones outperform CI graphemes in the hybrid paradigm [17, 46] and the improvement from CI to CD is larger for graphemes than for phonemes [46]. Secondly, we hypothesize that position dependency is important for graphemes: the same grapheme (e.g., “a”) may sound different depending on whether it appears at a word boundary (e.g., “amber”, “theta”) or in the middle of a word (e.g., “dart”). This is supported by previous experiments with HMM-GMM [44]; however, it is unclear whether the result still holds in a hybrid setup. Thirdly, we hypothesize that casing information is important for graphemes. The convention in written English is to upper-case abbreviations (e.g., “DNN”) and capitalize proper nouns (e.g., “Michael”). We argue that when the data follow this convention, it is preferable to preserve the casing rather than lower-case every letter. This may enable the model to better distinguish abbreviations from their lower-cased forms (e.g., “SAT” vs. “sat”) and learn pronunciation variants of proper nouns. Combining context dependency, position dependency, and casing information creates enough granularity for the AM to handle the phonetic ambiguity of graphemes.

Table 1 gives several examples of what our graphemic lexicon looks like after incorporating these three hypotheses. The grapheme set used in this work is limited to the 26 standard English characters (both lower-cased and upper-cased), plus hyphens, apostrophes, and two special tokens, SIL and GARBAGE, which map to silence and out-of-vocabulary (OOV) words, respectively. Graphemes at word boundary positions are annotated with a special WB tag. The WB and non-WB versions of the same grapheme are technically separate acoustic units; however, they may be merged together during decision tree clustering. Letters that are not in the grapheme set are simply ignored; for example, “DNN” and “D.N.N.” map to the same pronunciation since the grapheme “.” is skipped. We convert words with accent marks (e.g., “naïve”) to their closest non-accented form using unidecode (https://pypi.org/project/Unidecode/). Once the graphemic lexicon is prepared, we can apply traditional hybrid ASR techniques as usual, treating graphemes analogously to phonemes.
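To make this concrete, below is a minimal Python sketch (an illustration, not the exact tooling used in this work) that builds graphemic lexicon entries following the rules above; the function and variable names are ours, and the only external dependency is the unidecode package used for accent folding.

from unidecode import unidecode

# Grapheme inventory: 26 English letters in both cases, plus hyphen and apostrophe.
GRAPHEMES = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ-'")

def word_to_graphemes(word, keep_case=True):
    """Map a word to its graphemic 'pronunciation' with word-boundary (WB) tags."""
    word = unidecode(word)          # fold accents, e.g., "naïve" -> "naive"
    if not keep_case:
        word = word.lower()
    units = [c for c in word if c in GRAPHEMES]   # drop ".", etc.
    if not units:
        return []
    units[0] += "_WB"               # tag the word-initial grapheme
    if len(units) > 1:
        units[-1] += "_WB"          # tag the word-final grapheme
    return units

if __name__ == "__main__":
    for w in ["hello", "Michael's", "Ritz-Carlton", "D.N.N.", "naïve"]:
        print(w, " ".join(word_to_graphemes(w)))

Running this on the words of Table 1 reproduces entries such as “D.N.N. D_WB N N_WB”.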

5 Experimental Setup

5.1 Lexicon Preparation

For graphemic lexicons, we follow the process described in Section 4. The final grapheme set differs for each dataset due to different conventions in the training transcripts. Among the three datasets, Video is the only one where the training text has casing information; we use all lower-cased graphemes for the other two datasets.

For phonetic lexicons, we follow the same approach to annotate phones at word boundaries with the WB tag. The Librispeech phonetic lexicon uses the provided CMU phone sets, preserving all stress markers. These different variants of the same phone may be clustered together during decision tree building. The phonetic lexicons for Video and Assistant use our in-house English phone set based on International Phonetic Alphabet (IPA), with no explicit stress markers.

5.2 GMM Bootstrapping and Decision Tree Building

We train a bootstrap HMM-GMM system up to Speaker Adapted Training (SAT), following the standard Kaldi Librispeech recipe (https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/, as of July 2019). This bootstrapping process uses 1K hours of randomly selected training data; for Librispeech this corresponds to the entire training set. We then generate alignments on the bootstrap data and build tri-phonetic/tri-graphemic decision trees with questions automatically generated from the alignment statistics. Each phoneme/grapheme and its word boundary (WB) variant can optionally share the same starting root node and may be clustered together. The number of tied CD phones (senones) and graphemes (chenones) ranges from 7K to 9K across all systems, chosen based on bootstrap WER on the development split. We model each phoneme/grapheme using a simple 1-state HMM topology with fixed self-loop and forward probabilities (0.5 each).
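As a small illustration of the context modeling described above, the following Python sketch (ours, for exposition only) enumerates the tri-grapheme contexts of a word and strips the WB tag to recover the root identity that a WB and non-WB variant may share; the helper names and the use of SIL as padding context at the edges are assumptions made for this example, not part of our pipeline.

def trigrapheme_contexts(graphemes, pad="SIL"):
    """Yield (left, center, right) context triples for a grapheme sequence."""
    padded = [pad] + list(graphemes) + [pad]
    for i in range(1, len(padded) - 1):
        yield padded[i - 1], padded[i], padded[i + 1]

def root_identity(grapheme):
    """WB and non-WB variants of a grapheme can share the same starting root node."""
    return grapheme[:-3] if grapheme.endswith("_WB") else grapheme

if __name__ == "__main__":
    hello = ["h_WB", "e", "l", "l", "o_WB"]
    for left, center, right in trigrapheme_contexts(hello):
        print(f"{left}-{center}+{right}   (root: {root_identity(center)})")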

5.3 Acoustic Model Training

All experiments in this work employ multi-layer Latency-Controlled Bidirectional Long Short-Term Memory RNNs (LC-BLSTM) [49] with a softmax output layer, trained on 80-dimensional log Mel-filterbank features extracted with 25ms FFT windows and a 10ms frame shift. The LC-BLSTM operates on 1.28s chunks with 200ms lookahead. The default model architecture has four hidden layers with 600 units per direction (4x600), totaling approximately 35M free parameters. We also try two larger architectures for Librispeech, 5x800 (approx. 80M params) and 6x1000 (approx. 140M params), in order to better understand AM performance as a function of network size. The first hidden layer of the LC-BLSTM subsamples the input by a factor of two [34], so that posteriors are emitted at a reduced 20ms frame rate.
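For concreteness, the PyTorch sketch below captures the overall shape of this acoustic model: stacked bidirectional LSTM layers over 80-dimensional log Mel features, 2x frame subsampling after the first layer, and a softmax output over the tied CD targets. It is a simplified stand-in rather than the production model; in particular, latency control (chunked processing with 200ms lookahead) and cross-chunk state carry-over are omitted.

import torch
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    def __init__(self, num_targets, feat_dim=80, hidden=600, layers=4):
        super().__init__()
        # First BLSTM layer, after which frames are subsampled by a factor of two.
        self.first = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        # Remaining BLSTM layers, with dropout between layers.
        self.rest = nn.LSTM(2 * hidden, hidden, num_layers=layers - 1,
                            bidirectional=True, batch_first=True, dropout=0.5)
        # Output layer over tied CD graphemes (chenones) or phones (senones).
        self.output = nn.Linear(2 * hidden, num_targets)

    def forward(self, feats):
        # feats: (batch, frames, 80) log Mel-filterbank features at a 10ms frame shift.
        x, _ = self.first(feats)
        x = x[:, ::2, :]            # 2x subsampling -> posteriors at a 20ms frame rate
        x, _ = self.rest(x)
        return torch.log_softmax(self.output(x), dim=-1)

if __name__ == "__main__":
    model = BLSTMAcousticModel(num_targets=8000)   # ~7K-9K chenones/senones
    chunk = torch.randn(2, 128, 80)                # two 1.28s chunks of features
    print(model(chunk).shape)                      # torch.Size([2, 64, 8000])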

We employ speed perturbation [26] and SpecAugment’s LD policy [33] when training on Librispeech. Although these are considered data augmentation techniques, they do not use any additional resources other than the provided Librispeech data. We do not use any data augmentation for the other two datasets as we found no significant benefit from these techniques when the training set is large (more than 10K hours) and the model is small (less than 50M params).
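A hedged torchaudio sketch of the masking portion of SpecAugment is shown below; the parameters reflect our reading of the LD policy in [33] (two frequency masks of width up to 27 mel bins and two time masks of width up to 100 frames), time warping is omitted, and the values should be checked against the original paper.

import torch
import torchaudio.transforms as T

freq_mask = T.FrequencyMasking(freq_mask_param=27)   # assumed F=27 from the LD policy
time_mask = T.TimeMasking(time_mask_param=100)       # assumed T=100 from the LD policy

def spec_augment(log_mel):
    """log_mel: (channel, n_mels, frames) tensor; returns a masked copy."""
    out = log_mel.clone()
    for _ in range(2):      # two frequency masks
        out = freq_mask(out)
    for _ in range(2):      # two time masks
        out = time_mask(out)
    return out

if __name__ == "__main__":
    spec = torch.randn(1, 80, 1000)   # e.g., a 10s utterance at a 10ms frame shift
    print(spec_augment(spec).shape)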

The AM training process happens in two stages. First, we train the model with the Cross Entropy (CE) loss for 25 epochs (Librispeech) or 20 epochs (Video and Assistant) with a batch size of 128, the Adam optimizer [25], 0.5 dropout, and Block-wise Model-Update Filtering (BMUF) with 0.875 block momentum [6]. The learning rate is halved whenever the development frame error rate (FER) does not improve. We use 16 GPUs during CE training for Librispeech and 32 GPUs for Video and Assistant. The best CE model in terms of development WER is used as the initial seed for LF-MMI [36] training in the second stage; we found that bootstrapping from CE gives slightly better results than training LF-MMI from scratch. For LF-MMI, we train for 8 epochs with a batch size of 32, the Adam optimizer, 0.5 dropout, BMUF with 0.875 block momentum, a CE interpolation weight of 0.1, and a similar learning rate schedule. We use 24 GPUs during LF-MMI training for Librispeech and 48 GPUs for Video and Assistant. Unlike the original LF-MMI recipe, where training chunks are independent [36], our training chunks have no overlap (except for the lookahead) and the forward LSTM states are carried over between contiguous chunks. The best LF-MMI model in terms of development WER is used for the final evaluation.
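The learning-rate schedule above (halve the rate whenever the development FER stops improving) can be written compactly with PyTorch's ReduceLROnPlateau; the sketch below uses placeholder model and evaluation code and is not our training pipeline (BMUF and the LF-MMI loss are omitted).

import torch

model = torch.nn.Linear(80, 8000)                # placeholder acoustic model
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=0)

def evaluate_dev_fer(model):
    return 0.5                                   # placeholder: development frame error rate

for epoch in range(25):                          # e.g., 25 CE epochs for Librispeech
    # ... one epoch of cross-entropy training would go here ...
    dev_fer = evaluate_dev_fer(model)
    scheduler.step(dev_fer)                      # halves the LR if dev FER did not improve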

Dataset       Test set     Model    Ph    Gr
Librispeech   test-clean   4x600    4.0   3.8
              test-other   4x600    9.3   9.6
              test-clean   5x800    3.8   3.4
              test-other   5x800    9.0   8.4
              test-clean   6x1000   3.6   3.2
              test-other   6x1000   8.3   7.6
Video         vid-clean    4x600    16.1  15.0
              vid-noisy    4x600    22.9  21.9
Assistant     ast-test     4x600    5.2   4.7
Table 2: Word Error Rate (WER) comparison between phonetic (Ph) and graphemic (Gr) hybrid ASR systems.
System                          AM     LM   test-clean  test-other
RWTH (hybrid) [30]              180M   4g   4.2         9.3
Kaldi TDNN-F (Librispeech recipe)  23M    4g   3.8         8.8
CAPIO (single) [18]             N/A    4g   3.6         8.9
Wav2Letter [29]                 208M   4g   4.8         14.5
LAS + BPE [48]                  N/A    4g   4.8         15.3
TDS Conv [19]                   37M    4g   4.2         11.9
NVIDIA Jasper [27]              333M   6g   3.3         9.6
LAS + SpecAug [33]              360M   -    2.8         6.8
LAS + SpecAug [33]              360M   4g   2.5         5.8
Ours (Chenone)                  140M   4g   3.2         7.6
Table 3: Librispeech WER compared to published results, limited to single systems (no ensemble) without neural LM.
Phonetic ASR Output                                    | Graphemic ASR Output
G: graphemic performs better
then said sir fernando there is nothing for it…        | then said sir ferdinando there is nothing for it…
…without disturbing the household of gain will         | …without disturbing the household of gamewell
…mademoiselle determination on thinks…                 | …mademoiselle de tonnay charente thinks…
…and save us from the august might                     | …and save us from the ogre’s might
…in my morning room a jostling strock                  | …in my morning room at joscelyn’s rock
P: phonetic performs better
…the pre socratic philosophy are included…             | the priests so critic philosophy are included…
a great saint saint francis xavier                     | a great saint saint francis savior
marmalades and jams differ little from…                | margaret and james differ little from…
…would then be given up to arsinoe                     | …would then be given up to our sueno
…not my kind of haughtiness papa…                      | …not my kind of fortune as papa…

Table 4: Example graphemic and phonetic ASR output on Librispeech test sets. Errors are indicated in red.

5.4 Language Model and Decoding

We use our in-house one-pass dynamic decoder with n-gram LM for all evaluations. The decoding parameters are tuned to minimize WER on the development set. For Librispeech, we use the official unpruned 4-gram LM with 200K vocabulary and 144M n-grams. For Video, we train a pruned 5-gram LM with 450K vocabulary and 35M n-grams on the training transcripts. For Assistant, we train a pruned 4-gram LM with 85K vocabulary and 23M n-grams on the training transcripts plus additional text data to increase the coverage. The LMs for Librispeech and Assistant are all lower-cased, whereas the one for Video preserves the original casing information.

6 Results and Discussion

6.1 WER Comparison: Phonemes vs. Graphemes

Table 2 summarizes the WER of our phonetic (senone) and graphemic (chenone) hybrid ASR systems on Librispeech, Video, and Assistant. As can be seen, graphemic systems consistently outperform their phonetic counterparts on all three datasets. The relative WER improvement is larger on cleaner test sets: 8.4%–11.1% on Librispeech’s test-clean and test-other, 7.0% on Video’s vid-clean, and 8.3% on Assistant’s ast-test, compared to 4.5% on Video’s vid-noisy. As shown in Table 3, our WERs on Librispeech are state-of-the-art compared to other hybrid models and competitive with end-to-end approaches. We could possibly get further improvement by incorporating neural LM rescoring [30] and speaker adaptation.

It is interesting to note that as we increase the AM size from 4x600 (35M params) to 6x1000 (140M params) for Librispeech, the graphemic system improves significantly, reducing WER by 15.8% relative on test-clean and 20.8% on test-other. By contrast, the relative improvement is smaller for the phonetic system, 10.0% on test-clean and 10.8% on test-other. This suggests that graphemic units may provide a more fine-grained output space that more powerful acoustic models are able to exploit.

In order to better understand the differences between the two systems, we analyze Librispeech utterances in both test-clean and test-other where the graphemic system performs better and vice versa. As shown in Table 4, the phonetic system tends to misrecognize proper nouns with relatively poor pronunciations such as “de [D AH] tonnay [T AH N EY] charente [SH AA R EH N T EY]”, whereas the graphemic system typically does better on similar words. On the other hand, the graphemic system tends to fail on words whose grapheme sequences do not correspond well to how they are pronounced, such as “xavier” and “arsinoe.” The graphemic system also frequently makes homophone errors where two words are spelled differently but sound similar, such as “parlor” vs. “parlour” and “murdoch” vs. “murdock.” Developing methods to help graphemic systems rectify such errors will be an interesting future research direction.

6.2 Proper Noun and Rare Word Recognition

Dataset       Words    Split        Ph    Gr
Librispeech   Proper   test-clean   7.4   6.0
              Proper   test-other   16.5  14.4
              Rare     test-clean   7.6   7.0
              Rare     test-other   18.0  16.2
Table 5: Character Error Rate on proper nouns and rare words of phonetic (Ph) and graphemic (Gr) hybrid ASR systems.

The goal of this section is to objectively quantify the recognition accuracy on proper nouns and rare words on Librispeech, as a follow-up to the observations in Table 4. We first use an in-house named entity tagger to extract proper nouns from each test set. The number of tagged entities is as follows: test-clean (2.1K), test-other (2.2K). Examples of extracted proper nouns include Buckingham, Missouri, Saint Paul, John Calhoun, Voltaire, Leavenworth, and Jean Valjean. We then align the ASR hypothesis against the reference text and collect the hypothesis segments that are aligned with the tagged entities. Finally, we compute the Character Error Rate (CER) on these hypothesis-reference segment pairs, which represents the system’s error rate on proper nouns. We repeat this procedure to quantify CER on rare long-tail words, defined as words in the bottom 80% in terms of frequency in the training set. This results in 1.3K and 1.4K selected words in test-clean and test-other, respectively, or 2.6% of all words in the test sets.
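The sketch below illustrates only the scoring step: given hypothesis segments already aligned to the tagged reference entities, it computes an aggregate character error rate. The named entity tagging and hypothesis-reference alignment rely on in-house tools and are assumed to have been done beforehand; the helper names are ours.

def edit_distance(ref, hyp):
    """Character-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def segment_cer(pairs):
    """pairs: iterable of (reference_segment, hypothesis_segment) strings."""
    errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    chars = sum(len(ref) for ref, _ in pairs)
    return errors / max(chars, 1)

if __name__ == "__main__":
    examples = [("jean valjean", "jean valjean"), ("arsinoe", "our sueno")]
    print(f"CER: {segment_cer(examples):.2%}")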

As shown in Table 5, graphemic systems clearly outperform phonetic baselines on proper noun and rare word CER on both test sets. This is rather surprising given that proper nouns and rare words were shown to be a weakness of end-to-end graphemic LAS models [39]. It could be that chenones, due to context and position dependency, are more conducive to accurate proper noun and rare word recognition than the CI graphemes used in their work. It will be an interesting follow-up study to see if end-to-end models can leverage chenones to improve their results further. It is also particularly interesting to compare chenones and word pieces [48, 7, 21] more closely. Both methods can be considered context-dependent modeling, but chenones leverage acoustic information while word pieces only utilize text data.

6.3 Ablative Analysis

Model   Hours   Test set     Ph     Gr
5x800   50      test-clean   7.4    7.3
        50      test-other   18.0   17.8
        200     test-clean   5.0    4.6
        200     test-other   11.9   11.8
        480     test-clean   4.3    3.8
        480     test-other   9.9    9.5
        960     test-clean   3.8    3.4
        960     test-other   9.0    8.4
Table 6: Librispeech Word Error Rate of phonetic (Ph) and graphemic (Gr) ASR as a function of training data size.
Model   CD   PD   Test set     Ph    Gr
5x800   N    Y    test-clean   3.9   4.1
        N    Y    test-other   9.4   10.7
        Y    N    test-clean   4.0   3.9
        Y    N    test-other   8.9   9.2
        Y    Y    test-clean   3.8   3.4
        Y    Y    test-other   9.0   8.4
Table 7: Librispeech Word Error Rate of phonetic (Ph) and graphemic (Gr) ASR under different context dependency (CD) and position dependency (PD) settings.

In this section, we analyze the performance of our ASR systems under different conditions to better understand the impact of various modeling decisions.

We first investigate ASR performance as a function of training hours, limiting LC-BLSTM training to {50, 200, 480, 960} hours of Librispeech data sampled randomly from the overall training set. Table 6 shows that the difference between phonetic and graphemic systems starts out very small at 50 hours and gradually becomes larger as the data size increases. This suggests that graphemic AMs are better able to generalize given large amounts of data, echoing the earlier observation regarding network size in Section 6.1 as well as the finding that graphemic systems improve faster with more data [44]. One caveat with this analysis is that GMM bootstrapping and decision tree building were still done on the full 960 hours. Since these two steps are arguably more difficult for graphemes than for phonemes, the gap may become smaller, or even reverse, if GMM bootstrapping and tree building are also done on the trimmed training sets.

In terms of casing information, as hypothesized in Section 4, we observe slightly better performance on Video (the only dataset with casing information) with cased graphemes compared to lower-cased graphemes: 15.0 vs. 15.4 on vid-clean and 21.9 vs. 22.0 on vid-noisy. Most of the improvement came from correctly recognizing abbreviations due to cased grapheme modeling.

Finally, Table 7 shows that context and position dependency are critical for graphemes, but not for phonemes. CI phones outperform CI graphemes by 4.9% on test-clean and 12.1% on test-other. The trend is reversed when we add context dependency (specifically tri-context): while the phonetic results do not improve much, the graphemic results improve significantly. This confirms that CD modeling is especially important for graphemes and is in line with findings in previous work [17, 46]. Similarly, the phonetic system achieves similar performance with and without position dependency (the WB tag), but the non-WB graphemic system degrades by 14.7% on test-clean and 9.5% on test-other. This means that the trend observed with HMM-GMM [44] also holds for hybrid systems.

7 Conclusion and Future Work

In this paper, we establish that hybrid ASR systems utilizing chenones significantly outperform equivalent senone baselines in both overall WER and proper noun/rare word recognition. Powerful acoustic models, large training data, and context/position dependency are crucial to obtain optimal results with chenones. Based on these results, we argue that English hybrid ASR systems can be improved by removing the reliance on phonetic information, which in turn greatly simplifies the development process of new ASR models. Future work will explore using chenones with end-to-end techniques, improving graphemic results on long-tail words with unconventional spellings, and other graphemic modeling approaches beyond chenones.

References

  • [1] D. Amodei et al. (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pp. 173–182. Cited by: §1.
  • [2] K. Audhkhasi, B. Kingsbury, B. Ramabhadran, G. Saon, and M. Picheny (2018) Building competitive direct acoustics-to-word models for english conversational speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 4759–4763. Cited by: §2.
  • [3] E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu (2017) Exploring neural transducers for end-to-end speech recognition. In Automatic Speech Recognition and Understanding (ASRU), Vol. , pp. 206–213. Cited by: §1.
  • [4] M. Bisani and H. Ney (2008) Joint-sequence models for grapheme-to-phoneme conversion. Speech communication 50 (5), pp. 434–451. Cited by: §2.
  • [5] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1, §2.
  • [6] K. Chen and Q. Huo (2016) Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5880–5884. Cited by: §5.3.
  • [7] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, N. Jaitly, B. Li, J. Chorowski, and M. Bacchiani (2018) State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 4774–4778. Cited by: §1, §2, §6.2.
  • [8] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS), pp. 577–585. Cited by: §1.
  • [9] G. Dahl, D. Yu, L. Deng, and A. Acero (2011) Large Vocabulary Continuous Speech Recognition With Context-Dependent DBN-HMMS. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic. Cited by: §1, §4.
  • [10] G. Dahl, D. Yu, L. Deng, and A. Acero (2012) Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Transactions on Audio, Speech & Language Processing (TASLP) 20 (1). Cited by: §1.
  • [11] M. Gales, K. Knill, and A. Ragni (2015) Unicode-based graphemic systems for limited resource languages. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5186–5190. Cited by: §1, §2.
  • [12] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning (ICML), pp. 369–376. Cited by: §1, §2.
  • [13] A. Graves, N. Jaitly, and A. Mohamed (2013) Hybrid speech recognition with deep bidirectional lstm. In Automatic Speech Recognition and Understanding (ASRU), Cited by: §1.
  • [14] A. Graves and N. Jaitly (2014) Towards end-to-end speech recognition with recurrent neural networks. In Proceedings of the 31st International Conference on International Conference on Machine Learning (ICML), pp. II–1764–II–1772. Cited by: §1.
  • [15] A. Graves, A. Mohamed, and G. E. Hinton (2013) Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada. Cited by: §1.
  • [16] A. Graves (2012) Sequence transduction with recurrent neural networks. In ICML Representation Learning Workshop, Cited by: §1.
  • [17] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur (2018) End-to-end speech recognition using lattice-free mmi. In Proceedings of INTERSPEECH, pp. 12–16. Cited by: §1, §2, §4, §6.3.
  • [18] K. J. Han, A. Chandrashekaran, J. Kim, and I. R. Lane (2018) The CAPIO 2017 conversational speech recognition system. arXiv preprint arXiv:1801.00059. Cited by: Table 3.
  • [19] A. Hannun, A. Lee, Q. Xu, and R. Collobert (2019) Sequence-to-sequence speech recognition with time-depth separable convolutions. In Proceedings of INTERSPEECH, Cited by: Table 3.
  • [20] D. Harwath and J. Glass (2014) Speech recognition without a lexicon-bridging the gap between graphemic and phonetic systems.. In Proceedings of INTERSPEECH, pp. 2655–2659. Cited by: §1, §2.
  • [21] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. C. Sim, T. Bagby, S. Chang, K. Rao, and A. Gruenstein (2019) Streaming end-to-end speech recognition for mobile devices. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 6381–6385. Cited by: §1, §2, §6.2.
  • [22] G. Hinton, L. Deng, D. Yu, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, G. Dahl, and B. Kingsbury (2012) Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine 29 (6), pp. 82–97. Cited by: §1.
  • [23] S. Kanthak and H. Ney (2002) Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pp. I–845–I–848. External Links: ISSN 1520-6149 Cited by: §2.
  • [24] M. Killer, S. Stuker, and T. Schultz (2003) Grapheme based speech recognition. In European Conference on Speech Communication and Technology (EUROSPEECH), Cited by: §1, §2.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), Cited by: §5.3.
  • [26] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur (2015) Audio augmentation for speech recognition. In Proceedings of INTERSPEECH, Cited by: §5.3.
  • [27] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde (2019) Jasper: an end-to-end convolutional neural acoustic model. In Proceedings of INTERSPEECH, Cited by: Table 3.
  • [28] J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong (2018) Advancing acoustic-to-word ctc model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 5794–5798. Cited by: §2.
  • [29] V. Liptchinsky, G. Synnaeve, and R. Collobert (2017) Letter-based speech recognition with gated convnets. arXiv preprint arXiv:1712.09444. Cited by: Table 3.
  • [30] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney (2019) RWTH ASR systems for librispeech: hybrid vs attention - w/o data augmentation. In Proceedings of INTERSPEECH, Cited by: Table 3, §6.1.
  • [31] A. Mohamed, G. E. Dahl, and G. E. Hinton (2012) Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech & Language Processing 20 (1), pp. 14–22. Cited by: §1.
  • [32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pp. 5206–5210. Cited by: §1, §3.1.
  • [33] D. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. Cubuk, and Q. Le (2019) Specaugment: a simple data augmentation method for automatic speech recognition. In Proceedings of INTERSPEECH, Cited by: §2, §5.3, Table 3.
  • [34] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur (2018) Low latency acoustic modeling using temporal convolution and lstms. IEEE Signal Processing Letters 25 (3), pp. 373–377. Cited by: §1, §5.3.
  • [35] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur (2018) Semi-orthogonal low-rank matrix factorization for deep neural networks. In Proceedings of INTERSPEECH, Cited by: §1.
  • [36] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur (2016) Purely sequence-trained neural networks for ASR based on lattice-free MMI. Proceedings of INTERSPEECH. Cited by: §5.3.
  • [37] R. Prabhavalkar, K. Rao, T. Sainath, B. Li, L. Johnson, and N. Jaitly (2017) A comparison of sequence-to-sequence models for speech recognition. In Proceedings of INTERSPEECH, pp. 939–943. Cited by: §1.
  • [38] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak (2015) Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia. Cited by: §1.
  • [39] T. Sainath, R. Prabhavalkar, S. Kumar, S. Lee, A. Kannan, D. Rybach, V. Schogol, P. Nguyen, B. Li, Y. Wu, et al. (2018) No need for a lexicon? evaluating the value of the pronunciation lexica in end-to-end models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5859–5863. Cited by: §2, §6.2.
  • [40] H. Sak, A. Senior, K. Rao, and F. Beaufays (2015) Fast and accurate recurrent neural network acoustic models for speech recognition. In Proceedings of INTERSPEECH, pp. 1468–1472. Cited by: §1, §2.
  • [41] H. Sak, A. W. Senior, and F. Beaufays (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Proceedings of INTERSPEECH, Singapore. Cited by: §1.
  • [42] H. Sak, O. Vinyals, G. Heigold, A. W. Senior, E. McDermott, R. Monga, and M. Mao (2014) Sequence Discriminative Distributed Training of Long Short-Term Memory Recurrent Neural Networks. In Proceedings of INTERSPEECH, Singapore. Cited by: §1.
  • [43] H. Soltau, H. Liao, and H. Sak (2017-08) Neural speech recognizer: acoustic-to-word lstm model for large vocabulary speech recognition. In Proceedings of INTERSPEECH, pp. 3707–3711. Cited by: §2.
  • [44] Y. Sung, T. Hughes, F. Beaufays, and B. Strope (2009) Revisiting graphemes with increasing amounts of data. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4449–4452. Cited by: §2, §4, §6.3, §6.3.
  • [45] S. Ueno, H. Inaguma, M. Mimura, and T. Kawahara (2018) Acoustic-to-word attention-based model complemented with character-level ctc-based model. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 5804–5808. Cited by: §2.
  • [46] Y. Wang, X. Chen, M. Gales, A. Ragni, and J. Wong (2018) Phonetic and graphemic systems for multi-genre broadcast transcription. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5899–5903. Cited by: §1, §2, §4, §6.3.
  • [47] S. J. Young, J. J. Odell, and P. C. Woodland (1994) Tree-based state tying for high accuracy acoustic modelling. In Proceedings of the Workshop on Human Language Technology, pp. 307–312. Cited by: §1, §4.
  • [48] A. Zeyer, K. Irie, R. Schlüter, and H. Ney (2018) Improved training of end-to-end attention models for speech recognition. In Proceedings of INTERSPEECH, Cited by: §1, §2, Table 3, §6.2.
  • [49] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass (2016) Highway long short-term memory rnns for distant speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5755–5759. Cited by: §5.3.