Conventional ASR systems consist of three components trained independently: an acoustic model (AM), a pronunciation model (PM) and a language model (LM). CD-states and CD-phonemes dominate as the modeling units in such systems [1, 2, 3]. Recently, however, this framework has been challenged by sequence-to-sequence attention-based models. These models are commonly composed of an encoder, which consists of multiple recurrent neural network (RNN) layers that model the acoustics, and a decoder, which consists of one or more RNN layers that predict the output sub-word sequence. An attention layer acts as the interface between the encoder and the decoder: it selects frames in the encoder representation that the decoder should attend to in order to predict the next sub-word unit. Sainath et al. [5] experimentally verified that a grapheme-based sequence-to-sequence attention-based model can outperform the corresponding phoneme-based model on English ASR tasks. This result is compelling because it suggests the hand-designed lexicon can be removed from ASR systems. Generating a pronunciation lexicon is laborious and time-consuming; without it, the design of ASR systems would be simplified greatly. Furthermore, the latest work [6] shows that the attention-based encoder-decoder architecture achieves a new state-of-the-art WER on an English voice search task using word piece models (WPM), which are sub-word units ranging from graphemes all the way up to entire words.
Given the outstanding performance of grapheme-based modeling units on English ASR tasks, we conjecture that a hand-designed lexicon may be unnecessary for Mandarin Chinese ASR as well when sequence-to-sequence attention-based models are used. In Mandarin Chinese, if the hand-designed lexicon is removed, the modeling units can be words, sub-words or characters. Character-based sequence-to-sequence attention-based models have been investigated on Mandarin Chinese ASR tasks in [7, 8], but a performance comparison across different modeling units has not been explored before. Building on our work [9], which shows that a syllable based model with the Transformer can perform better than its CI-phoneme based counterpart, we investigate five modeling units on Mandarin Chinese ASR tasks: CI-phonemes, syllables (pinyins with tones), words, sub-words and characters. The Transformer is chosen as the basic sequence-to-sequence attention-based architecture in this paper [9, 10]. Experiments on the HKUST datasets confirm our hypothesis that the lexicon-free modeling units, i.e. words, sub-words and characters, can outperform the lexicon-related modeling units, i.e. CI-phonemes and syllables. Among the five modeling units, the character based model with the Transformer achieves the best result and establishes a new state-of-the-art CER on the HKUST datasets without a hand-designed lexicon or an extra language model, a relative reduction in CER compared to the existing best result of the joint CTC-attention based encoder-decoder network with a separate RNN-LM integration [11].
2 Related work
Sequence-to-sequence attention-based models have achieved promising results on English ASR tasks, and various modeling units have been studied recently, such as CI-phonemes, CD-phonemes, graphemes and WPM [4, 5, 6, 12]. Sainath et al. [5] first explored a sequence-to-sequence attention-based model trained with phonemes for ASR tasks and compared grapheme and phoneme modeling units, experimentally verifying that the grapheme-based model can outperform the corresponding phoneme-based model on English ASR tasks. Furthermore, WPM modeling units, which are sub-word units ranging from graphemes all the way up to entire words, have been explored in [6] and achieved a new state-of-the-art WER on an English voice search task.
Although sequence-to-sequence attention-based models perform very well on English ASR tasks, related work on Mandarin Chinese ASR tasks is scarce. Chan et al. [7] first proposed a Character-Pinyin sequence-to-sequence attention-based model for Mandarin Chinese ASR, in which the Pinyin information was used during training to improve the performance of the character model. Instead of a joint Character-Pinyin model, [8] directly used Chinese characters as the network output by mapping the one-hot character representation to an embedding vector via a neural network layer. Moreover, [13] compared character and syllable modeling units with sequence-to-sequence attention-based models.
Besides character units, word and sub-word modeling units are investigated on Mandarin Chinese ASR tasks in this paper. Sub-word units encoded by byte pair encoding (BPE) [14], which iteratively replaces the most frequent pair of characters with a single, unused symbol, have been explored on neural machine translation (NMT) tasks to address the out-of-vocabulary (OOV) problem in open-vocabulary translation. We extend this approach to Mandarin Chinese ASR tasks. BPE can encode an open vocabulary with a compact symbol vocabulary of variable-length sub-word units and requires no shortlist.
3 System overview
3.1 ASR Transformer model architecture
The Transformer model architecture is the same as that of other sequence-to-sequence attention-based models except that it relies entirely on self-attention and position-wise, fully connected layers for both the encoder and decoder [15]. The encoder maps an input sequence of symbol representations x = (x_1, ..., x_n) to a sequence of continuous representations z = (z_1, ..., z_n). Given z, the decoder then generates an output sequence y = (y_1, ..., y_m) of symbols one element at a time.
The ASR Transformer architecture used in this work is the same as in our work [9] and is shown in Figure 1. It stacks multi-head attention (MHA) [15] and position-wise, fully connected layers for both the encoder and decoder. The encoder is composed of a stack of identical layers, each with two sub-layers: the first is an MHA, and the second is a position-wise fully connected feed-forward network. Residual connections are employed around each of the two sub-layers, followed by layer normalization. The decoder is similar to the encoder except that it inserts a third sub-layer to perform an MHA over the output of the encoder stack. To prevent leftward information flow and preserve the auto-regressive property, the self-attention sub-layers in the decoder mask out all values corresponding to illegal connections. In addition, positional encodings are added to the inputs at the bottoms of the encoder and decoder stacks to inject information about the relative or absolute position of the tokens in the sequence.
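The masking of illegal connections in the decoder's self-attention can be illustrated with a small NumPy sketch (a toy illustration with names of our own choosing, not the paper's implementation):

```python
import numpy as np

def causal_mask(length):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((length, length), dtype=bool))

def masked_attention_weights(scores, mask):
    """Softmax over attention scores with disallowed positions set to -inf,
    so that they receive exactly zero attention weight."""
    scores = np.where(mask, scores, -np.inf)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

With uniform scores, the first decoder position attends only to itself, while later positions spread their attention over all earlier positions, which is exactly the auto-regressive property described above.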
3.2 Modeling units
Five modeling units are compared on Mandarin Chinese ASR tasks: CI-phonemes, syllables, words, sub-words and characters. Table 1 summarizes the numbers of output units investigated in this paper, and Table 2 shows an example of the various modeling units.
3.2.1 CI-phoneme and syllable units
CI-phoneme and syllable units are compared in our work [9], in which CI-phonemes without silence (phonemes with tones) are employed in the CI-phoneme based experiments and syllables (pinyins with tones) in the syllable based experiments. Extra tokens (e.g. an unknown token) are appended to the outputs, making up the total numbers of outputs of the CI-phoneme based model and the syllable based model, respectively. Standard tied-state cross-word triphone GMM-HMMs are first trained with maximum likelihood estimation to generate CI-phoneme alignments on the training set. Syllable alignments are then generated from these CI-phoneme alignments according to the lexicon, which can handle multiple pronunciations of the same word in Mandarin Chinese.
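The lexicon-based conversion from phoneme sequences to syllables can be sketched as a toy longest-match segmentation (the actual conversion described above works on GMM-HMM alignments; the function and lexicon format here are hypothetical illustrations):

```python
def phonemes_to_syllables(phonemes, lexicon):
    """Segment a CI-phoneme sequence into syllables by longest match against
    a {syllable: phoneme-tuple} lexicon (toy stand-in for the alignment-based
    conversion used in the paper)."""
    out, i = [], 0
    while i < len(phonemes):
        # Try the longest remaining span first, shrinking until a match.
        for j in range(len(phonemes), i, -1):
            key = tuple(phonemes[i:j])
            match = [s for s, p in lexicon.items() if p == key]
            if match:
                out.append(match[0])
                i = j
                break
        else:
            raise ValueError(f"no syllable covers {phonemes[i]!r}")
    return out
```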
During the decoding stage, the outputs are CI-phoneme or syllable sequences. To convert them into word sequences, a greedy cascading decoder with the Transformer [9] is proposed: first, the best CI-phoneme or syllable sequence is computed from the observation by the ASR Transformer with beam search; then, the best word sequence is chosen from that sequence by the NMT Transformer, again with beam search. By cascading these two Transformer models, we assume that the posterior probability of the word sequence given the observation can be approximated.
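The two-stage decoding can be sketched with a generic beam search cascaded over toy scoring functions (a hedged illustration only; `step_fn`, the beam sizes and the toy models are our own, not the paper's code):

```python
def beam_search(step_fn, start, beam_size, max_len):
    """Generic beam search. step_fn(prefix) -> list of (token, logprob);
    returns the best-scoring sequence when no expansion is possible."""
    beams = [(0.0, start)]
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for token, logp in step_fn(prefix):
                candidates.append((score + logp, prefix + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][1]

def cascade_decode(asr_step, nmt_step, features, beam_asr, beam_nmt, max_len):
    """Greedy cascade: best syllable sequence from the ASR model, then
    best word sequence from the NMT model given those syllables."""
    syllables = beam_search(lambda p: asr_step(features, p), [], beam_asr, max_len)
    return beam_search(lambda p: nmt_step(syllables, p), [], beam_nmt, max_len)
```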
3.2.2 Sub-word units
Sub-word units are generated by BPE [14], which iteratively merges the most frequent pair of characters or character sequences into a single, unused symbol. First, the symbol vocabulary is initialized with the character vocabulary, and each word is represented as a sequence of characters plus a special end-of-word symbol ‘@@’, which allows the original tokenization to be restored. Then all symbol pairs are counted iteratively, and each occurrence of the most frequent pair (‘A’, ‘B’) is replaced with a new symbol ‘AB’. Each merge operation produces a new symbol representing a character n-gram, so frequent character n-grams (or whole words) are eventually merged into a single symbol. The final symbol vocabulary size is therefore equal to the size of the initial vocabulary plus the number of merge operations, which is the hyperparameter of this algorithm.
BPE can encode an open vocabulary with a compact symbol vocabulary of variable-length sub-word units and requires no shortlist. After BPE encoding, the sub-word units range from characters all the way up to entire words. Thus there are no OOV words with BPE, and frequent sub-words are preserved.
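The merge procedure described above can be sketched as a small BPE learner in the style of Sennrich et al. [14] (a toy sketch over a word-count dictionary, not the paper's actual pipeline):

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Learn BPE merge operations from a {word: count} dict."""
    # Each word starts as a tuple of characters plus the end-of-word marker
    # ('@@' here, following the paper's notation).
    vocab = {tuple(word) + ('@@',): n for word, n in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, n in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_vocab = {}
        for symbols, n in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = n
        vocab = new_vocab
    return merges, vocab
```

With enough merges, frequent whole words collapse into single symbols, while rare words remain decomposed into smaller sub-word units.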
In our experiments, we choose the number of merge operations, which determines the number of sub-word units generated from the training transcripts. After the extra tokens are appended, the total number of outputs is obtained.
3.2.3 Word and character units
For word units, we collect all words from the training transcripts; appended with the extra tokens, they form the total number of outputs.
For character units, all Mandarin Chinese characters together with the English words in the training transcripts are collected and appended with the extra tokens to generate the total number of outputs (we manually delete two tokens which are not Mandarin Chinese characters).
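Collecting the character-level output units can be sketched as follows (a minimal sketch; the extra-token names and the ASCII heuristic for spotting English words are our assumptions, not the paper's exact procedure):

```python
def build_char_vocab(transcripts, extra_tokens=('<unk>', '<s>', '</s>')):
    """Collect Mandarin characters (and whole English words) as output units
    from whitespace-tokenized transcripts."""
    units = set()
    for line in transcripts:
        for token in line.split():
            if token.isascii():        # treat ASCII tokens as English words
                units.add(token)
            else:                      # split Mandarin tokens into characters
                units.update(token)
    return list(extra_tokens) + sorted(units)
```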
4 Experiments

The HKUST corpus (LDC2005S15, LDC2005T32), a corpus of Mandarin Chinese conversational telephone speech, was collected and transcribed by the Hong Kong University of Science and Technology (HKUST) [16]. It contains 150 hours of speech, with 873 calls in the training set and 24 calls in the test set. All experiments are conducted using 80-dimensional log-Mel filterbank features, computed with a 25 ms window shifted every 10 ms. The features are normalized via mean subtraction and variance normalization on a per-speaker basis. Similar to [17, 18], each frame is stacked with the 3 frames to its left, and the stacked features are downsampled to a 30 ms frame rate. We also generate more training data by linearly scaling the audio lengths (speed perturbation), which improves performance in our experiments.
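The frame stacking and downsampling step can be sketched in NumPy (a toy sketch under our own naming; edge frames are padded by repeating the first frame, which is one of several reasonable conventions):

```python
import numpy as np

def stack_and_downsample(frames, left_context=3, skip=3):
    """Stack each frame with `left_context` frames to its left, then keep
    every `skip`-th stacked frame (10 ms -> 30 ms frame rate)."""
    n, _ = frames.shape
    # Pad on the left by repeating the first frame.
    padded = np.concatenate([np.repeat(frames[:1], left_context, axis=0), frames])
    # Column blocks are [t-3, t-2, t-1, t] for each frame t.
    stacked = np.concatenate(
        [padded[i: i + n] for i in range(left_context + 1)], axis=1)
    return stacked[::skip]
```

For 80-dimensional filterbanks this yields 320-dimensional vectors at one third of the original frame rate.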
We perform our experiments on the base and big models (D512-H8 and D1024-H16, respectively) of the Transformer [15]. The two models share the same basic architecture but differ in their parameter settings; Table 3 lists the parameters of both. The Adam algorithm [19] with gradient clipping and warmup is used for optimization. During training, label smoothing is employed [20]. After training, the last 20 checkpoints are averaged to make the performance more stable.
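Both regularization tricks above are simple to state in code; the sketch below is illustrative (our own function names, NumPy-based, not the training code used in the paper):

```python
import numpy as np

def smoothed_targets(labels, num_classes, eps):
    """Label smoothing: 1 - eps on the true class, eps spread uniformly
    over the remaining classes."""
    t = np.full((len(labels), num_classes), eps / (num_classes - 1))
    t[np.arange(len(labels)), labels] = 1.0 - eps
    return t

def average_checkpoints(checkpoints):
    """Element-wise average of parameter dicts from the last N checkpoints."""
    return {k: sum(c[k] for c in checkpoints) / len(checkpoints)
            for k in checkpoints[0]}
```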
In the CI-phoneme and syllable based models, we cascade an ASR Transformer and an NMT Transformer to generate word sequences from the observation. However, we do not employ an NMT Transformer in the word, sub-word and character based models, since the beam search results from the ASR Transformer are already at the Chinese character level. The total numbers of parameters for the different modeling units are listed in Table 4.
As described in Section 3.2, the modeling units of words, sub-words and characters are lexicon-free, i.e. they do not need a hand-designed lexicon. On the contrary, the modeling units of CI-phonemes and syllables require one.
Our results are summarized in Table 5. The lexicon-free modeling units, i.e. words, sub-words and characters, clearly outperform the corresponding lexicon-related modeling units, i.e. CI-phonemes and syllables, on the HKUST datasets. This confirms our hypothesis that sequence-to-sequence attention-based models can remove the need for a hand-designed lexicon on Mandarin Chinese ASR tasks. Moreover, the sub-word based model performs better than the word based counterpart, which suggests that sub-word units are superior to word units, since sub-word units encoded by BPE have fewer outputs and no OOV problem. However, the sub-word based model performs worse than the character based model. A possible reason is that sub-word units are larger than characters and therefore harder to train. We will conduct experiments on larger datasets and compare sub-word and character modeling units in future work. Finally, among the five modeling units, the character based model with the Transformer achieves the best result, demonstrating that the character unit is suitable for Mandarin Chinese ASR tasks with sequence-to-sequence attention-based models and can simplify the design of ASR systems greatly.
4.4 Comparison with previous works
In Table 6, we compare our experimental results to other model architectures from the literature on the HKUST datasets. First, our best results with the different modeling units are comparable or superior to the best result of the deep multidimensional residual learning model with 9 LSTM layers [21], a hybrid LSTM-HMM system with CD-state modeling units. The best CER of the character based model with the Transformer achieves a relative reduction compared to the best CER of that model, which shows the superiority of the sequence-to-sequence attention-based model over the hybrid LSTM-HMM system.
Moreover, our best results with the word, sub-word and character modeling units are superior to the existing best CER of the joint CTC-attention based encoder-decoder network with a separate RNN-LM integration [11], which is, to the best of our knowledge, the state-of-the-art on the HKUST datasets. The character based model with the Transformer establishes a new state-of-the-art CER on the HKUST datasets without a hand-designed lexicon or an extra language model: a relative reduction in CER compared to the joint CTC-attention based encoder-decoder network when no external language model is used, and a relative reduction compared to its existing best CER with a separate RNN-LM [11].
5 Conclusions

In this paper we compared five modeling units for Mandarin Chinese ASR tasks with a sequence-to-sequence attention-based model, the Transformer: CI-phonemes, syllables, words, sub-words and characters. We experimentally verified that the lexicon-free modeling units, i.e. words, sub-words and characters, can outperform the lexicon-related modeling units, i.e. CI-phonemes and syllables, on the HKUST datasets. This suggests that sequence-to-sequence attention-based models may remove the need for a hand-designed lexicon on Mandarin Chinese ASR tasks. Among the five modeling units, the character based model achieves the best result and establishes a new state-of-the-art CER on the HKUST datasets without a hand-designed lexicon or an extra language model, a relative improvement over the existing best CER of the joint CTC-attention based encoder-decoder network [11]. Moreover, the sub-word based model with the Transformer, encoded by BPE, achieves a promising result, although it is slightly worse than the character based counterpart.
Acknowledgements

The authors would like to thank Chunqi Wang and Feng Wang for insightful discussions.
References

-  G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 30–42, 2012.
-  H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
-  A. Senior, H. Sak, and I. Shafran, “Context dependent phone models for lstm rnn acoustic modelling,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4585–4589.
-  R. Prabhavalkar, T. N. Sainath, B. Li, K. Rao, and N. Jaitly, “An analysis of attention in sequence-to-sequence models,” in Proc. of Interspeech, 2017.
-  T. N. Sainath, R. Prabhavalkar, S. Kumar, S. Lee, A. Kannan, D. Rybach, V. Schogol, P. Nguyen, B. Li, Y. Wu et al., “No need for a lexicon? evaluating the value of the pronunciation lexica in end-to-end models,” arXiv preprint arXiv:1712.01864, 2017.
-  C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv preprint arXiv:1712.01769, 2017.
-  W. Chan and I. Lane, “On online attention-based speech recognition and joint mandarin character-pinyin training.” in INTERSPEECH, 2016, pp. 3404–3408.
-  C. Shan, J. Zhang, Y. Wang, and L. Xie, “Attention-based end-to-end speech recognition on voice search.”
-  S. Zhou, L. Dong, S. Xu, and B. Xu, “Syllable-Based Sequence-to-Sequence Speech Recognition with the Transformer in Mandarin Chinese,” ArXiv e-prints, Apr. 2018.
-  L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference on. IEEE, 2018, pp. 5884–5888.
-  T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint ctc-attention based end-to-end speech recognition with a deep cnn encoder and rnn-lm,” arXiv preprint arXiv:1706.02737, 2017.
-  R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, “A comparison of sequence-to-sequence models for speech recognition,” in Proc. Interspeech, 2017, pp. 939–943.
-  W. Zou, D. Jiang, S. Zhao, and X. Li, “A comparable study of modeling units for end-to-end mandarin speech recognition,” arXiv preprint arXiv:1805.03832, 2018.
-  R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” arXiv preprint arXiv:1508.07909, 2015.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 6000–6010.
-  Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, “Hkust/mts: A very large scale mandarin telephone speech corpus,” in Chinese Spoken Language Processing. Springer, 2006, pp. 724–735.
-  H. Sak, A. Senior, K. Rao, and F. Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” arXiv preprint arXiv:1507.06947, 2015.
-  A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar, “An analysis of incorporating an external language model into a sequence-to-sequence model,” arXiv preprint arXiv:1712.01996, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
-  Y. Zhao, S. Xu, and B. Xu, “Multidimensional residual learning based on recurrent neural networks for acoustic modeling,” Interspeech 2016, pp. 3419–3423, 2016.