Expanding the coverage of the world’s languages in Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) systems has been attracting much interest in both academia and industry [1, 2]. Conventional phonetically-based speech processing systems require pronunciation dictionaries that map phonetic units to words. Building such resources requires expert knowledge for each language. Even with the costly human effort involved, many languages do not have sufficient linguistic resources available for building such dictionaries. Additionally, inconsistencies between phonetic systems are challenging to resolve when merging different languages.
For grapheme-based systems, an orthographic lexicon instead of a pronunciation dictionary is used to provide a vocabulary list. With recent advances in end-to-end (E2E) modeling, graphemes have become a popular choice. For example, Connectionist Temporal Classification (CTC) models have been built to directly output graphemes, while [9, 10, 11] used graphemes in sequence-to-sequence (seq2seq) models. Sub-word units were used in seq2seq [12, 13, 14] and RNNT models, and word units were used by [16, 17]. Similarly, graphemes are also commonly used to build end-to-end TTS systems [18, 19, 20].
The use of graphemes brings model simplicity and enables end-to-end optimization, which has been shown to yield better performance than phoneme-based models. However, unlike phonemes, the size of the grapheme vocabulary varies greatly across languages. For example, many East Asian languages, such as Chinese, Japanese and Korean, have tens of thousands of graphemes. With limited amounts of training data, many graphemes may have little or no coverage. The label sparsity issue becomes even more severe for multilingual models, where one needs to pool all the distinct graphemes from all languages together, resulting in a very large vocabulary that often has long-tail graphemes with very poor coverage.
To address these problems, prior work explored the use of features from Unicode character descriptions to construct decision trees for clustering graphemes. However, when the model changes to support a new language, the decision tree needs to be updated. Recently, there has been work on using Unicode bytes to represent text, including an LSTM-based multilingual byte-to-span model that consumes the input text byte by byte and outputs span annotations. Unicode bytes are language independent, and hence a single model can be used for many languages. The vocabulary size of Unicode bytes is always 256 and does not increase when pooling more languages together, which makes bytes preferable to graphemes for multilingual applications.
In this work, we investigate the potential of this byte-sequence text representation for speech processing. For ASR, we adopt the Listen, Attend and Spell (LAS) model to convert input speech into sequences of Unicode bytes which correspond to the UTF-8 encoding of the target texts. This model is referred to as the Audio-to-Byte (A2B) model. For TTS, our model is based on the Tacotron 2 architecture and generates speech signals from an input byte sequence. This model is referred to as the Byte-to-Audio (B2A) model. Since both the A2B model and the B2A model operate directly on Unicode bytes, they can handle any number of languages written in Unicode without any modification to the input processing. Because the vocabulary is small (256 units), our models can be very compact and well suited to on-device applications.
We report recognition results for the A2B model on 4 different languages: English, Japanese, Spanish and Korean. First, for each individual language, we compare our A2B models to Audio-to-Char (A2C) models which emit grapheme outputs. For English and Spanish, where the graphemes are single-byte characters, A2B has exactly the same performance as A2C, as expected. However, for languages that have a large grapheme vocabulary, such as Japanese and Korean, the label sparsity issue hurts the performance of A2C models, whereas the A2B model shares bytes across graphemes and performs better than A2C models. Benefiting from the language-independent representation of Unicode bytes, we find it is possible to progressively add support for new languages when building a multilingual A2B model. Specifically, we start with an A2B model trained on English and Japanese and add a new language after convergence. When adding a new language, we usually make sure that it has the highest mixing ratio while keeping a small portion for each of the existing languages to avoid forgetting them. We experiment with adding Spanish and Korean one at a time. In this way, we can reuse the previously built model and expand the language coverage without modifying the model structure. For multilingual ASR, we find that an A2B model trained in this way is better than one trained from scratch. In addition, by adding a 1-hot language vector to the A2B system, which has been shown to boost multi-dialect and multilingual system performance [23, 24], we find that the multilingual A2B system outperforms all the language dependent ones.
We evaluate the B2A model on 3 different languages: English, Mandarin and Spanish. Again, we compare B2A models with those that take graphemes as input. For all three languages, B2A performs similarly on quantitative subjective evaluations to grapheme-based models trained on single languages, thus providing a more compact multilingual TTS model.
2 Multilingual Audio-to-Byte (A2B)
2.1 Model Structure
The Audio-to-Byte (A2B) model is based on the Listen, Attend and Spell (LAS) model, with the output target changed from graphemes to Unicode bytes. The encoder network consists of 5 unidirectional Long Short-Term Memory (LSTM) layers, with each layer having hidden units. The decoder network consists of 2 unidirectional LSTM layers with hidden units. Additive content-based attention with 4 attention heads is used to learn the alignment between the input audio features and the output target units. The output layer is a 256-dimensional softmax, corresponding to the 256 possible byte values.
Our front-end consists of 80-dimensional log-mel features, computed with a 25ms window and shifted every 10ms. Similar to [27, 28], at each current frame, these features are stacked with 3 consecutive frames to the left and then down-sampled to a 30ms frame rate.
The amount of training data usually varies across languages. For example, for English we have around 3.5 times the amount of data compared to the other languages. More details about data can be found in Section 4. In this work, we adjust the data sampling ratio of the different languages to help tackle the data imbalance. We choose the sampling ratio based on intuition and empirical observations. Specifically, we start by mixing the languages equally and then increase the ratio for a language whose performance needs more improvement. In addition, a simple 1-hot language ID vector has been found to be effective at improving multilingual systems [23, 24]. We adopt this 1-hot language ID vector as an additional input to the A2B models and concatenate it to the input of every layer, in both the encoder and the decoder.
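The frame stacking and down-sampling step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `stack_and_subsample` is hypothetical, and repeating the first frame to pad the left context is an assumption, since the text does not say how the utterance boundary is handled.

```python
import numpy as np


def stack_and_subsample(feats, left_context=3, stride=3):
    """Stack each frame with its left neighbours, then keep every `stride`-th frame.

    feats: array of shape [num_frames, feat_dim], one row per 10 ms frame.
    Returns an array of shape
    [ceil(num_frames / stride), feat_dim * (left_context + 1)].
    """
    num_frames = feats.shape[0]
    # Assumption: repeat the first frame so early frames have a full left context.
    padding = np.repeat(feats[:1], left_context, axis=0)
    padded = np.concatenate([padding, feats], axis=0)
    # Column block i holds frame (t - left_context + i) for every output row t.
    stacked = np.concatenate(
        [padded[i:i + num_frames] for i in range(left_context + 1)], axis=1
    )
    # Keeping every third 10 ms frame gives a 30 ms frame rate.
    return stacked[::stride]


log_mel = np.random.default_rng(0).normal(size=(100, 80))  # 1 s of 10 ms frames
encoder_input = stack_and_subsample(log_mel)
print(encoder_input.shape)  # (34, 320)
```

With 80-dimensional features and 3 frames of left context, each encoder input vector has 320 dimensions, and the sequence is three times shorter than the original frame sequence.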
2.2 Output Unit
End-to-end speech recognition models have typically used characters, sub-words, word-pieces or words as the output unit of choice. Word-based units are difficult to scale for languages with large vocabularies, which makes the softmax prohibitively large, especially in multilingual models. One solution is to use data-driven word-piece models. Word-pieces learned from data can be trained to have a fixed vocabulary size. However, this requires building a new word-piece model whenever a new language or new data is added. Additionally, building a multilingual word-piece model is challenging due to the unbalanced grapheme distribution. Grapheme units give the smallest vocabulary size among these units; however, some languages still have very large vocabularies. For example our Japanese vocabulary has over k characters. In this work, we explore decomposing graphemes into a sequence of Unicode bytes.
Our A2B model generates the text sequence one Unicode byte at a time. We represent text as a sequence of UTF-8 bytes, in which each character occupies a variable number of bytes. For languages with single-byte characters (e.g., English), the use of byte output is equivalent to the grapheme character output. However, for languages with multi-byte characters, such as Japanese and Korean, the A2B model needs to generate a correct sequence of bytes to emit one grapheme token. This requires the model to learn both the short-term within-grapheme byte dependencies and the long-term inter-grapheme or even inter-word/phrase dependencies, which is potentially a harder task than in a grapheme-based system.
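To make the byte representation concrete, the short sketch below converts text to UTF-8 byte IDs and back using only Python's built-in string encoding. The helper names are hypothetical, and replacing invalid byte sequences with the Unicode replacement character is an assumption; the text does not describe how malformed model output is handled.

```python
def text_to_byte_ids(text):
    """Encode text as a list of integers in [0, 255], one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(byte_ids):
    """Decode byte IDs back to text.

    Assumption: invalid or incomplete byte sequences are replaced with U+FFFD
    instead of raising an error.
    """
    return bytes(byte_ids).decode("utf-8", errors="replace")


for sample in ["google", "音", "google 音声認識"]:
    ids = text_to_byte_ids(sample)
    print(sample, len(sample), "graphemes ->", len(ids), "bytes:", ids)
```

An ASCII word such as "google" maps to one byte per character, while each of the Japanese characters in the example maps to three bytes, so the model must emit all three bytes correctly to produce one grapheme.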
The main advantage of the byte representation is its language independence. Any script of any language representable by Unicode can be represented by a byte sequence, and there is no need to change the existing model structure. For grapheme models, in contrast, the output softmax layer must be changed whenever a new symbol is added. This language independence makes the byte representation preferable for modeling multiple languages, as well as code-switching speech, within a single model.
3 Multilingual Byte-to-Audio (B2A)
3.1 Model Structure
The Byte-to-Audio (B2A) model is based on the Tacotron 2 model. The input byte sequence embedding is encoded by three convolutional layers, which contain 512 filters with shape , followed by a bidirectional LSTM layer of 256 units for each direction. The resulting text encodings are accessed by the decoder through a location-sensitive attention mechanism, which takes attention history into account when computing a normalized weight vector for aggregation.
The autoregressive decoder network takes as input the aggregated byte encoding and is conditioned on a fixed speaker embedding for each speaker, which is essentially the language ID since our training data has only one speaker per language. Similar to Tacotron 2, we separately train a WaveRNN to invert mel spectrograms to a time-domain waveform.
|Exp.|Model|Training languages|EN|JA|ES|KO|
|---|---|---|---|---|---|---|
|C1|A2B, Random Init|EN+JA+ES|9.7|13.6|11.1|-|
|C2|A2B, Init From B2|EN+JA+ES|8.6|13.2|11.0|-|
|D1|A2B, Init From C2|EN+JA+ES+KO|8.4|13.4|11.3|26.0|
|B3|A2B, Larger Model|EN+JA|8.8|13.6|-|-|
|B4|A2B, Larger Model, LangVec|EN+JA|7.5|13.3|-|-|
|C3|A2B, Init From B4|EN+JA+ES|7.5|12.9|10.8|-|
|D2|A2B, Larger Model, LangVec|EN+JA+ES+KO|8.6|13.5|11.2|25.4|
|D3|A2B, Init From C3|EN+JA+ES+KO|7.0|12.8|10.8|25.0|
|D4|A2B, Init From D3|EN+JA+ES+KO|6.6|12.6|10.7|24.7|
4.1 Byte for ASR
[Table 2: per-language statistics of the training sets (utts (M), time (Kh)) and test sets (utts (K), time (h)).]
Our speech recognition experiments are conducted on a human-transcribed supervised training set consisting of speech from 4 different languages, namely English (EN), Japanese (JA), Spanish (ES) and Korean (KO). The total amount of data is around 76,000 hours and the language-specific information can be found in Table 2. These training utterances are anonymized and hand-transcribed, and are representative of Google’s voice search and dictation traffic. These utterances are further artificially corrupted using a room simulator, adding varying degrees of noise and reverberation such that the overall SNR is between 0dB and 30dB, with an average SNR of 12dB. The noise sources are from YouTube and daily life noisy environmental recordings. For each utterance, we generated 10 different noisy versions for training. For evaluation, we report results on language-specific test sets, each containing roughly 15K anonymized, hand-transcribed utterances from Google’s voice search traffic that do not overlap with the training data. This amounts to roughly 20 hours of test data per language. Details of each language dependent test set can be found in Table 2. We use word error rates (WERs) as the evaluation criterion for all the languages except for Japanese, where token error rates (TERs) are used to avoid the ambiguity of word segmentation.
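As a reminder of how such error rates are computed, the sketch below implements the standard edit-distance definition: substitutions, insertions and deletions divided by the number of reference tokens. It is illustrative only. The function names are hypothetical, and splitting on whitespace to obtain tokens is an assumption; the tokenization used for the Japanese TER is not specified in the text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                        # deletion
                curr[j - 1] + 1,                    # insertion
                prev[j - 1] + (ref_tok != hyp_tok)  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses):
    """Error rate in percent over a list of (reference, hypothesis) token lists."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 100.0 * errors / total


# Assumption: tokens are separated by whitespace.
refs = ["wi-fi オ ン".split()]
hyps = ["i-i オ ン".split()]
print(round(error_rate(refs, hyps), 1))  # 33.3
```

The same function yields a WER when the tokens are words and a TER when they are whatever token unit is chosen for Japanese.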
4.1.2 Language Dependent Systems
We first build language dependent A2B models to investigate the performance of byte-based language representations for ASR. For comparison, we also build corresponding Audio-to-Char (A2C) models that have the same model structure but output graphemes. For all four languages, the byte-output model always has a 256-dimensional softmax output layer. For the grapheme models, however, different grapheme vocabularies have to be used for different languages. The grapheme set is complete for English and Spanish as it contains all possible letters in each of the languages. However, for Japanese and Korean, we use the training data vocabularies which are K and K respectively. The corresponding test set grapheme OOV rates are 2.1% and 1.0%. With byte outputs, in contrast, there is no OOV problem for any language.
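The contrast between grapheme and byte vocabularies can be illustrated with a toy OOV computation. This is a sketch under stated assumptions: the function name is hypothetical, the two-sentence "corpus" is made up for illustration, and the OOV rate is counted over grapheme tokens, which the text does not specify.

```python
def grapheme_oov_rate(train_texts, test_texts):
    """Percentage of test graphemes that never occur in the training texts.

    Assumption: the rate is computed over grapheme tokens (not types).
    """
    vocab = set("".join(train_texts))
    test_graphemes = [ch for text in test_texts for ch in text]
    num_oov = sum(ch not in vocab for ch in test_graphemes)
    return 100.0 * num_oov / len(test_graphemes)


# Toy data, purely illustrative.
train = ["音声"]
test = ["音声認識"]
print(grapheme_oov_rate(train, test))  # 50.0

# With a byte vocabulary every test symbol is one of the 256 byte values,
# so nothing can be out of vocabulary.
test_bytes = [b for text in test for b in text.encode("utf-8")]
print(all(0 <= b < 256 for b in test_bytes))  # True
```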
Experimental results are presented as A1 for the A2C models and A2 for the A2B models in Table 1. The difference between grapheme and byte representations mainly lies in languages which use multi-byte characters, such as Japanese and Korean. Comparing A1 to A2, byte outputs give better results for Japanese and Korean. For languages with single-byte characters, namely English and Spanish, the two representations have exactly the same performance, as expected. Byte output requires the model to learn both the short-term within-grapheme byte dependencies and the long-term inter-grapheme or even inter-word/phrase dependencies, which is possibly a harder task than in grapheme-based systems. Nevertheless, the A2B model yields a 4.0% relative WER reduction on Japanese and 2.6% on Korean over the grapheme systems. It is interesting to see that even with the same model structure, we are able to get better performance with the byte representation.
4.1.3 Multilingual ASR Systems
In this experiment, we examine the effectiveness of byte-based models over graphemes for multilingual speech recognition. We first build a joint English and Japanese model by equally mixing the training data. For the grapheme system, we combine the grapheme vocabularies of English and Japanese, which leads to a large softmax layer. The A2B model uses the same structure except for the softmax layer, which is 256-dimensional. Although the model now needs to recognize two languages, we keep the model size the same as that of the language dependent ones. From Table 1, the multilingual byte system (B2) is better than the grapheme system (B1) on both English and Japanese test sets. However, its performance is worse than that of the language dependent ones, which we will address later in this work. For the following experiments, we continue with only the A2B models as they are better than A2C models.
To extend the model’s language coverage, e.g., to Spanish, one way is to start from a random initialization and train on all the training data. We equally mix the data from these three languages for training. The results are presented as C1 in Table 1. Due to the language independence of the byte representation, we can alternatively add a new language by simply training on new data. Hence, we reuse the B2 model and continue training with Spanish data. To avoid the model forgetting previous languages, namely English and Japanese, we also mix in those languages but with a slightly lower mixing ratio, using 3:3:4 for English, Japanese and Spanish. The results are presented as C2 in Table 1. With this method, the byte model not only trains faster but also achieves better performance than C1. Most importantly, C2 matches the performance of language dependent models on Japanese and is even slightly better for Spanish.
To add support for Korean, we simply continued the training of C2 with the new training data mixture. We use a ratio of 0.23:0.23:0.23:0.31, which is based on heuristics to balance the existing languages and use a higher ratio for the new language. We did not specifically tune the mixing ratio. The results (D1 in Table 1) show that we are able to get closer to the language dependent models except for English. Even though it is worse than the English-only model, D1 gives the best multilingual performance on English so far.
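One simple way to realize such mixing ratios is to draw the language of each training example from a categorical distribution, as sketched below. This is an assumption about the mechanism, not a description of the authors' data pipeline; the function name and the use of Python's `random.choices` are illustrative.

```python
import random
from collections import Counter


def sample_languages(mixing_ratio, num_examples, seed=0):
    """Draw the language of each training example according to the mixing ratio.

    mixing_ratio: dict mapping language code to an unnormalized weight.
    """
    rng = random.Random(seed)
    languages = list(mixing_ratio)
    weights = [mixing_ratio[lang] for lang in languages]
    return rng.choices(languages, weights=weights, k=num_examples)


# Ratios quoted in the text: 3:3:4 when adding Spanish,
# 0.23:0.23:0.23:0.31 when adding Korean.
add_spanish = {"EN": 3, "JA": 3, "ES": 4}
add_korean = {"EN": 0.23, "JA": 0.23, "ES": 0.23, "KO": 0.31}

for ratio in (add_spanish, add_korean):
    counts = Counter(sample_languages(ratio, num_examples=10000))
    print({lang: counts[lang] / 10000 for lang in ratio})
```

In both mixtures the newly added language receives the largest share, while the existing languages remain present to limit forgetting.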
To improve the performance of the multilingual systems, we first increase the number of decoder layers from 2 to 6 in consideration of the increased variation in byte sequences when mixing more languages. However, experimental results show that the larger model improves performance on English but degrades it on Japanese, potentially due to over-fitting (comparing B3 to B2). To address this problem, we feed the 1-hot language ID vector to all the layers in the A2B model. This enables the learning of language independent weight matrices together with language dependent biases that cater to the specific needs of each language. Experiment B4 shows a dramatic error reduction with this simple 1-hot vector compared to B3.
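The statement that a concatenated 1-hot language vector amounts to a language dependent bias can be checked numerically. The sketch below uses a single affine transform with made-up dimensions; it is not the A2B model itself, but the same identity holds for each affine transform inside an LSTM layer. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_langs, in_dim, out_dim = 4, 8, 5

# One weight matrix acting on the concatenation [features; 1-hot language ID].
W = rng.normal(size=(out_dim, in_dim + num_langs))
b = rng.normal(size=out_dim)


def affine_with_lang_id(x, lang):
    """Affine transform of the input concatenated with a 1-hot language vector."""
    one_hot = np.zeros(num_langs)
    one_hot[lang] = 1.0
    return W @ np.concatenate([x, one_hot]) + b


def affine_with_lang_bias(x, lang):
    """Equivalent form: shared weights plus a bias selected by the language."""
    W_shared = W[:, :in_dim]         # language independent weights
    lang_bias = W[:, in_dim + lang]  # column picked out by the 1-hot vector
    return W_shared @ x + (lang_bias + b)


x = rng.normal(size=in_dim)
print(np.allclose(affine_with_lang_id(x, 2), affine_with_lang_bias(x, 2)))  # True
```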
Similarly, to support the recognition of Spanish, we continue the training of B4 by mixing the languages at the ratio of 3:3:4, where more weight is given to the new language. This gives us the model C3, which outperforms the language dependent ones on both Japanese and Spanish. Furthermore, we add Korean in a similar way with the ratio of 0.3:0.15:0.15:0.4. This time, while making sure the ratio for the new language, Korean, is the highest, we also increase the ratio for English as we have more English training data. The model D3 wins over the language dependent models except for English. One hypothesis for the degradation on English is that, when mixing in other languages, the multilingual model sees less data from each language than the single language models do. To test this, we continue the training of D3 with an increased share of English data in the mixture, specifically a ratio of 2:1:1:1. We did not specifically tune the mixing ratios used for training. The final model D4 wins over all the language dependent ones, on average by 4.4% relative. For comparison, we include the results for a randomly initialized model with equal training data mixing ratio (D2), which is much worse.
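To quantify the gap between the randomly initialized model D2 and the progressively trained model D4, the relative error reduction can be computed directly from their rows in Table 1. The helper name below is hypothetical; the numbers are the D2 and D4 error rates reported above.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent: (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline


# Error rates (%) of D2 and D4 from Table 1.
d2 = {"EN": 8.6, "JA": 13.5, "ES": 11.2, "KO": 25.4}
d4 = {"EN": 6.6, "JA": 12.6, "ES": 10.7, "KO": 24.7}

for lang in d2:
    print(lang, round(relative_reduction(d2[lang], d4[lang]), 1))
```

This gives relative reductions of roughly 23.3% for English, 6.7% for Japanese, 4.5% for Spanish and 2.8% for Korean, with the largest gain on English.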
4.1.4 Error analysis
|Exp.|Model|TER|
|---|---|---|
|D4|A2B Larger Model, LangVec|21.3|
To further understand the gains of using bytes versus graphemes as language representations, we take Japanese for this study and compare the decoding hypotheses between A1 and A2. Interestingly, the A2B model wins over the A2C model mainly on English words in utterances with mixed English and Japanese. The Japanese test set was not specifically created to include code-switching utterances. The English words that appear in the Japanese test set are mostly proper nouns such as “Google”, “wi-fi”, “LAN” etc. In one such example, the A2B model generates the correct hypothesis “wi-fi オ ン” while the A2C model outputs “i-i オ ン”. Another example is “google 音 声 認 識”, which the A2B model recognizes correctly, whereas the A2C model drops the initial “g” and gives “oogle 音 声 認 識”.
One of the potential benefits of using byte-based models is for code-switching speech. Collecting such data is challenging, and the quality of artificially concatenated speech is far from that of real speech. In this study we use data filtered from the Japanese test set, keeping utterances whose transcript contains 5 or more consecutive English characters. These utterances mostly contain only a single English word in Japanese text. Out of the 17.6K utterances, we get 476 code-switching sentences, and we report the TERs on this subset in Table 3. With Japanese monolingual models (A1 and A2), our A2B model outperforms the A2C model by 38.6% relative. With English and Japanese multilingual models (B1 and B2), our A2B model wins over the A2C model by 4.2% relative. We also test system D4 on this code-switching data. However, because the 1-hot language vector used in D4 is utterance-level, its performance is worse than that of B2. Using frame-level or segment-level language information may address this problem, which will be explored in future work.
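The filtering rule for the code-switching subset can be sketched with a regular expression. This is one possible reading of the rule, not the authors' script: the function name is hypothetical, and "English characters" is assumed to mean ASCII letters, so hyphens and spaces break a run.

```python
import re

# Assumption: "English characters" means ASCII letters A-Z and a-z.
ENGLISH_RUN = re.compile(r"[A-Za-z]{5,}")


def is_code_switching(transcript):
    """True if the transcript contains 5 or more consecutive English letters."""
    return ENGLISH_RUN.search(transcript) is not None


transcripts = ["google 音 声 認 識", "wi-fi オ ン", "音 声 認 識"]
subset = [t for t in transcripts if is_code_switching(t)]
print(subset)  # ['google 音 声 認 識']
```

Under this reading, "wi-fi オ ン" is not selected because the hyphen splits it into two runs of two letters, which shows that the exact definition of a run affects which utterances end up in the subset.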
4.2 Byte for TTS
Text-to-speech models were trained on (1) 44 hours of North American English speech recorded by a female speaker; (2) 37 hours of Mandarin speech by a female speaker; (3) 44 hours of North American Spanish speech by a female speaker. For all compared models, we synthesize raw audio at 24 kHz in 16-bit format. We rely on crowdsourced Mean Opinion Score (MOS) evaluations based on subjective listening tests. All our MOS evaluations are aligned to the Absolute Category Rating scale, with rating scores from 1 to 5 in 0.5 point increments.
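A MOS with a 95% confidence interval can be computed from individual ratings as sketched below. The ratings are made up for illustration, the function name is hypothetical, and the normal-approximation interval is an assumption; the text does not state how the intervals were obtained.

```python
import math
import statistics

# Valid scores on the 1-to-5 scale in 0.5 point increments.
ALLOWED_SCORES = {1 + 0.5 * k for k in range(9)}


def mos_with_ci(ratings, z=1.96):
    """Return (mean opinion score, half-width of the 95% confidence interval).

    Assumption: a normal approximation, z * sample_std / sqrt(n).
    """
    if any(r not in ALLOWED_SCORES for r in ratings):
        raise ValueError("ratings must lie in 1..5 in 0.5 point steps")
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings, not taken from the evaluation.
ratings = [4.0, 4.5, 5.0, 3.5, 4.0, 4.5, 4.0, 4.5]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```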
4.2.2 Multilingual TTS System
Table 4: Speech naturalness Mean Opinion Score (MOS) with 95% confidence intervals across different languages and systems.
Table 4 compares the subjective naturalness MOS of the proposed model to that of the grapheme baseline for English, Mandarin and Spanish. The results indicate that the proposed multilingual B2A model is comparable to the state-of-the-art monolingual model (the MOS is worse than in prior work because we have OOVs in the test set). Moreover, we observed that the B2A model was able to read code-switching text. However, we do not have a good metric to evaluate the quality of code-switching for TTS; for example, the speech is fluent but the speaker changes with the language. Future work may explore how to evaluate TTS in code-switching scenarios and how to disentangle language and speaker given more training data.
In this paper, we investigated the use of Unicode bytes as a new language representation for both ASR and TTS. We proposed Audio-to-Byte (A2B) and Byte-to-Audio (B2A) as multilingual end-to-end models for ASR and TTS. The use of bytes allows us to build a single model for many languages without modifying the model structure for new ones. This brings representation sharing across graphemes, and is crucial for languages with large grapheme vocabularies, especially in multilingual processing. Our experiments show that byte models outperform grapheme models in both multilingual and monolingual settings. Moreover, our multilingual A2B model outperforms our monolingual baselines by 4.4% relative on average. The language independence of byte models provides a new perspective on the code-switching problem, where our multilingual A2B model achieves % relative improvement over our monolingual baselines. Finally, we also show that our multilingual B2A models match the performance of our monolingual baselines in TTS.
- [1] Tanja Schultz and Katrin Kirchhoff, Multilingual speech processing, Elsevier, 2006.
- [2] Hervé Bourlard, John Dines, Mathew Magimai-Doss, Philip N Garner, David Imseng, Petr Motlicek, Hui Liang, Lakshmi Saheer, and Fabio Valente, “Current trends in multilingual speech processing,” Sadhana, vol. 36, no. 5, pp. 885–915, 2011.
- [3] Mark JF Gales, Kate M Knill, and Anton Ragni, “Unicode-based graphemic systems for limited resource languages,” in ICASSP, 2015.
- [4] Stephan Kanthak and Hermann Ney, “Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition,” in ICASSP, 2002.
- [5] Mirjam Killer, Sebastian Stüker, and Tanja Schultz, “Grapheme based speech recognition,” in Eighth European Conference on Speech Communication and Technology, 2003.
- [6] Sebastian Stüker and Tanja Schultz, “A grapheme based speech recognition system for Russian,” in 9th Conference Speech and Computer, 2004.
- [7] Willem D Basson and Marelie H Davel, “Comparing grapheme-based and phoneme-based speech recognition for Afrikaans,” in PRASA, 2012.
- [8] Alex Graves and Navdeep Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in ICML, 2014.
- [9] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition,” in ICASSP, 2016.
- [10] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, “End-to-end attention-based large vocabulary speech recognition,” in ICASSP, 2016.
- [11] William Chan and Ian Lane, “On Online Attention-based Speech Recognition and Joint Mandarin Character-Pinyin Training,” in INTERSPEECH, 2016.
- [12] William Chan, Yu Zhang, Quoc Le, and Navdeep Jaitly, “Latent Sequence Decompositions,” in ICLR, 2017.
- [13] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani, “State-of-the-art Speech Recognition With Sequence-to-Sequence Models,” in ICASSP, 2018.
- [14] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” in INTERSPEECH, 2018.
- [15] K. Rao, R. Prabhavalkar, and H. Sak, “Exploring Architectures, Data and Units for Streaming End-to-End Speech Recognition with RNN-Transducer,” in ASRU, 2017.
- [16] Hagen Soltau, Hank Liao, and Hasim Sak, “Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition,” in arXiv:1610.09975, 2016.
- [17] Jinyu Li, Guoli Ye, Amit Das, Rui Zhao, and Yifan Gong, “Advancing Acoustic-to-Word CTC Model,” in ICASSP, 2018.
- [18] Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio, “Char2wav: End-to-end speech synthesis,” in ICLR: Workshop, 2017.
- [19] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., “Tacotron: A fully end-to-end text-to-speech synthesis model,” arXiv preprint, 2017.
- [20] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., “Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions,” in ICASSP, 2018.
- [21] Rohit Prabhavalkar, Kanishka Rao, Tara N Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly, “A comparison of sequence-to-sequence models for speech recognition,” in INTERSPEECH, 2017, pp. 939–943.
- [22] Dan Gillick, Cliff Brunk, Oriol Vinyals, and Amarnag Subramanya, “Multilingual language processing from bytes,” in NAACL, 2016.
- [23] Bo Li, Tara N Sainath, Khe Chai Sim, Michiel Bacchiani, Eugene Weinstein, Patrick Nguyen, Zhifeng Chen, Yonghui Wu, and Kanishka Rao, “Multi-dialect speech recognition with a single sequence-to-sequence model,” in ICASSP, 2018.
- [24] Shubham Toshniwal, Tara N Sainath, Ron J Weiss, Bo Li, Pedro Moreno, Eugene Weinstein, and Kanishka Rao, “Multilingual speech recognition with a single end-to-end model,” in ICASSP, 2018.
- [25] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
- [26] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” 2015.
- [27] Golan Pundak and Tara N Sainath, “Lower Frame Rate Neural Network Acoustic Models,” in INTERSPEECH, 2016.
- [28] Haşim Sak, Andrew Senior, Kanishka Rao, and Françoise Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” arXiv preprint arXiv:1507.06947, 2015.
- [29] Peter Auer, Code-switching in conversation: Language, interaction and identity, Routledge, 2013.
- [30] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu, “Efficient neural audio synthesis,” in ICML, 2018.
- [31] C. Kim, A. Misra, K. Chin, T. Hughes, A. Narayanan, T. N. Sainath, and M. Bacchiani, “Generation of Large-scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-field Speech Recognition in Google Home,” in INTERSPEECH, 2017.
- [32] ITU-T Rec. P.800, “Methods for subjective determination of transmission quality,” International Telecommunication Union, Geneva, 1996.