
Bytes are All You Need: End-to-End Multilingual Speech Recognition and Synthesis with Bytes

by Bo Li, et al.

We present two end-to-end models, Audio-to-Byte (A2B) and Byte-to-Audio (B2A), for multilingual speech recognition and synthesis. Prior work has predominantly used characters, sub-words or words as the unit of choice to model text. These units are difficult to scale to languages with large vocabularies, particularly in the case of multilingual processing. In this work, we model text as a sequence of Unicode bytes, specifically the variable-length UTF-8 byte sequence for each character. Bytes allow us to avoid large softmaxes in languages with large vocabularies and to share representations in multilingual models. We show that bytes are superior to grapheme characters over a wide variety of languages in monolingual end-to-end speech recognition. Additionally, our multilingual byte model outperforms each respective single-language baseline by 4.4% relative on average. On code-switching speech, our multilingual byte model outperforms our monolingual baseline by 38.6% relative. Finally, we present a speech synthesis model using byte representations, which matches the performance of our monolingual baselines.
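The byte representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the helper names `text_to_byte_ids` and `byte_ids_to_text` are hypothetical, and the use of `errors="replace"` to handle invalid byte sequences is an assumption made for illustration only. The sketch shows that any text maps to IDs in the range 0 to 255, so the output vocabulary stays at 256 symbols regardless of language, while the sequence length per character varies.

```python
# Illustrative sketch only; helper names are hypothetical, not from the paper.

def text_to_byte_ids(text: str) -> list[int]:
    """Map text to its UTF-8 byte values (each in 0..255)."""
    return list(text.encode("utf-8"))


def byte_ids_to_text(ids: list[int]) -> str:
    """Map byte values back to text.

    Assumption: invalid or truncated sequences are replaced with U+FFFD
    instead of raising, since a model may emit an incomplete character.
    """
    return bytes(ids).decode("utf-8", errors="replace")


if __name__ == "__main__":
    for sample in ["speech", "héllo", "音声"]:
        ids = text_to_byte_ids(sample)
        # Characters outside ASCII expand to 2-4 bytes each.
        print(f"{sample!r}: {len(sample)} chars -> {len(ids)} bytes {ids}")
```

The trade-off is that byte sequences are longer than character sequences for non-ASCII scripts, in exchange for a small, fixed output layer shared across languages.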



