It has become a standard practice to build the vocabulary in neural machine translation (NMT) [1, 28] using byte-pair encoding (BPE). In this practice, BPE is applied at the level of characters rather than at the level of bytes, which is more common in data compression. We suspect this is because text is naturally represented as a sequence of characters, although byte-level representations of text have recently been shown to have advantages of their own, such as compactness (at most 256 possible values) and being agnostic to languages.
In this paper, we look into byte-level “subwords” that are used to tokenize text into variable-length byte n-grams, as opposed to character-level subwords, which represent text as a sequence of character n-grams. We specifically focus on byte-level BPE (BBPE), examining compact BBPE vocabularies in both bilingual and multilingual settings as well as in a novel setup of transfer learning to a new language with a non-overlapping character set.
Byte Level Text Representation
Encoding Byte-Level Representation
We consider the UTF-8 encoding of text, which encodes each Unicode character into 1 to 4 bytes. This allows us to model a sentence as a sequence of bytes instead of characters. While there are about 138K Unicode characters covering over 150 languages, we can represent a sentence in any of these languages as a sequence of UTF-8 bytes (248 out of 256 possible bytes).
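The 1-to-4-byte range can be checked directly (a minimal Python illustration; the sample characters are our own choice, covering Latin, accented Latin, CJK, and a supplementary-plane code point):

```python
# UTF-8 encodes each Unicode character into 1 to 4 bytes,
# depending on its code point. Illustrative sample characters:
for ch in ["A", "é", "中", "𝄞"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex()}")
```

Running this shows the four width classes: "A" takes 1 byte, "é" 2 bytes, "中" 3 bytes, and "𝄞" 4 bytes.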
A byte sequence representation of text is often much longer (up to 4x) than a character sequence representation, which makes it computationally demanding to use bytes as they are. As an alternative, we consider segmenting a byte sequence into variable-length n-grams (byte-level “subwords”). Specifically, we learn a BPE vocabulary on the byte-level representation, which extends the UTF-8 byte set with byte n-grams. We denote this type of vocabulary as B(yte-level)BPE in the rest of the paper. Figure 1 shows an example of BBPE tokenization.
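The construction can be sketched as an ordinary BPE merge loop run over bytes rather than characters. The following toy Python sketch is illustrative only (the actual vocabulary in this work is learned with a SentencePiece-based implementation; function and variable names here are our own):

```python
from collections import Counter

def learn_bbpe(corpus, num_merges):
    """Toy byte-level BPE: start from the observed UTF-8 bytes and
    repeatedly merge the most frequent adjacent symbol pair into a
    new byte n-gram. Returns the merge list and tokenized corpus."""
    # Represent every sentence as a tuple of single-byte symbols.
    seqs = [tuple(bytes([b]) for b in s.encode("utf-8")) for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b  # the new byte n-gram symbol
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(tuple(out))
        seqs = new_seqs
    return merges, seqs
```

Because merges operate on bytes, a resulting symbol may cover only part of a multi-byte character, or span a character boundary; concatenating the symbols of a sentence always reproduces its exact UTF-8 byte sequence.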
BBPE symbols can be partial characters shared by different characters, or combinations of complete and partial characters. This arbitrariness may necessitate incorporating a larger context surrounding each symbol for disambiguation and for learning character boundaries. In this work, we base our experiments on Transformer models. We propose to use either a depth-wise convolutional layer or a bidirectional recurrent layer with gated recurrent units (GRU) [6] to contextualize BBPE embeddings before feeding them into the model.
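The depth-wise convolutional option can be illustrated with a minimal numpy sketch: each embedding dimension is convolved independently over the token axis, so every position mixes in information from its neighbors. The kernel below is a fixed illustrative filter, whereas the actual layer is trained end-to-end; names are our own:

```python
import numpy as np

def depthwise_contextualize(emb, kernel):
    """Contextualize token embeddings with a depth-wise 1-D convolution.

    emb:    (seq_len, d_model) token embeddings
    kernel: (k, d_model) per-channel filter taps; depth-wise means
            each embedding dimension is convolved on its own.
    """
    k, d = kernel.shape
    pad = k // 2
    padded = np.pad(emb, ((pad, pad), (0, 0)))  # same-length output
    out = np.empty_like(emb)
    for t in range(emb.shape[0]):
        # Output position t is a per-channel weighted sum over a
        # k-token window centered (for odd k) at position t.
        out[t] = (padded[t:t + k] * kernel).sum(axis=0)
    return out
```

With a kernel whose only nonzero tap is the center one, the layer reduces to the identity, which is a convenient sanity check; a trained kernel instead learns how much of each neighbor to blend into a byte-level token's embedding.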
Decoding with Byte-Level Subwords
While any sentence can be represented as a byte sequence, the converse is not necessarily true: there are byte sequences that do not translate to valid character sequences. In our system, we try to recover as many Unicode characters as possible from the potentially corrupted byte sequence in the model output. For a given byte sequence (b_1, ..., b_N), we denote the maximum number of characters that we can recover from its first k bytes as f(k). Then f(k) has optimal substructure and can be solved by dynamic programming:

    f(k) = max_{t ∈ {1, 2, 3, 4}} f(k − t) + g(k − t + 1, k),    (1)

where g(i, j) = 1 if (b_i, ..., b_j) corresponds to a valid character and g(i, j) = 0 otherwise. While f(k) is calculated recursively, we also record the selection made at each position so that we can recover the solution through backtracking. The design of UTF-8 encoding ensures the uniqueness of this recovery process: for a character UTF-8-encoded with multiple bytes, its trailing bytes do not form a valid UTF-8-encoded character on their own. Hence the best selection in Eq. 1 is unique, and so is the final solution.
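The dynamic program above can be sketched in a few lines of Python, using the standard library's UTF-8 decoder as the validity check g (a minimal illustration; names are our own):

```python
def recover_characters(byte_seq):
    """Recover the maximum number of valid characters from a possibly
    corrupted UTF-8 byte sequence, following Eq. 1:
        f(k) = max_{t in 1..4} f(k - t) + g(k - t + 1, k)
    where g scores a byte span 1 if it is a single valid character
    and 0 otherwise, so invalid bytes are skipped at zero gain."""
    def g(i, k):
        # 1 if byte_seq[i:k] decodes to exactly one character.
        try:
            return 1 if len(bytes(byte_seq[i:k]).decode("utf-8")) == 1 else 0
        except UnicodeDecodeError:
            return 0

    n = len(byte_seq)
    f = [0] * (n + 1)       # f[k]: max characters among the first k bytes
    choice = [1] * (n + 1)  # span length t chosen at each position
    for k in range(1, n + 1):
        for t in (1, 2, 3, 4):
            if t > k:
                break
            score = f[k - t] + g(k - t, k)
            if score > f[k]:
                f[k], choice[k] = score, t
    # Backtrack through the recorded selections to emit the characters.
    chars, k = [], n
    while k > 0:
        t = choice[k]
        if g(k - t, k):
            chars.append(bytes(byte_seq[k - t:k]).decode("utf-8"))
        k -= t
    return "".join(reversed(chars)), f[n]
```

For example, deleting the continuation byte of a two-byte character leaves its lead byte unrecoverable; the dynamic program skips that single byte and keeps every remaining character.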
Datasets

We run experiments on three bilingual corpora as well as a many-to-English multilingual dataset:
Sinhala-English (Si-En): we use the data from FLoRes.
Many-to-English (X-En): we adopt the TED Talks corpus, which includes parallel data for 59 languages. For our experiments, we use English as the target and the other 58 languages as sources. We sample 22K examples from the 135K-example development set for validation.
Models and Learning
(assuming a 4K vocabulary in parameter counting). We set attention and ReLU dropout to 0.1, except for Si-En, for which we use 0.2. We use 0.2 residual dropout for models in X-En.
Inference and Evaluation
Results and Analysis
Qualitative Comparison: BPE vs. BBPE
Symbol Frequency Distribution
Since the construction of the BBPE vocabulary starts from the UTF-8 byte set, it has the flexibility of decomposing rare characters into byte n-grams from the vocabulary instead of including them directly. This frees vocabulary slots for other frequent symbols. Figure 2 compares the symbol frequency distributions of BPE and BBPE. We can see that BBPE symbols are more evenly distributed than BPE ones, even though the latter are already much more evenly distributed than pure characters. By setting different BBPE vocabulary sizes, we can control the level of rare-character decomposition and of symbol sharing across different characters. Table 3 shows the ratio of BBPE tokens containing partial characters. We can see that a large portion of rare characters are decomposed on Ja-En and X-En, which have large character sets of 8K and 11K characters, respectively. Figure 5 provides an example from Ja-En tokenized with different BBPE vocabularies, where we can see what tokens look like as the tokenization granularity goes from fine to coarse.
In the multilingual setting, symbol sharing also happens across different languages despite the different writing systems. This allows maximizing parameter sharing not only for the model part but also the vocabulary part in a universal model. Figure 3 illustrates the level of BBPE symbol sharing across the top 5 languages (by number of train examples) in X-En whose writing systems are different from each other.
Impact on Sequence Lengths
Compared to BPE, BBPE symbols are generally finer-grained, with shorter byte-level lengths, which results in longer tokenized sequences as well as longer training and inference times. BBPE, however, is optimized for a compression-based objective (the same as BPE) and is still more efficient than a character vocabulary. Table 4 lists the average lengths of training sentences tokenized with different vocabularies. We observe that sentences tokenized with BBPE are significantly shorter than character-tokenized ones, even when the BBPE vocabulary is only a fraction of the character set size (as on X-En). Another observation is that the source-target length ratio for BBPE tends to be much larger when the source and target character sets have very different sizes (for example, 11K on the X-En source side versus 0.1K on the target side), and this effect becomes more pronounced as the BBPE vocabulary size increases. In this case, alignments may be more difficult to learn during model training, since target tokens more often need to attend to multiple source tokens.
Importance of Contextualization
We compare three ways of contextualizing token embeddings (none, a 1-layer convolution, and a 1-layer bidirectional GRU) on X-En. We observe from Figure 4 that all kinds of vocabularies can benefit from embedding contextualization, and the performance gains are most significant on fine-grained vocabularies: byte, character and BBPE. For BBPE, the long-range contextual information from the Bi-GRU brings consistent gains on validation BLEU in all cases. Encoding context into the token embeddings reduces the difficulty of learning attention over multiple source tokens and makes model training easier. In the following experiments, we contextualize BBPE with a Bi-GRU by default. We denote (B)BPE with a Bi-GRU as “(B)BPE size+” and the variant without contextualization as “(B)BPE size”, and we similarly define “Byte+” and “Char+”.
BBPE on Noisy Character Sets
The En-De training set has quite a few noisy sentence pairs, often containing non-Latin characters due to misalignment and code-switched sentences. This leads to a 3.4K character set, while English and German each use fewer than 30 letters. Since BPE includes all characters, these rare characters waste a significant number of BPE vocabulary slots. For comparison, we try small BBPE 2K and 4K vocabularies, from which rare characters are excluded. We find that their performance is comparable to the BPE 32K baseline while requiring smaller model capacity (see Table 5).
BBPE on Character-Rich Languages
Languages using logographic writing systems, such as Chinese and Japanese, can have over 50K characters, of which only a small portion are frequently used. Our Ja-En dataset has a set of 7.9K characters, and the top 2.4K characters cover 99.99% of the tokens in the training set. Given this observation, we experiment with BBPE 4K, a vocabulary roughly half the size of the character set. We find that BBPE is comparable to BPE and even outperforms it when using a larger model (see Table 6).
| # of train samples | 440K | 223K | 2.8M | 3.5M |
| # of test samples | 1.2K | 8.5K | 2K | 11.7K |
| Michel et al. (2018) | 20.77 | 13.25 | 18.00 | - |
BBPE on Many-to-En Translation
Our many-to-English dataset contains 58 languages (each paired with English) and 10.8K characters from different writing systems, between which characters are not necessarily shared. The characters do, however, share byte n-grams. We experiment with BBPE 2K and 4K, which are a fraction of the size of the baseline BPE vocabulary. As shown in Table 7, both beat the BPE baseline on overall BLEU as well as on most of the individual languages, both high- and low-resource (note that the test set contains 165K examples, so even small BLEU gaps may indicate significant differences). We also notice that the byte and character models perform significantly better than all BPE and BBPE models in this multilingual setting. This might be because BBPE and BPE suffer from imbalanced source and target sentence lengths as well as varying token granularities across multilingual parallel sentences (sources in different languages and at different granularities map into the same targets). Nonetheless, BBPE is still the most practical solution, since it strikes a good balance between performance (better BLEU than BPE) and speed (much shorter tokenized sentences than characters or bytes).
| # of train examples | 213K | 167K | 211K | 203K | 5.9K | 4.5K | 10K | 61K | 5.1M |
| # of test examples | 6K | 4.5K | 5.5K | 5.6K | 0.9K | 0.7K | 1K | 2.4K | 165K |
| Aharoni et al. (2019) | 25.93 | 28.87 | 30.19 | 32.42 |
| Neubig & Hu (2018) | 11.7 | 18.3 | 29.1 | 28.3 |
| BBPE 4K+ | X-En | enc, dec | 8.1 |
| BBPE 4K+ | X-En | embed, enc | 9.0 |
Transfer Learning on Unseen Characters
Because the BBPE vocabulary contains all UTF-8 bytes and has no out-of-vocabulary tokens, BBPE-based models can be transferred between languages with non-overlapping character sets. In comparison, this is impossible with character-based vocabularies without replacing the vocabulary and re-training the embeddings from scratch. Our Si-En dataset has 77 Sinhala characters that are disjoint from the X-En character set. We experiment with transferring a BBPE 4K model pretrained on X-En to this dataset while reusing the original vocabulary. As shown in Table 8, the transferred model gains 0.9-1.8 BLEU points over the baselines, suggesting the generality of pretrained BBPE embeddings and their ability to adapt to different languages with unseen characters. This transfer learning paradigm is free from out-of-vocabulary limitations and can be very generic: we show the extreme case of a completely unseen character set, but a pretrained model may also be transferred to any language or dataset to improve performance or to warm-start training and save time.
Machine Translation with Small Granularity
Previous works have shown that, instead of dealing with a huge word-based vocabulary, working at smaller granularity with a reduced vocabulary size consistently benefits the training and inference of machine translation. For example, methods based on morpheme segmentation [21, 17] have been explored in machine translation. Sennrich et al. proposed using byte-pair encoding (BPE) to segment words into subword units in an unsupervised manner. SentencePiece [14] followed similar ideas, with the extension of training directly from raw sentences, which makes it possible to build a purely end-to-end system that does not depend on language-specific pre-processing or post-processing. Our byte-level subwords are largely based on the implementation of SentencePiece, but we use bytes as the basic units from which subwords are composed.
Existing works have also explored fully character-level models for machine translation. Early work proposed building word representations from their characters; later work removed the restriction of word boundaries and learned to decode directly at the character level; this was further extended to a fully character-level NMT model working in a multilingual setting; and it has been shown that, given a translation model with enough capacity, character-level models generally outperform other subword-based models. Following a similar line, we also investigate fully byte-level models in our experiments.
The closest work to ours is the byte-level BPE vocabulary used in GPT-2, a large-scale English language model. They, however, rely heavily on hard-coded merging rules and do not analyze how byte-level BPE impacts the quality of language modeling. Vocabularies consisting purely of bytes have previously been used in several natural language processing tasks: part-of-speech tagging and named entity recognition, translation, machine reading, and speech recognition.
Transformer with Convolution or RNN
We proposed BBPE, which builds a byte-level subword vocabulary for machine translation. It results in a much more compact vocabulary than character-based ones without loss of performance; in multilingual settings, BBPE often even outperforms character-based vocabularies. BBPE has no out-of-vocabulary tokens, allowing us to transfer a model using BBPE between languages with non-overlapping vocabularies. This transfer learning paradigm is very generic and can be applied to any language or dataset for performance gains or training acceleration. With the same vocabulary size, BBPE segments sentences into shorter sequences than character-based methods do, leading to faster training and inference. Our future work includes: eliminating source-target sentence length imbalance; evaluating BBPE in one-to-many and many-to-many translation settings; and exploring better segmentation algorithms for byte-level subwords.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: Introduction.
-  (2012) Wit3: web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation, pp. 261–268. Cited by: 2nd item.
-  (2018) The best of both worlds: combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849. Cited by: Transformer with Convolution or RNN.
-  (2019) Language models with transformers. In ArXiv e-prints, Cited by: Transformer with Convolution or RNN.
-  (2018) Revisiting character-based neural machine translation with capacity and compression. arXiv preprint arXiv:1808.09943. Cited by: Machine Translation with Small Granularity.
-  (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Cited by: Encoding Byte-Level Representation.
-  (2016) A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147. Cited by: Machine Translation with Small Granularity.
-  (2017) Byte-based neural machine translation. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, Cited by: Byte-based vocabulary.
-  (2016) Multilingual language processing from bytes. In Proceedings of NAACL-HLT, Cited by: Byte-based vocabulary.
-  (2019) Two new evaluation datasets for low-resource machine translation: Nepali-English and Sinhala-English. Cited by: 3rd item, Table 8.
-  (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059. Cited by: Encoding Byte-Level Representation.
-  (2018) Byte-level machine reading across morphologically varied languages. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Byte-based vocabulary.
-  (2016) Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: Machine Translation with Small Granularity.
-  (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71. Cited by: Datasets, Machine Translation with Small Granularity.
-  (2017) Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics 5, pp. 365–378. Cited by: Machine Translation with Small Granularity.
-  (2019) Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: Byte-based vocabulary, Transformer with Convolution or RNN.
-  (2013) Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113. Cited by: Machine Translation with Small Granularity.
-  (2018) MTNT: a testbed for machine translation of noisy text. arXiv preprint arXiv:1809.00388. Cited by: 2nd item.
-  (2019) Transformers with convolutional context for asr. arXiv preprint arXiv:1904.11660. Cited by: Transformer with Convolution or RNN.
-  (2011) The Kyoto free translation task. Note: http://www.phontron.com/kftt Cited by: 2nd item.
-  (2000) Improving smt quality with morpho-syntactic analysis. In Proceedings of the 18th conference on Computational linguistics-Volume 2, pp. 1081–1085. Cited by: Machine Translation with Small Granularity.
-  (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: Models and Learning.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: Inference and Evaluation.
-  (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191. External Links: Cited by: Inference and Evaluation.
-  (2017-10) JESC: Japanese-English Subtitle Corpus. ArXiv e-prints. External Links: Cited by: 2nd item.
-  (2019) Language models are unsupervised multitask learners. Cited by: Byte-based vocabulary.
-  (2015) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Cited by: Introduction, Machine Translation with Small Granularity.
-  (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: Introduction.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: Encoding Byte-Level Representation, 1st item, Models and Learning, Table 5.
-  (2018) When and why are pre-trained word embeddings useful for neural machine translation. In HLT-NAACL, Cited by: 4th item.