Most state-of-the-art neural machine translation (NMT) systems operate on the subword level, typically using a preprocessing technique such as Byte-Pair Encoding (BPE; Sennrich et al., 2016) to combine characters into subwords. However, a subword representation often masks important character-level information, from the syntactic and morphological relatedness of words (e.g., “bake” and “bakes” may each be assigned their own token) to devices such as rhyme and alliteration. Although the byte-level pretrained language model ByT5 (Xue et al., 2021) was more robust to misspellings than its subword counterpart T5 (Raffel et al., 2019), results on common benchmarks such as GLUE did not reflect this inherent advantage. Similarly, in NMT, we have not yet seen advantages from character-level models reflected in their BLEU, chrF, or COMET scores (Libovický et al., 2021).
The major inhibitor for character-level models is inefficiency, due to their much longer input and output sequences. For example, the average number of characters per subword in English is around 4 (Xue et al., 2021), so the input sequence to a character-level model is effectively 4 times longer. With a Transformer model, the problem is compounded by the quadratic complexity of self-attention. To address this, models such as the Charformer (Tay et al., 2021) introduce a downsampling method prior to the Transformer, which combines characters into pseudo-words, reducing the length of the sequence. The downsampling method used in the Charformer, GBST, was originally intended only for use in the encoder, and recent work attempting to apply the GBST layer to the decoder has failed (Libovický et al., 2021). (In consultation with the authors, we noted that their reported results using GBST in decoding in fact used a different method.) Our contributions include:
- We show that there is an information leak in the GBST layer which breaks the typical NMT training scheme of a Transformer model.
- We resolve the information leak, allowing the GBST layer to be used in a Transformer decoder.
- We provide a simple test to check for information leaks in tasks which exhibit causality.
- We show that despite the Charformer’s current popularity, the GBST layer does not perform as well as earlier methods such as that of Lee et al. (2017) for NMT.
- We give some evidence that character-level models are better for morphologically rich languages.
2 Patching Information Leaks
The Charformer modifies ByT5 with the addition of the gradient-based subword tokenization (GBST) layer. This layer mixes character representations (which are also informed of their relative positions via a convolution) using a simple mean computed over character n-grams up to length 4 (for decoding, we include n-grams up to the length of the downsampling factor). The model then selects a weighted average of these representations before downsampling via a block-wise mean pooling. The result is a sequence of pseudo-word-level embeddings that is fed directly into the Transformer (in contrast with the standard sequence of subword embeddings).
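As a concrete, simplified sketch of this mixing-and-pooling step, the NumPy code below replaces each character vector with the mean of the length-n segment containing it, averages the n-gram views uniformly (the real GBST layer learns position-wise weights over them, which we omit), and then mean-pools each block of `ds` characters; all names here are our own illustration, not the original implementation.

```python
import numpy as np

def ngram_block_means(x, n):
    # replace each position with the mean of the length-n segment containing it
    out = np.zeros_like(x)
    for start in range(0, x.shape[0], n):
        out[start:start + n] = x[start:start + n].mean(axis=0)
    return out

def gbst_downsample(x, ds=4, max_n=4):
    # simplified GBST: average the n-gram views uniformly (the real layer
    # learns a weighting over them), then mean-pool each block of ds characters
    views = [ngram_block_means(x, n) for n in range(1, max_n + 1)]
    mixed = np.mean(views, axis=0)
    T, d = x.shape
    return mixed[: T - T % ds].reshape(-1, ds, d).mean(axis=1)
```

With a sequence of 8 character embeddings and `ds=4`, the output is 2 pseudo-word embeddings.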
The GBST layer can be applied directly to a Transformer encoder without issue, but it cannot be applied to a Transformer decoder for generative tasks. This is due to an information leak, where information about characters in future blocks can end up in prior blocks. This occurs for two reasons: the convolution used to encode position and the character n-gram means can both overlap the block-wise separations.
Concerning the convolution, each character is informed by its neighboring characters, which is necessary for the convolution to serve as a positional embedding. However, this creates the issue that characters on the right side of a block will be informed by characters at the start of the following block.
The character n-gram averaging similarly can obtain information from future blocks. As seen in Figure 1, with a downsampling factor of 4, the issue occurs with trigrams, where the 4th position is averaged with the 5th and 6th, despite being in separate blocks. Therefore, when the Transformer must predict the block containing the 5th and 6th characters, it has already received information about them, and thus can learn to simply copy the characters.
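This boundary-crossing condition is easy to enumerate. The helper below (our own illustration, not from the Charformer code) lists, for a given n-gram length and downsampling factor, the positions whose segment mean includes characters from a later block:

```python
def leaking_positions(T, ds, n):
    # positions (0-indexed) whose length-n segment extends past their ds-block
    leaks = []
    for start in range(0, T, n):
        seg_end = min(start + n, T)
        for t in range(start, seg_end):
            block_end = (t // ds + 1) * ds  # first position after t's block
            if seg_end > block_end:
                leaks.append(t)
    return leaks
```

For T=8 and ds=4, trigrams leak at position 3 (the 4th character), while bigrams and 4-grams, whose segments align with the block boundaries, do not.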
2.1 A Simple Test for Information Leaks
To confirm that there is a leak in the GBST layer, we set up a simple model consisting of a GBST layer followed by an upsampling layer: a linear layer that takes the downsampled block representation and upsamples it back to characters. We then train the model to predict a sequence of random characters, conditioned on the prior characters in the sequence, using a left-padding of BOS tokens equal to the downsampling factor (e.g., ‘[BOS] [BOS] [BOS] a b c’ is used to predict ‘a b c d e f’). If there is no information leak, the model should achieve near-random accuracy; conversely, if the accuracy is significantly higher than random, there is an information leak.
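The input/target construction for this probe can be sketched as follows (a minimal illustration; the token representation and BOS value are placeholders):

```python
def make_probe_batch(seq, ds, bos=0):
    # shift the sequence right by one block of BOS tokens: if any character
    # of the block being predicted is visible to the model, it can only be
    # via a leak through the downsampling layer
    inp = [bos] * ds + seq[:-ds]
    return inp, seq
```

For ds=3, the sequence ‘a b c d e f’ yields input ‘[BOS] [BOS] [BOS] a b c’ and target ‘a b c d e f’.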
We use a vocabulary of size 100, so the accuracy of a perfectly random classifier should be around 1%; however, in several cases the models perform significantly better. In the worst case, with a downsampling factor of 4 and the convolutional positional embeddings, 75% of the tokens are leaked.
To resolve this issue, we considered three potential solutions: adding more padding, removing leaking n-grams, or applying a mask to the n-grams.
Approach 1: Additional Padding.
The simplest approach is to pad the sequences enough that leaking is impossible. Preliminary experiments indicated that padding with 2 times the downsampling factor prevented any leak. However, this approach implies that predictions are no longer conditioned on the block one step back, but rather on the block two steps back.
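A minimal sketch of this padding scheme (BOS value and token handling are placeholders of our own):

```python
def make_double_padded_input(seq, ds, bos=0):
    # Approach 1: shift by two blocks of BOS, so neither the positional
    # convolution nor any n-gram mean can reach the block being predicted
    return [bos] * (2 * ds) + seq[: -2 * ds]
```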
Approach 2: Remove Overlapping N-Grams.
Our second approach instead addresses the sources of information leak. First, the convolution used as a positional embedding is replaced with a static sinusoidal positional embedding, as in Vaswani et al. (2017). Second, the means for n-grams where there is overlap are removed. For example, for a downsampling factor of 4, the means for trigrams are removed. These changes prevent information leak, but it is possible that the removal of trigrams or other n-grams will substantially worsen the model’s ability to learn contextual character embeddings before downsampling into blocks.
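Which n-gram orders must be dropped follows directly from the block geometry: a length-n segmentation never crosses a block boundary exactly when n divides the downsampling factor. A small helper (our formulation, not from the original code) makes this concrete:

```python
def safe_ngram_orders(ds, max_n=4):
    # n-gram lengths whose segments always align with ds-sized blocks
    return [n for n in range(1, max_n + 1) if ds % n == 0]
```

For ds=4 this keeps unigrams, bigrams, and 4-grams and drops trigrams, as described above.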
Approach 3: Apply Causal Mask.
Our third approach builds on our second approach by keeping the sinusoidal positional embeddings and, rather than removing problematic n-grams, applying a causal mask when computing the mean. This can be seen in Figure 2, where overlapping n-grams are split by block, with the right side being informed by the left, but the left side being uninformed by the right. This approach allows trigram information to remain when it does not overlap a block boundary; however, the information present in the blocks is then not always consistent, which may be difficult for the model to discern.
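A sketch of the masked mean (NumPy, illustrative only): within each length-n segment, a position averages only over segment members up to the end of its own block, so earlier blocks inform later ones but never the reverse.

```python
import numpy as np

def masked_ngram_block_means(x, n, ds):
    # causal variant of the segment mean: truncate each segment at the
    # boundary of the block that position t belongs to
    out = np.zeros_like(x)
    T = x.shape[0]
    for start in range(0, T, n):
        seg_end = min(start + n, T)
        for t in range(start, seg_end):
            block_end = (t // ds + 1) * ds
            out[t] = x[start:min(seg_end, block_end)].mean(axis=0)
    return out
```

For ds=4 and trigrams, position 3 now sees only itself (a duplicate of its unigram representation), while positions 4 and 5 still see the full segment; this duplication is the inconsistency noted above.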
3 Experimental Setup
We test our models on English translation to and from Arabic and German, as well as Turkish, whose vowel harmony and agglutination should benefit from character-level processing. Details of the datasets, preprocessing, and hyperparameters can be found in Appendix B.
Lee et al. (2017) found a performant method for character-level MT using a downsampling method based on causal convolutions, followed by a max-pooling layer. Their original MT model used an RNN, and Libovický et al. (2021) modernized the model to use a Transformer, keeping the convolution-based downsampling. It notably performed best in Libovický et al. (2021)’s experiments on various downsampling methods. We likewise use the Transformer version in our experiments.
For decoding with Lee et al.’s and the Charformer downsampling methods, we apply the two-step decoder of Libovický et al. (2021), which adds an LSTM layer to the head of the Transformer; the LSTM receives the Transformer’s hidden states in addition to character embeddings of the output sequence generated so far, and outputs the next characters. The parameters used for Lee et al.’s method follow Libovický et al. (2021).
We compare the performance of our modified GBST layers in Table 2. First, we observe that GBST with additional padding, like the non-causal GBST, fails to learn any form of translation. This indicates that predicting two blocks into the future is too difficult for the model to make any meaningful progress in learning to translate.
We also see that the simpler approach of dropping overlapping n-grams works better than applying a causal mask. We suspect this is due to a lack of consistency within the masked n-gram representations, as some are simply duplicates of the lower order n-gram representations.
The downsampling factor also does not change the relative ordering of the methods. Although the average length of a subword in English and German is close to 4 characters, making a downsampling factor of 4 an intuitive choice, a higher downsampling factor leads to worse performance.
Despite the modified GBST with overlapping n-grams removed (henceforth r-GBST) being the best-performing modification, Table 3 shows that it is still outperformed by the downsampling method of Lee et al. (2017). The two methods operate similarly: both mix characters, focusing only on neighboring characters to limit computational complexity. While GBST achieves this by averaging across unigrams to 4-grams, Lee et al. use convolutions with differing kernel sizes. The mixing via convolution is more uniform in nature; for example, the convolution with kernel size 3 is analogous to the trigram mixing, but the convolution does not have a hard boundary after every 3rd character. This may be the reason for the superior performance of Lee et al.’s method.
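The contrast can be illustrated with plain means (a simplification of our own: Lee et al.’s convolutions use learned weights, and causal variants would only look left):

```python
import numpy as np

def sliding_mean(x, k=3):
    # convolution-style mixing: a window of k characters around each
    # position, with no hard boundary every k characters
    T = x.shape[0]
    out = np.zeros_like(x)
    for t in range(T):
        lo, hi = max(0, t - k // 2), min(T, t + k // 2 + 1)
        out[t] = x[lo:hi].mean(axis=0)
    return out

def segment_mean(x, n=3):
    # GBST-style trigram mixing: hard segment boundary after every n characters
    out = np.zeros_like(x)
    for start in range(0, x.shape[0], n):
        out[start:start + n] = x[start:start + n].mean(axis=0)
    return out
```

With a window of 3, position 2 mixes characters 1–3 under the sliding mean but is locked to characters 0–2 under the segment mean.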
These results raise the question of the significance of the Charformer’s performance for any NLP task. If Lee et al.’s method were used in the same pretraining setup as Tay et al. (2021), would we perhaps see superior performance?
Looking at the results with respect to language pairs, English–Turkish seems most promising for character-level models. The character-level model with no downsampling outperforms the subword-level model in both directions, Lee et al.’s method is competitive, and r-GBST is somewhat competitive into English. This could indicate that models with access to character-level information benefit when translating both to and from morphologically rich languages such as Turkish, although more extensive testing on other morphologically rich languages is needed.
Both downsampling methods perform worse than using no downsampling, but there is a trade-off in training time. In Table 4, we show the time it takes to train and, post-training, to evaluate on the test set. We include character- and subword-level models that use the two-step decoding method, in order to separate the effect of downsampling from the decoding method (the performance of the two-step character and subword models is similar to their normal counterparts). We can see that our downsampling methods are faster than the character model for both training and generation. However, the subword model is still the fastest and achieves the best performance.
Character-level or byte-level models are intuitive for a variety of reasons but are slower due to much longer sequences. Downsampling methods such as the GBST layer in the Charformer looked promising, but GBST is not usable for generative tasks without modification, due to information leaking from future blocks. With modification, the GBST layer does not perform as well as older methods such as that of Lee et al. (2017), although it is faster in both training and evaluation.
Despite the intuitiveness of downsampling from characters to pseudo-words, we see a clear trade-off between performance and time: performance decreases by several BLEU points, but training and generation time is reduced to as little as 50% of that of the non-downsampled model.
Although neither downsampling method tested reaches the performance of the standard character-level model, subword-level models show that shortening the sequence can lead to appreciably faster models without sacrificing performance. Some form of downsampling is therefore beneficial, and our method for finding information leaks can serve as a useful debugging tool.
Finally, the results for English–Turkish show a positive effect of character-level information for translation of morphologically rich languages.
- Kingma, D. P. and Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kudo, T. and Richardson, J. (2018). SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- Lee, J., Cho, K., and Hofmann, T. (2017). Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics, 5:365–378.
- Libovický, J., Schmid, H., and Fraser, A. (2021). Why don’t people use character-level machine translation? arXiv preprint arXiv:2110.08191.
- Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, Philadelphia, Pennsylvania, USA.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2019). Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Rei, R., Stewart, C., Farinha, A. C., and Lavie, A. (2020). COMET: a neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2685–2702, Online.
- Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725, Berlin, Germany.
- Tay, Y., Tran, V. Q., Ruder, S., Gupta, J., Chung, H. W., Bahri, D., Qin, Z., Baumgartner, S., Yu, C., and Metzler, D. (2021). Charformer: fast character transformers via gradient-based subword tokenization. arXiv preprint arXiv:2106.12672.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. (2021). ByT5: towards a token-free future with pre-trained byte-to-byte models. arXiv preprint arXiv:2105.13626.
Appendix A Test for Information Leaks
Our test for information leaks in a downsampling method trains for 5000 iterations with batches of 32. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4. These numbers were determined empirically, based on the degree of separation between the accuracies of leaking versus non-leaking tokens. The accuracies in Table 1 are computed over 100 batches, or 3200 samples.
Appendix B Main Experimental Details
Our experiments use the IWSLT2017 data (https://sites.google.com/site/iwsltevaluation2017/TED-tasks) for English–German and English–Arabic, using the test sets from 2010 and 2015 for validation and testing, respectively. For English–Turkish, we use the SETIMES2 dataset (https://opus.nlpl.eu/SETIMES2.php) and WMT’s newstest2018 (http://data.statmt.org/wmt18/translation-task/dev.tgz) for validation and testing.
We remove sentence pairs where the English sentence is longer than 256 characters from the training data. To keep our vocabulary size consistent across all languages, we tokenize according to UTF-8 bytes rather than using a character vocabulary (operating on the byte level also follows Tay et al. (2021), despite the name “Charformer” perhaps suggesting otherwise). We also include subword-level models, with vocabularies generated with SentencePiece (Kudo and Richardson, 2018). We chose a vocabulary size of 16 thousand for English–German and English–Arabic, as that roughly corresponds to a downsampling factor of 4. For English–Turkish, we used a vocabulary size of 8 thousand (we also tested 16 thousand, but it performed worse).
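Byte-level tokenization in this sense is a one-liner; e.g. (special tokens and ID offsets omitted):

```python
def byte_tokenize(text):
    # every sentence becomes a sequence of UTF-8 byte values (vocab size 256),
    # identical across languages; non-ASCII characters span multiple bytes
    return list(text.encode("utf-8"))
```

An ASCII character is a single byte, while e.g. a German umlaut or an Arabic letter takes two.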
We use the Transformer model with the same parameters as Transformer Base Vaswani et al. (2017). We train our models using AdamW Loshchilov and Hutter (2017) with a learning rate of 2e-4, a linear warmup of 4000 steps, a batch size of 128, and label smoothing factor of 0.1. The learning rate was chosen from a grid search, the batch size chosen empirically, and the warmup steps and label smoothing factor were based on Libovickỳ et al. (2021). We use an early stopping criterion of no improvement on the validation set with a patience of 10. All of our models are trained on a single Nvidia V100 (32GB) GPU.
Appendix C Comparison of GBST modifications on Arabic–English
In addition to German–English (Table 2), we compare the GBST modifications on Arabic–English, shown in Table 5. The results are more promising for the masking method than those on English–German; however, it is still outperformed by the removal method.