Neural NLP models typically embed the sequence of input tokens using a lookup table of learnable parameters, where each row represents a token type as a dense vector bengio2003neural. The same embedding matrix is often reused to predict the output in language models press-wolf-2017-using. How essential are embeddings to the model’s success? Intuitively, one would expect them to be critical, given the ubiquitous use of embedding layers and the vast number of parameters they typically consume. In this work, we show that machine translation models can be trained without any embedding parameters, and that they can rival and sometimes even outperform standard embedding-based models.
We remove the trainable embedding matrix from a standard transformer machine translation model, and use a constant one-hot encoding of the vocabulary instead. To limit the dimensionality, we use byte tokenization by reading the text as a Unicode (UTF-8) byte stream, which can represent virtually every text in any language in under 256 dimensions per token. Byte vocabularies obviate the need to preprocess the text with hand-crafted language-specific tokenizers koehn-etal-2007-moses; adler-elhadad-2006-unsupervised and subword induction algorithms, such as BPE Sennrich_2016; kudo-2018-subword.
Machine translation experiments on 10 language pairs show that models without a trainable embedding matrix perform on par with the best embedding-based baselines. We find that embeddingless models consistently achieve higher BLEU scores than their byte baselines, and even yield slightly better performance than embedding-based character models in 80% of the cases. Although the recent literature on character-based transformers demonstrates the superiority of subword tokenization when controlling for network depth gupta2019character; wang2019neural; gao-etal-2020-character, our experiments show that removing the embedding matrix from byte-to-byte models makes them perform at least as well as standard subword models in 9 out of 20 cases. Overall, our results suggest that highly-parameterized embedding matrices might not be as essential as commonly perceived.
[Figure 1: Subword, character, and byte tokenization of the original text “Будь здоров.”]
2 Byte Tokenization
To enable an embeddingless model that reads (and predicts) a sequence of one-hot vectors, we must cap the number of token types at a relatively low number. While character tokenization could work for language pairs with certain writing systems (e.g. English, Arabic, Russian), it will not scale well for others (e.g. Chinese, Japanese). Instead, we represent text as a sequence of bytes based on UTF-8 encoding. This allows us to represent virtually every computerized text in the world by representing each character as a variable number of bytes; English characters are typically represented by a single byte, with other systems taking two (e.g. Arabic), three (e.g. Chinese), or four (e.g. emojis) bytes per character. Using byte tokenization also ensures that there are no out-of-vocabulary items (“unks”) by definition. Figure 1 illustrates the difference between subword, character, and byte tokenization.
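The byte tokenization described above can be illustrated with a minimal Python sketch. The helper names `byte_tokenize` and `byte_detokenize` are hypothetical and not taken from the authors' code, and replacing malformed byte sequences on decoding is an assumption of this sketch rather than something the text specifies.

```python
from typing import List


def byte_tokenize(text: str) -> List[int]:
    """Return the UTF-8 byte values (0-255) of `text`, one token per byte."""
    return list(text.encode("utf-8"))


def byte_detokenize(tokens: List[int]) -> str:
    """Invert byte_tokenize. Malformed sequences are replaced with U+FFFD
    (an assumption of this sketch; a model may emit invalid byte strings)."""
    return bytes(tokens).decode("utf-8", errors="replace")


# Number of bytes per character for several writing systems.
for ch in ["R", "م", "中", "😀"]:
    print(repr(ch), byte_tokenize(ch))

# The example sentence from Figure 1: every token is a byte value below 256,
# so no out-of-vocabulary item can occur.
tokens = byte_tokenize("Будь здоров.")
print(len(tokens), all(0 <= t < 256 for t in tokens))
```

The sequence grows with the number of bytes per character, so text in scripts that need two to four bytes per character yields longer token sequences than its character count suggests.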
3 Embeddingless Model
Our model is based on the original transformer encoder-decoder vaswani2017attention with one main difference: we completely remove the input and output trainable token embedding layers. These layers are usually merged into one matrix that contains a $d$-dimensional embedding vector for each source and target vocabulary item.
Instead, we use a fixed one-hot representation of our byte vocabulary; for instance, the character “R” is represented as a vector with 1 at dimension 82 and 0 elsewhere. Since it is standard practice to use representations of more than 256 dimensions, we simply pad the remainder of the vector with zeroes.
To predict the next token, we take the output of the last transformer decoder layer and apply a softmax across each vector’s dimensions (without multiplying it by an embedding matrix). Formal expressions of the input and output of our model and the original transformer are detailed in Figure 2.
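As a rough illustration of the input and output sides described above, the following pure-Python sketch builds a zero-padded one-hot vector and applies a softmax directly to a decoder output vector. The function names are hypothetical, the width of 512 is taken from the hidden dimension reported in the experimental setup, and the learned scaling parameters are left out for brevity.

```python
import math
from typing import List

D_MODEL = 512  # hidden dimension reported in the experimental setup


def one_hot(byte_value: int, d_model: int = D_MODEL) -> List[float]:
    """Fixed, non-trainable input vector: 1 at index `byte_value`, 0 elsewhere.
    Indices 256..d_model-1 are never set and act as zero padding."""
    if not 0 <= byte_value < 256:
        raise ValueError("byte_value must be in [0, 256)")
    vec = [0.0] * d_model
    vec[byte_value] = 1.0
    return vec


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict_distribution(decoder_output: List[float]) -> List[float]:
    """Softmax over the dimensions of one decoder output vector,
    with no multiplication by an embedding matrix."""
    return softmax(decoder_output)


x = one_hot(ord("R"))
print(x.index(1.0), len(x))
```

In this sketch the softmax runs over all dimensions of the vector, as the text describes; how probability mass on the padded dimensions is handled at decoding time is not covered here.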
We also remove the dropout layers on the encoder input and decoder output, since zeroing out entries of one-hot vectors is equivalent to randomly masking out input tokens or deleting significant parts of the model’s predicted distribution. However, we find that using dropout on the decoder input (prefix of the target sequence fed with teacher forcing) does have a small positive effect in preliminary experiments, and apply it in our main experiments.
Omitting the embedding layer removes the $|V| \times d$ parameters of the embedding matrix, where $|V|$ is the vocabulary size and $d$ is the model dimension. (For subword tokenization, this accounts for a significant portion of the parameter budget, but for byte-based models the added parameter cost is negligible.) We do add to our model a total of 3 parameters to scale the encoder and decoder’s (one-hot) inputs and the decoder’s output (before the softmax). We initialize all three with $\sqrt{d}$, akin to the constant scaling factor typically applied to the input embedding layer in transformers.
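To make the parameter savings concrete, the sketch below counts the entries of a shared embedding matrix (vocabulary size times model dimension) under the settings reported in the experimental setup. The subword vocabulary size of 10,000 is an assumption that equates vocabulary size with the number of BPE merges; the true vocabulary is somewhat larger.

```python
D_MODEL = 512           # hidden dimension from the experimental setup
BYTE_VOCAB = 256        # number of possible byte values
SUBWORD_VOCAB = 10_000  # assumption: roughly the number of BPE merges
SCALING_PARAMS = 3      # scalars added by the embeddingless model


def embedding_params(vocab_size: int, d_model: int = D_MODEL) -> int:
    """Number of entries in a shared vocab_size x d_model embedding matrix."""
    return vocab_size * d_model


print("subword embedding matrix:", embedding_params(SUBWORD_VOCAB))  # 5120000
print("byte embedding matrix:   ", embedding_params(BYTE_VOCAB))     # 131072
print("embeddingless model:     ", SCALING_PARAMS)
```

Under these assumptions a byte embedding matrix is about 39 times smaller than a subword one, which matches the observation that the savings matter mostly for subword models.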
We train byte-tokenized embeddingless models for machine translation and compare them to standard embedding-based models with byte, character, and subword tokenization.
We use the IWSLT datasets of English TED talks translated into other languages cettolo2014report, selecting 10 additional languages with varying characteristics (see Table 1). All languages use the IWSLT2014 data except for Vietnamese (IWSLT2015) and Japanese (IWSLT2017). For each such language, we train translation models both to and from English. We clean the training data for every language pair by first removing sentences longer than 800 bytes, and then the sentences with the largest byte-length ratio between source and target such that we remove a total of 5% of the training examples.
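One possible reading of this cleaning procedure is sketched below. It rests on several assumptions: the 800-byte limit applies to either side of a pair, the length ratio is symmetric (longer side over shorter side), and the 5% budget counts the pairs already removed by the length filter. The function names are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def byte_len(s: str) -> int:
    """Length of `s` in UTF-8 bytes."""
    return len(s.encode("utf-8"))


def length_ratio(src: str, tgt: str) -> float:
    """Symmetric byte-length ratio (assumption: longer side over shorter side)."""
    a, b = byte_len(src), byte_len(tgt)
    return max(a, b) / max(1, min(a, b))


def clean_pairs(pairs: List[Pair], max_bytes: int = 800,
                remove_fraction: float = 0.05) -> List[Pair]:
    """Drop over-long pairs, then the most length-imbalanced pairs,
    until `remove_fraction` of the original examples has been removed."""
    budget = int(len(pairs) * remove_fraction)
    kept = [p for p in pairs
            if byte_len(p[0]) <= max_bytes and byte_len(p[1]) <= max_bytes]
    # Remaining removals after the length filter (never negative).
    n_extra = max(0, budget - (len(pairs) - len(kept)))
    by_ratio = sorted(range(len(kept)),
                      key=lambda i: length_ratio(*kept[i]), reverse=True)
    drop = set(by_ratio[:n_extra])
    return [p for i, p in enumerate(kept) if i not in drop]
```

Selecting indices rather than sorting the data in place keeps the surviving pairs in their original order.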
In addition to the byte-based embeddingless transformer, we train standard transformer encoder-decoder models as baselines, each one using a different tokenization scheme: subword, character, and byte. For subword tokenization, we apply the Moses tokenizer koehn-etal-2007-moses followed by BPE Sennrich_2016. Both character and byte tokenizations apply no additional preprocessing at all, and include whitespaces as valid tokens.
The code for our model and baselines is based on the Fairseq ott-etal-2019-fairseq implementation of the transformer encoder-decoder model. During preprocessing we use 10,000 merging steps when building the BPE vocabulary for every language pair. The vocabularies and embeddings are always shared among source and target languages. In every transformer we use 6 encoder and decoder layers, 4 attention heads, a hidden dimension of 512, and a feed-forward dimension of 1024. We optimize with Adam kingma2014adam, using the inverse square root learning rate scheduler with 4000 warmup steps and a peak learning rate of , label smoothing of 0.1, and weight decay of . We train each model for 50k steps, and average the top 5 checkpoints according to the validation loss. We tune dropout (0.2 or 0.3) on the validation set. We set the batch size according to a maximum of 64,000 bytes per batch, which controls for the number of batches per epoch across different tokenization methods.
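The inverse square root schedule mentioned above can be sketched as follows. This is a generic formulation that assumes linear warmup from zero, not a copy of any particular library's implementation. The peak learning rate is passed in as an argument; the value 1.0 used below is only a placeholder so that the printed numbers read as fractions of the peak.

```python
def inverse_sqrt_lr(step: int, peak_lr: float, warmup_steps: int = 4000) -> float:
    """Linear warmup to `peak_lr`, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        # Assumption: warmup starts from a learning rate of zero.
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


# Placeholder peak of 1.0: outputs are multipliers of the real peak rate.
for step in (2000, 4000, 16000, 50000):
    print(step, round(inverse_sqrt_lr(step, peak_lr=1.0), 4))
```

With 4000 warmup steps and 50k training steps, the rate at the end of training is a little under 30% of its peak.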
We evaluate our models using SacreBLEU, case-sensitive, with the 13a tokenizer for all languages except Chinese (ZH tokenizer) and Japanese (MeCab tokenizer). We use the original raw text as the reference for all of our experiments, instead of using the default tokenized-detokenized version, which normalizes the text and gives an artificial advantage to text processed with Moses.
Table 2 shows our experiments’ results. Every row describes the test BLEU scores of the embeddingless model and the three baselines trained on a different language pair. We discuss the implications of these results below.
Are embeddings essential?
The results show that it is indeed possible to train embeddingless machine translation models that perform competitively. The performance gaps between models with different tokenization schemes are relatively small. With the exception of Vietnamese, the difference between the embeddingless model and the best embedding-based model is always under 1 BLEU.
In the most controlled setting, where we compare byte-based models with and without learnable embeddings, models without embeddings achieve higher BLEU scores in 19 of 20 cases (and an equal score for ru-en), with a boost of about 0.5 BLEU on average. When compared to models based on character embeddings, the embeddingless byte-to-byte approach yields higher BLEU scores in 16 out of 20 cases, though the average difference is quite small in practice (0.26 BLEU).
Is subword tokenization superior to bytes or characters?
Previous work in machine translation shows that subword models consistently outperform character or byte-based models gupta2019character; wang2019neural; gao-etal-2020-character. However, our results indicate that this is not necessarily the case. When translating from English to a foreign language, embeddingless byte-to-byte models achieve performance that is equal to or better than that of subword embedding models in 8 out of 10 cases. We observe a different trend when translating into English, where subword models match or surpass other models for every source language. (The fact that Moses is a particularly good tokenizer for English, and less so for other languages, is perhaps related to this phenomenon.) Whereas prior work proposed closing the performance gap by adding layers to the basic architecture, under the assumption that character-based models lack capacity or expressiveness, our results show that actually removing a component from the model can improve performance under certain conditions. It is possible that transformer-based character and byte-based models encounter an optimization issue rather than one of capacity or expressivity.
Why does removing the embedding matrix improve the performance of byte-based models?
We hypothesize that, unlike words and subwords, bytes are orthogonal; i.e., they do not have semantic similarities with each other. Following that intuition, can byte models still benefit from learning an embedding layer if it is only initialized as a fully orthogonal one-hot matrix? We conduct this experiment on en-ru data and find that one-hot initialization does help the byte-embedding model achieve a slightly better score than the baseline (17.6 BLEU with one-hot initialization versus 17.4 without), but it does not match the performance of the embeddingless model (18.2 BLEU), which uses constant one-hot representations.
Can forcing orthogonality improve subword models?
To test whether fixed orthogonal vector representations can also improve subword models, we train a subword model on the en-ru dataset, in which we freeze the randomly initialized embedding matrix and do not update it while training. (Random initialization creates vectors with relatively low similarity in practice, approximating orthogonality; one-hot representations of subword vocabularies are impractical.) This model achieves 16.3 BLEU, a degradation of 1.8 points from the subword baseline. The result suggests that, unlike characters and bytes, learnable embeddings do benefit subword models (though the gap might not be as significant as one would expect, considering the number of parameters involved).
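The remark that random initialization approximates orthogonality can be checked with a small pure-Python experiment. The sketch assumes standard Gaussian initialization and a dimension of 512; neither detail is specified for the frozen embedding matrix in the text, and the function names are hypothetical.

```python
import math
import random
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(num_vectors: int = 50, dim: int = 512, seed: int = 0) -> float:
    """Average |cosine| over all pairs of random Gaussian vectors."""
    rng = random.Random(seed)
    vecs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_vectors)]
    sims = [abs(cosine(vecs[i], vecs[j]))
            for i in range(num_vectors) for j in range(i + 1, num_vectors)]
    return sum(sims) / len(sims)


# One-hot vectors are exactly orthogonal; random vectors are only nearly so.
print(mean_abs_cosine())
```

For independent Gaussian vectors the cosine similarity concentrates around zero with a spread on the order of 1/sqrt(dim), so in 512 dimensions the average absolute similarity is small but not zero.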
6 Related Work
There is prior work on replacing language-specific tokenizers with more universal tokenization approaches. SentencePiece kudo-richardson-2018-sentencepiece takes a raw Unicode string and tokenizes it into subwords using BPE Sennrich_2016 or unigram LM kudo-2018-subword. Byte BPE wang2019neural extends SentencePiece to operate at the byte level. While this approach is indeed more language agnostic than heuristic tokenizers, it does suffer from performance degradation when no pre-tokenization (e.g. splitting by whitespace) is applied (see https://github.com/google/sentencepiece/blob/master/doc/experiments.md). Moreover, the assumption that subword units must be contiguous segments does not hold for languages with non-concatenative morphology such as Arabic and Hebrew.
Character-to-character machine translation models lee-etal-2017-fully treat the text as a sequence of characters and do not require any form of preprocessing or word tokenization. Although earlier results on LSTM-based models show that character tokenization can outperform subword tokenization cherry-etal-2018-revisiting, recent literature shows that the same does not hold for transformers gupta2019character; wang2019neural; gao-etal-2020-character. To narrow the gap, recent work suggests using deeper models gupta2019character or specialized architectures gao-etal-2020-character. Our work deviates from this trend by removing layers to improve the model. This observation contests the leading hypothesis in existing literature – that the performance gap results from reduced model capacity – and suggests that the problem may be one of optimization.
This work tests the importance of the embedding matrix in neural machine translation models. Experiments on 10 different languages show that, despite its ubiquitous usage, competitive models can be trained without any embeddings. Future work may investigate the potential of embeddingless models for different NLP tasks, and explore new methods to improve training in byte-level models.