Neural Machine Translation with Byte-Level Subwords

09/07/2019 ∙ by Changhan Wang, et al. ∙ Facebook

Almost all existing machine translation models are built on top of character-based vocabularies: characters, subwords or words. Rare characters from noisy text or character-rich languages such as Japanese and Chinese, however, can unnecessarily take up vocabulary slots and limit the vocabulary's compactness. Representing text at the level of bytes and using the 256-byte set as the vocabulary is a potential solution to this issue. High computational cost has, however, prevented it from being widely deployed or used in practice. In this paper, we investigate byte-level subwords, specifically byte-level BPE (BBPE), which is more compact than a character vocabulary and has no out-of-vocabulary tokens, but is also more efficient than using pure bytes. We claim that contextualizing BBPE embeddings is necessary, which can be implemented by a convolutional or recurrent layer. Our experiments show that BBPE has comparable performance to BPE while its size is only 1/8 of that for BPE. In the multilingual setting, BBPE maximizes vocabulary sharing across many languages and achieves better translation quality. Moreover, we show that BBPE enables transferring models between languages with non-overlapping character sets.


Introduction

It has become a standard practice to build a vocabulary in neural machine translation (NMT) [1, 28] using byte-pair encoding (BPE) [27]. In this practice, we notice that BPE is used at the level of characters rather than at the level of bytes, the latter being more common in data compression. We suspect this is because text is naturally represented as a sequence of characters, although it has recently been noticed that the byte representation of text has its own advantages, such as compactness (at most 256 possible values) and being language-agnostic.

In this paper, we look into byte-level “subwords” that are used to tokenize text into variable-length byte n-grams, as opposed to character-level subwords, in which we represent text as a sequence of character n-grams. We specifically focus on byte-level BPE (BBPE), examining compact BBPE vocabularies in both bilingual and multilingual settings as well as in a novel setup of transfer learning to a new language with a non-overlapping character set.

Figure 1: BPE (upper) and BBPE (lower) tokenization of a Japanese sentence. Bytes (from partial characters) are represented by hexadecimal digits.

Byte Level Text Representation

Encoding Byte-Level Representation

We consider UTF-8 encoding of text, which encodes each Unicode character into 1 to 4 bytes. This allows us to model a sentence as a sequence of bytes instead of characters. While there are 138K Unicode characters covering over 150 languages, we represent a sentence in any language as a sequence of UTF-8 bytes (248 out of 256 valid bytes).
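To make the byte-level view concrete, the following minimal sketch (with a hypothetical example sentence, not taken from the paper's data) prints the UTF-8 byte sequence of a string in hexadecimal, mirroring the notation of Figure 1.

```python
# Minimal sketch: viewing a sentence as a sequence of UTF-8 bytes.
# The example sentence is hypothetical, not taken from the paper's data.
sentence = "これはテストです"           # "This is a test" in Japanese
byte_seq = sentence.encode("utf-8")     # 1-4 bytes per Unicode character

print(len(sentence), "characters ->", len(byte_seq), "bytes")
print(" ".join(f"{b:02X}" for b in byte_seq))  # hexadecimal, as in Figure 1
```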

A byte sequence representation of text is often much longer (up to 4x) than a character sequence representation, which makes it computationally demanding to use bytes as they are. As an alternative, we consider segmenting a byte sequence into variable-length n-grams (byte-level “subwords”). Specifically, we learn a BPE vocabulary on the byte-level representation, which extends the UTF-8 byte set with byte n-grams. We denote this type of vocabulary as B(yte-level)BPE in the rest of the paper. Figure 1 shows an example of BBPE tokenization.
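The paper does not include code, but the core of learning such a vocabulary is ordinary BPE run over byte symbols. A toy sketch of that merge loop, starting from single UTF-8 bytes and greedily merging the most frequent adjacent pair, might look as follows (the corpus and merge count are placeholders; the actual vocabularies in the paper are learned with SentencePiece, as described later).

```python
from collections import Counter

def learn_bbpe(corpus, num_merges):
    """Toy byte-level BPE: start from single UTF-8 bytes and greedily merge
    the most frequent adjacent pair. Illustrative only, not the paper's code."""
    # Each sentence becomes a list of 1-byte symbols (tuples of byte values).
    seqs = [[(b,) for b in s.encode("utf-8")] for s in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        merged = a + b  # concatenate byte tuples into a longer byte n-gram
        new_seqs = []
        for seq in seqs:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    # Final vocabulary: all 256 single bytes plus the learned byte n-grams.
    vocab = {(b,) for b in range(256)} | {a + b for a, b in merges}
    return merges, vocab

# Usage sketch on a tiny hypothetical corpus:
merges, vocab = learn_bbpe(["низкий", "низко", "low", "lower"], num_merges=10)
print(len(vocab), "symbols; first merge:", merges[0])
```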

BBPE symbols can be partial characters shared by different characters or combinations of complete and partial characters. This arbitrariness may necessitate incorporating a larger context surrounding each symbol for disambiguation and for learning the character boundaries. In this work, we base our experiments on Transformer [29] models. We propose to use either a depth-wise convolutional layer [11] or a bidirectional recurrent layer with gated recurrent units [6, GRU] to contextualize BBPE embeddings before feeding them into the model:

x_ctx_emb = DepthWiseConv(x_emb)    or    x_ctx_emb = BiGRU(x_emb)
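As an illustration of these two options (not the authors' implementation; layer sizes and kernel width are assumptions), a PyTorch sketch of contextualizing (B)BPE embeddings with either a depth-wise convolution or a Bi-GRU could look like this:

```python
import torch
import torch.nn as nn

class BBPEContextualizer(nn.Module):
    """Contextualize (B)BPE token embeddings before the Transformer.
    Illustrative sketch; dimensions and kernel size are assumptions."""
    def __init__(self, d_model=512, mode="bigru", kernel_size=3):
        super().__init__()
        self.mode = mode
        if mode == "conv":
            # Depth-wise convolution: groups=d_model gives one filter per channel.
            self.ctx = nn.Conv1d(d_model, d_model, kernel_size,
                                 padding=kernel_size // 2, groups=d_model)
        else:
            # Bi-GRU with half-size hidden state so outputs stay d_model wide.
            self.ctx = nn.GRU(d_model, d_model // 2, batch_first=True,
                              bidirectional=True)

    def forward(self, x_emb):              # x_emb: (batch, seq_len, d_model)
        if self.mode == "conv":
            return self.ctx(x_emb.transpose(1, 2)).transpose(1, 2)
        out, _ = self.ctx(x_emb)
        return out

# Usage: contextualized embeddings are then fed to the Transformer encoder/decoder.
emb = torch.randn(2, 10, 512)
print(BBPEContextualizer(mode="conv")(emb).shape)   # torch.Size([2, 10, 512])
print(BBPEContextualizer(mode="bigru")(emb).shape)  # torch.Size([2, 10, 512])
```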

Decoding with Byte-Level Subwords

While any sentence can be represented as a byte sequence, the converse is not necessarily true: there are byte sequences that do not translate to valid character sequences. In our system, we try to recover as many Unicode characters as possible from the potentially corrupted byte sequence in the model output. The algorithm is as follows. For a given byte sequence {B_1, ..., B_N}, we denote the maximum number of characters that we can recover from its first k bytes as f(k). Then f(k) has optimal substructure and can be solved by dynamic programming:

f(k) = max_{t=1,2,3,4} { f(k−t) + g(k−t+1, k) }    (1)

where g(i, k) = 1 if B_i, ..., B_k corresponds to a valid character and g(i, k) = 0 otherwise. When f(k) is calculated recursively, we also record the selection at each position so that we can recover the solution through backtracking. The design of UTF-8 encoding ensures the uniqueness of this recovery process: for a character UTF-8 encoded with multiple bytes, its trailing bytes do not form a valid UTF-8 encoded character on their own. Hence the best selection in Eq. 1 is unique and so is the final solution.
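A minimal Python sketch of this recovery procedure (following the dynamic program of Eq. 1, with validity of a 1-4 byte span checked by attempting a strict UTF-8 decode; not the authors' code):

```python
def recover_characters(byte_seq: bytes) -> str:
    """Recover the maximum number of valid characters from a possibly
    corrupted UTF-8 byte sequence via dynamic programming (cf. Eq. 1).
    Illustrative sketch, not the authors' implementation."""
    n = len(byte_seq)
    f = [0] * (n + 1)          # f[k]: max #characters recoverable from first k bytes
    back = [None] * (n + 1)    # back[k]: (previous index, decoded char or None)

    def valid_char(i, k):
        """Return the character for bytes [i, k) if they form exactly one valid char."""
        try:
            c = byte_seq[i:k].decode("utf-8")
            return c if len(c) == 1 else None
        except UnicodeDecodeError:
            return None

    for k in range(1, n + 1):
        # Option 1: the k-th byte does not end a valid character here (g = 0).
        f[k], back[k] = f[k - 1], (k - 1, None)
        # Option 2: the last t bytes (t = 1..4) form one valid UTF-8 character (g = 1).
        for t in range(1, 5):
            if k - t >= 0:
                c = valid_char(k - t, k)
                if c is not None and f[k - t] + 1 > f[k]:
                    f[k], back[k] = f[k - t] + 1, (k - t, c)

    # Backtrack to rebuild the recovered string.
    chars, k = [], n
    while k > 0:
        prev, c = back[k]
        if c is not None:
            chars.append(c)
        k = prev
    return "".join(reversed(chars))

# Usage: a Japanese string with a stray partial character left in the middle.
corrupted = "日本語".encode("utf-8")[:-2] + "語".encode("utf-8")
print(recover_characters(corrupted))  # the stray byte is skipped
```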

Experimental Settings

Datasets

We run experiments on three bilingual corpora (En-De, Ja-En and Si-En) as well as a many-to-English multilingual dataset (X-En). Table 1 shows overview statistics of these datasets. We learn (B)BPE vocabularies jointly on source and target sentences using SentencePiece [14].

        En-De   Ja-En   Si-En   X-En
Train   4.5M    3.5M    405K    5.1M
Dev     3K      4K      3K      22K
Test    3K      12K     3K      165K
Table 1: Dataset statistics in number of sentences. The X-En dev set is sub-sampled from the full 135K development set.
N   d_model   d_ff   h    P_drop   Params
5   512       2048   2    0.4      40M
6   512       2048   8    0.1      46M
6   1024      4096   16   0.3      184M
Table 2: Transformer models used in the experiments (using the notations in Vaswani et al. 2017).

Models and Learning

We use Fairseq [22] to train Transformers [29] with the same learning rate schedule as in the original paper. All model configurations are listed in Table 2 (assuming a 4K vocabulary in parameter counting). We set attention and ReLU dropout to 0.1, except for Si-En, for which we use 0.2. We use 0.2 residual dropout for models in X-En.

Inference and Evaluation

We set the beam width to 4 for En-De and 5 for the others, and use the best checkpoint by validation loss to generate predictions. We calculate case-sensitive tokenized BLEU [23] as the metric using sacreBLEU [24].
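For reference, a minimal sacreBLEU call in Python looks roughly as follows (the hypothesis and reference strings are placeholders; the authors' exact evaluation pipeline is not shown in the paper):

```python
import sacrebleu

# Hypothetical system outputs and one stream of references, sentence-aligned.
hyps = ["the cat sat on the mat", "hello world"]
refs = [["the cat sat on the mat", "hello , world"]]

score = sacrebleu.corpus_bleu(hyps, refs)  # case-sensitive by default
print(f"BLEU = {score.score:.2f}")
```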

Results and Analysis

Qualitative Comparison: BPE vs. BBPE

Symbol Frequency Distribution

Since the construction of the BBPE vocabulary starts from the UTF-8 byte set, it has the flexibility of decomposing rare characters into byte n-grams from the vocabulary instead of including them directly. This frees vocabulary slots for other frequent symbols. Figure 2 compares the symbol frequency distributions of BPE and BBPE. We can see that BBPE symbols are more evenly distributed than BPE ones, even though the latter are already much more evenly distributed than pure characters. By setting different BBPE vocabulary sizes, we can control the level of rare character decomposition and of symbol sharing across different characters. Table 3 shows the ratio of BBPE tokens containing partial characters. We can see that a large portion of rare characters are decomposed on Ja-En and X-En, which have large character sets of 8K and 11K characters, respectively. Figure 5 provides an example from Ja-En tokenized with different BBPE vocabularies, where we can see what tokens look like as the tokenization granularity goes from fine to coarse.

Figure 2: Symbol frequencies (in log2 scale) for En-De (top) and X-En (bottom) vocabularies. BBPE enables a more consistent distribution of vocabulary across frequencies.
BBPE 2K 4K 8K 16K 32K
En-De 4.3% 4.9% 5.5% 6.1% 6.5%
Ja-En 46.0% 47.6% 49.4% 51.2% 34.8%
X-En 36.8% 39.1% 41.3% 43.6% 23.0%
Table 3: Ratio of BBPE tokens with partial characters.
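A BBPE token contains partial characters exactly when its byte string does not decode into complete UTF-8 characters. A small sketch of how a ratio like the one in Table 3 could be computed (the example tokens are hypothetical):

```python
def has_partial_characters(token_bytes: bytes) -> bool:
    """True if a byte-level token cannot be decoded into complete UTF-8
    characters, i.e. it contains partial characters. Illustrative sketch."""
    try:
        token_bytes.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True

# Hypothetical BBPE tokens: 'E3 81 93 E3 81' mixes one complete kana with a
# partial one, while 'E3 81 93 E3 82 8C' is two complete characters.
tokens = [bytes.fromhex("E38193E381"), bytes.fromhex("E38193E3828C"), b"the_"]
ratio = sum(has_partial_characters(t) for t in tokens) / len(tokens)
print(f"{ratio:.1%} of tokens contain partial characters")
```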

Cross-Lingual Sharing

In the multilingual setting, symbol sharing also happens across different languages despite their different writing systems. This allows maximizing parameter sharing not only in the model but also in the vocabulary of a universal model. Figure 3 illustrates the level of BBPE symbol sharing across the top 5 languages (by number of training examples) in X-En, whose writing systems differ from each other.

Figure 3: The number of languages (among Ar, He, Ru, Ko and It, from X-En) that each symbol is shared by. Note that these languages have mutually different character sets.

Impact on Sequence Lengths

Compared to BPE, BBPE symbols are generally finer-grained with shorter byte-level lengths, which results in longer tokenized sequences as well as longer training and inference time. BBPE, however, is optimized for a compression-based objective (the same as BPE) and is still more efficient than a character vocabulary. Table 4 lists the average lengths of training sentences tokenized with different vocabularies. We observe that sentences tokenized with BBPE are significantly shorter than the character-tokenized ones, even when the BBPE vocabulary is much smaller (for example, a few thousand symbols versus the 11K character set on X-En). Another observation is that the source-target length ratio for BBPE tends to be much larger when the source and target character sets have very different sizes (for example, 11K on the X-En source side and 0.1K on the target side), and this situation becomes more severe as the BBPE vocabulary size increases. In this case, alignments may be more difficult to learn during model training, since target tokens more often need to attend to multiple source tokens.

Byte BBPE Char BPE
256 1K 2K 3K 4K 8K 11K 16K 32K 3K 8K 11K 8K 16K 32K
En-De Source 143 57 48 43 41 36 33 30 143 40 33 31
Target 160 64 55 50 48 42 38 35 157 43 36 32
Ja-En Source 55 28 26 24 23 21 21 19 12 10
Target 53 23 20 17 15 15 13 52 15 14
X-En Source 126 77 70 65 62 60 59 57 89 40 32
Target 103 49 43 37 33 32 30 27 103 35 30
Table 4: Average lengths of training sentences tokenized with different vocabularies.

Importance of Contextualization

We compare three different ways of contextualizing token embeddings (none, 1-layer convolution and 1-layer Bi-GRU) on X-En with the base model. We observe from Figure 4 that all kinds of vocabularies benefit from embedding contextualization. Performance gains are more significant for fine-grained vocabularies: byte, character and BBPE. For BBPE, long-range contextual information from the Bi-GRU brings a clear gain in validation BLEU in all cases. Encoding context in the token embeddings reduces the difficulty of learning attention over multiple source tokens and makes model training easier. In the following experiments, we contextualize BBPE with a Bi-GRU by default. We denote (B)BPE with Bi-GRU as “(B)BPE size+” and the variant without contextualization as “(B)BPE size”; we similarly define “Byte+” and “Char+”.

Figure 4: X-En validation BLEU for models without contextualization, with local contextualization (depth-wise convolution) and with long-range contextualization (Bi-GRU).

BBPE on Noisy Character Sets

The En-De training set has quite a few noisy sentence pairs that often contain non-Latin characters due to misalignment and code-switched sentences. This leads to a 3.4K character set, while, in contrast, English and German each have fewer than 30 letters in their alphabets. Since BPE includes all characters, those rare characters waste quite a lot of BPE vocabulary slots. For comparison, we experiment with small BBPE 2K and 4K vocabularies, from which rare characters are excluded. We find that their performance is comparable to the BPE 32K baseline while having smaller model capacity (see Table 5).

                Test BLEU   Params
Base model
  Byte+         26.59       45M
  BBPE 2K+      26.98       47M
  BBPE 4K+      27.08       47M
  Char+         26.73       47M
  BPE 32K       27.31       61M
  BPE 32K+      27.41       62M
  BPE 37K       27.3        65M
Big model
  Byte+         26.94       181M
  BBPE 2K+      28.78       183M
  BBPE 4K+      28.27       185M
  Char+         27.24       185M
  BPE 32K       28.36       210M
  BPE 32K+      28.77       215M
  BPE 37K       28.4        213M
Table 5: En-De test BLEU for the base (upper) and big (lower) Transformer models. The BPE 37K rows are the results reported by Vaswani et al. [29].

BBPE on Character-Rich Languages

Languages using logographic writing systems, such as Chinese and Japanese, can have over 50K characters, while only a small portion of them are frequently used. Our Ja-En dataset has a character set of 7.9K characters, where 99.99% of the tokens in the training set are covered by the top 2.4K characters. With this observation, we experiment with BBPE 4K, which is roughly half of the character set size. We find that BBPE is comparable to BPE and even outperforms BPE when using the larger model (see Table 6).

                          KFTT    TED     JESC    All
# of train samples        440K    223K    2.8M    3.5M
# of test samples         1.2K    8.5K    2K      11.7K
Michel and Neubig (2018)  20.77   13.25   18.00   -
Base model
  Byte+                   23.12   15.14   15.69   16.27
  BBPE 4K+                24.15   15.59   16.10   16.80
  Char+                   23.67   15.26   15.68   16.43
  BPE 16K+                23.63   16.15   16.18   17.19
Big model
  Byte+                   23.68   16.08   16.29   17.46
  BBPE 4K+                23.88   19.0    17.93   19.58
  Char+                   23.71   16.69   17.01   18.33
  BPE 16K+                24.08   18.34   17.89   19.14
Table 6: Ja-En test BLEU scores for the base (upper) and big (lower) models.

BBPE on Many-to-En Translation

Our many-to-En dataset contains 58 languages (paired with English) and 10.8K characters from different writing systems, between which characters are not necessarily shared. The characters, however, share byte n-grams. We experiment with BBPE 2K and 4K, which are much smaller than the baseline BPE vocabularies (16K and 32K). As shown in Table 7, both of them beat the BPE baselines on overall BLEU as well as on most of the individual languages, both high- and low-resource (note that the test set is as large as 165K sentences, so even small gaps in BLEU may indicate significant differences). We also notice that the byte model and the character model perform significantly better than all BPE and BBPE models in this multilingual setting. This might be because BBPE and BPE suffer from imbalanced source and target sentence lengths as well as varying token granularities in multilingual parallel sentences (sources in different languages and granularities mapped to the same targets). Nonetheless, BBPE is still the most practical solution, since it strikes a good balance between performance (better BLEU than BPE) and speed (much shorter tokenized sentences than characters and bytes).

                        Ar      De      He      It      Az      Be      Gl      Sk      All     Params
# of train examples     213K    167K    211K    203K    5.9K    4.5K    10K     61K     5.1M
# of test examples      6K      4.5K    5.5K    5.6K    0.9K    0.7K    1K      2.4K    165K
Aharoni et al. (2019)   25.93   28.87   30.19   32.42   -       -       -       -       -       -
Neubig and Hu (2018)    -       -       -       -       11.7    18.3    29.1    28.3    -       -
Byte+                   31.13   35.98   36.77   38.36   14.64   25.12   35.12   33.08   30.38   45M
Char+                   31.52   36.73   36.85   38.62   15.40   24.90   35.44   33.31   30.75   51M
BBPE 2K+                30.79   35.53   36.27   37.82   13.64   24.70   34.17   32.83   29.91   46M
BBPE 4K+                30.64   34.93   36.07   37.62   13.76   24.84   33.90   32.12   29.74   47M
BPE 16K                 29.70   34.35   34.47   37.02   13.28   24.61   33.55   31.72   29.00   53M
BPE 16K+                30.20   34.97   35.55   37.49   12.65   23.66   33.95   32.16   29.62   54M
BPE 32K                 29.02   34.08   34.18   36.63   12.56   22.48   32.33   31.26   28.81   61M
BPE 32K+                29.87   34.64   35.26   37.43   12.35   22.05   33.62   31.61   29.43   62M
Table 7: X-En test BLEU on all 58 languages (“All”), the top-4 (Ar, De, He, It) and the bottom-4 (Az, Be, Gl, Sk) languages by number of training samples. Note that the test set is very large (165K) and even small gaps in BLEU may indicate significant differences.
Vocab       Train   Finetune      BLEU
BPE 5K      Si-En   -             7.2
BBPE 4K+    Si-En   -             7.1
BBPE 4K+    X-En    -             0.3
BBPE 4K+    X-En    enc           8.3
BBPE 4K+    X-En    enc, dec      8.1
BBPE 4K+    X-En    embed, enc    9.0
BBPE 4K+    X-En    all           8.6
Table 8: Transferring the pretrained X-En model to Si-En. BBPE 4K is learned on X-En. The BPE 5K Si-En baseline follows the setup of Guzmán et al. [10].
Figure 5: An example from Ja-En tokenized with different vocabularies. Raw spaces are replaced by underscores and spaces are used to split tokens. We can observe what tokens look like as the tokenization granularity goes from fine to coarse: Byte (256), BBPE (1K, 2K, 4K, 8K), Char (8K), BBPE (16K, 32K), BPE (16K, 32K).

Transfer Learning on Unseen Characters

Because BBPE contains all UTF-8 bytes and has no out-of-vocabulary tokens, BBPE-based models can be transferred between languages with non-overlapping character sets. In comparison, this is impossible with character-based vocabularies without replacing the vocabulary and re-training the embeddings from scratch. Our Si-En dataset contains 77 Sinhala characters that are disjoint from the X-En character set. We experiment with transferring a pretrained (on X-En) BBPE 4K model to this dataset while reusing the original vocabulary. As shown in Table 8, the transferred model gains 0.9-1.8 BLEU points over the baselines, suggesting the generality of pretrained BBPE embeddings and their ability to adapt to different languages with unseen characters. This transfer learning paradigm is free from the limitation of out-of-vocabulary tokens and can be very generic: we show the extreme case of a totally unseen character set, but a pre-trained model may also be transferred to other languages and datasets to improve performance or to warm-start model training and save time.
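A sketch of how such a transfer could be set up in PyTorch (module names such as embed_tokens, encoder and decoder are assumptions, not the authors' code): all parameters of the pretrained X-En model are frozen, and only the selected groups, e.g. embeddings and encoder as in the best row of Table 8, are unfrozen for fine-tuning on Si-En, while the BBPE 4K vocabulary is reused unchanged.

```python
import torch.nn as nn

def prepare_for_transfer(model, finetune=("embed", "enc")):
    """Freeze all parameters, then unfreeze the groups selected for fine-tuning.
    The attribute names (embed_tokens / encoder / decoder) are assumptions,
    not the authors' actual module layout."""
    for p in model.parameters():
        p.requires_grad = False
    groups = {"embed": model.embed_tokens,
              "enc": model.encoder,
              "dec": model.decoder}
    for name in finetune:
        for p in groups[name].parameters():
            p.requires_grad = True
    return model

# Usage on a stand-in model (a real run would load the pretrained X-En
# checkpoint and keep its BBPE 4K vocabulary unchanged, since it already
# covers all bytes and hence all Sinhala characters).
class TinySeq2Seq(nn.Module):
    def __init__(self, vocab=4096, d=32):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, d)
        self.encoder = nn.Linear(d, d)
        self.decoder = nn.Linear(d, vocab)

model = prepare_for_transfer(TinySeq2Seq(), finetune=("embed", "enc"))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print("trainable parameters:", trainable)
```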

Related Work

Machine Translation with Small Granularity

Previous works have shown that, instead of dealing with a huge word-based vocabulary, working at a smaller granularity with a reduced vocabulary size consistently benefits the training and inference of machine translation. For example, methods based on morpheme segmentation [21, 17] have been explored in machine translation. [27] proposed to use byte-pair encoding (BPE) to segment words into subword units in an unsupervised manner. SentencePiece [14] follows similar ideas to BPE, with the extension of training directly from raw sentences, which allows building a purely end-to-end system that does not depend on language-specific pre-processing or post-processing. Our byte-level subwords are mostly based on the implementation of SentencePiece, while we use bytes as the basic units to compose subwords.

Existing works have also explored fully character-level models for machine translation. [13] proposed to build word representations from their characters; [7] removed the restriction of word boundaries and directly learned to decode at the character level; [15] further extended this to a fully character-level NMT model working in a multilingual setting; [5] showed that, given a translation model with enough capacity, character-level models generally outperform other subword-based models. Following a similar line, we also investigate fully byte-level models in our experiments.

Byte-based vocabulary

The closest work to ours is the byte-level BPE vocabulary used in GPT-2, a large-scale English language model [26]. They, however, rely heavily on hard-coded merging rules and do not conduct any analysis of how their byte-level BPE impacts the quality of language modeling. A vocabulary consisting purely of bytes has previously been used in several natural language processing tasks: part-of-speech tagging and named entity recognition [9], translation [8], machine reading [12] and speech recognition [16].

Transformer with Convolution or RNN

There is evidence of performance gains from combining Transformers with convolutional or recurrent layers in NMT [3], speech recognition [16, 19] and language modeling [4].

Conclusion

We proposed BBPE, which builds a byte-level subword vocabulary for machine translation. It results in a much more compact vocabulary than character-based ones without loss of performance, and in multilingual settings it often achieves better translation quality than character-level BPE. BBPE does not have any out-of-vocabulary tokens, allowing us to transfer a model using BBPE between languages with non-overlapping vocabularies. This transfer learning paradigm is very generic and can be applied to other languages and datasets for performance gains or training acceleration. With the same vocabulary size, BBPE segments sentences into shorter sequences than character-based methods do, leading to faster training and inference. Our future work includes: eliminating source-target sentence length imbalance; evaluating BBPE in one-to-many and many-to-many translation settings; and exploring better segmentation algorithms for byte-level subwords.

References

  • [1] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: Introduction.
  • [2] M. Cettolo, C. Girardi, and M. Federico (2012) Wit3: web inventory of transcribed and translated talks. In Conference of European Association for Machine Translation, pp. 261–268. Cited by: 2nd item.
  • [3] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, et al. (2018) The best of both worlds: combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849. Cited by: Transformer with Convolution or RNN.
  • [4] C. Wang, M. Li, and A. J. Smola (2019) Language models with transformers. ArXiv e-prints. Cited by: Transformer with Convolution or RNN.
  • [5] C. Cherry, G. Foster, A. Bapna, O. Firat, and W. Macherey (2018) Revisiting character-based neural machine translation with capacity and compression. arXiv preprint arXiv:1808.09943. Cited by: Machine Translation with Small Granularity.
  • [6] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Cited by: Encoding Byte-Level Representation.
  • [7] J. Chung, K. Cho, and Y. Bengio (2016) A character-level decoder without explicit segmentation for neural machine translation. arXiv preprint arXiv:1603.06147. Cited by: Machine Translation with Small Granularity.
  • [8] M. R. Costa-Jussà, C. Escolano, and J. A. Fonollosa (2017) Byte-based neural machine translation. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, Cited by: Byte-based vocabulary.
  • [9] D. Gillick, C. Brunk, O. Vinyals, and A. Subramanya (2016) Multilingual language processing from bytes. In Proceedings of NAACL-HLT, Cited by: Byte-based vocabulary.
  • [10] F. Guzmán, P. Chen, M. Ott, J. Pino, G. Lample, P. Koehn, V. Chaudhary, and M. Ranzato (2019) Two new evaluation datasets for low-resource machine translation: nepali-english and sinhala-english. Cited by: 3rd item, Table 8.
  • [11] L. Kaiser, A. N. Gomez, and F. Chollet (2017) Depthwise separable convolutions for neural machine translation. arXiv preprint arXiv:1706.03059. Cited by: Encoding Byte-Level Representation.
  • [12] T. Kenter, L. Jones, and D. Hewlett (2018) Byte-level machine reading across morphologically varied languages. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Byte-based vocabulary.
  • [13] Y. Kim, Y. Jernite, D. Sontag, and A. M. Rush (2016) Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: Machine Translation with Small Granularity.
  • [14] T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 66–71. Cited by: Datasets, Machine Translation with Small Granularity.
  • [15] J. Lee, K. Cho, and T. Hofmann (2017) Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics 5, pp. 365–378. Cited by: Machine Translation with Small Granularity.
  • [16] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan (2019) Bytes are all you need: end-to-end multilingual speech recognition and synthesis with bytes. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: Byte-based vocabulary, Transformer with Convolution or RNN.
  • [17] T. Luong, R. Socher, and C. Manning (2013) Better word representations with recursive neural networks for morphology. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 104–113. Cited by: Machine Translation with Small Granularity.
  • [18] P. Michel and G. Neubig (2018) Mtnt: a testbed for machine translation of noisy text. arXiv preprint arXiv:1809.00388. Cited by: 2nd item.
  • [19] A. Mohamed, D. Okhonko, and L. Zettlemoyer (2019) Transformers with convolutional context for asr. arXiv preprint arXiv:1904.11660. Cited by: Transformer with Convolution or RNN.
  • [20] G. Neubig (2011) The Kyoto free translation task. Note: http://www.phontron.com/kftt Cited by: 2nd item.
  • [21] S. Nießen and H. Ney (2000) Improving smt quality with morpho-syntactic analysis. In Proceedings of the 18th conference on Computational linguistics-Volume 2, pp. 1081–1085. Cited by: Machine Translation with Small Granularity.
  • [22] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: Models and Learning.
  • [23] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: Inference and Evaluation.
  • [24] M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191. External Links: Link Cited by: Inference and Evaluation.
  • [25] R. Pryzant, Y. Chung, D. Jurafsky, and D. Britz (2017-10) JESC: Japanese-English Subtitle Corpus. ArXiv e-prints. External Links: 1710.10639 Cited by: 2nd item.
  • [26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: Byte-based vocabulary.
  • [27] R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Cited by: Introduction, Machine Translation with Small Granularity.
  • [28] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: Introduction.
  • [29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: Encoding Byte-Level Representation, 1st item, Models and Learning, Table 5.
  • [30] Q. Ye, S. Devendra, F. Matthieu, P. Sarguna, and N. Graham (2018) When and why are pre-trained word embeddings useful for neural machine translation. In HLT-NAACL, Cited by: 4th item.