Character-based NMT with Transformer

11/12/2019 · by Laurent Besacier, et al.

Character-based translation has several appealing advantages, but its performance is in general worse than a carefully tuned BPE baseline. In this paper we study the impact of character-based input and output with the Transformer architecture. In particular, our experiments on EN-DE show that character-based Transformer models are more robust than their BPE counterparts, both when translating noisy text and when translating text from a different domain. To obtain comparable BLEU scores on clean, in-domain data and close the gap with BPE-based models, we use known techniques to train deeper Transformer models.


1 Introduction

Character-level NMT models have some compelling characteristics. They do not suffer from the out-of-vocabulary problem and avoid tedious, language-specific pre-processing that adds yet another hyper-parameter to tune. In addition, they have been reported to be more robust when translating noisy text [15] and – when using the same architecture – are more compact to store. These two characteristics are particularly important for translating user-generated content or spoken language, which is often noisy due to transcription errors. On the downside, they tend to deliver lower translation quality than word- or Byte-Pair Encoding (BPE) [20, 8] based segmentations. In this paper we perform extensive experiments to assess the potential of character-based Transformer models in such scenarios.

For LSTM-based architectures, [10] showed that it was possible to obtain similar performance at the cost of training deeper networks. The current state-of-the-art model for NMT (and NLP in general), however, is the Transformer [44], and to our knowledge no equivalent study has been reported for that architecture. In this paper, we analyze the impact of character-level Transformer models versus BPE-based models and evaluate them on four axes:

  • translating on clean vs noisy text,

  • in-domain vs out-of-domain conditions,

  • training in low and high-resource conditions,

  • impact of different network depths.

Our experiments are for EN-DE, on news (WMT) and TED talks (IWSLT). The results show that:

  • it is possible to narrow the gap between BPE and character-level models with deeper encoders

  • character-level models are more robust to lexicographical noise than BPE models out of the box

  • character-level models cope better with test data that differs substantially from the training data

2 Related Work

Input representations When deciding what the atomic input symbols should be, the most intuitive choice is to use words as tokens. Continuous word representations [30, 36] have had a tremendous impact on NLP applications [46, 50, 42, 11]. While some of these representations exploit morphological features [7], they still face challenges due to the large vocabularies needed and to out-of-vocabulary words. To circumvent these issues, some works learn language representations directly at the character level and disregard any notion of word segmentation. This approach is attractive due to its simplicity and its ability to adapt to different languages. It has been used in a wide range of NLP tasks such as language modeling [31, 1], question answering [22] and parsing [3].

For translation, character-level models in NMT initially showed unsatisfactory performance [45, 34]. The two earliest models with positive results were [29] and [12]. They compose word representations from their constituent characters and as such require an offline segmentation step to be performed beforehand. [27] obviated this step by composing representations of "pseudo" words from characters using convolutional filters and highway layers [40]. The previous methods introduce special modifications to the NMT architecture in order to work at the character level. [10], on the other hand, uses a vanilla (LSTM-based) NMT system to achieve superior results at the character level. In a different direction, [26] proposes to dynamically learn segmentation informed by the NMT objective. The authors discovered that their model prefers to operate at (almost) the character level, providing support for purely character-based NMT.

A common approach for dealing with the open-vocabulary issue is to break up rare words into sub-word units [39, 49]. BPE [39] is the standard technique in NMT and has been applied with great success in many systems [13, 33]. BPE has a single hyperparameter: the number of merge operations. Its optimal value depends on many factors, including the NMT architecture, the characteristics of the language and the size of the training dataset. [13] explored several hyperparameter settings, including the number of BPE merge operations, to establish strong baselines for NMT with LSTM-based architectures. They recommended "32K as a generally effective vocabulary size and 16K as a contrastive condition when building systems on less than 1 million parallel sentences". [14] conducted a thorough study of the impact of the number of merge operations for both LSTM and Transformer architectures. The authors conclude that there is in fact no optimal value for LSTMs; it can be very different depending on the dataset and language pair. However, for the Transformer, the best BPE size lies between the character level and 10k merges.
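To make the role of this single hyperparameter concrete, below is a minimal sketch of the BPE merge-learning loop in the spirit of [39]. The toy corpus, function names and the value of `num_merges` are illustrative only, not taken from the paper or from any specific toolkit.

```python
import re
from collections import Counter

def pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def apply_merge(pair, vocab):
    """Merge every occurrence of `pair` into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: each word is split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10  # the single BPE hyperparameter discussed above
for _ in range(num_merges):
    stats = pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = apply_merge(best, vocab)
    print(best)
```

Setting `num_merges` to zero degenerates to a character-level vocabulary, while very large values approach a word-level vocabulary, which is why the merge count interpolates between the two representations compared in this paper.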

Deep models Until recently it was very hard to train very deep models with the standard Transformer architecture: training dynamics tended to be unstable, with degraded performance for deeper models. [17] argued that the main culprit was the then in vogue non-linearity, the sigmoid, which saturates in deep models and blocks gradient information from flowing backward. Consequently, [32] proposed the ReLU non-linearity, which is the de facto standard today. Though this simple technique allows one to train deeper models than before, it is not sufficient for very deep models with more than 30 layers. Residual connections [19] were formulated so that subsequent layers have direct access to the layer inputs in addition to the usual forward functions. This simple tweak makes it possible to train models of up to 1 000 layers, achieving state-of-the-art results on an image classification benchmark. [4] found it hard to train the Transformer with more than 10 encoder layers and proposed transparent attention, wherein encoder-decoder attention is computed over a linear combination of the outputs of all encoder layers. This alleviates the vanishing/exploding gradient problem and is sufficient to train a Transformer with a 24-layer encoder. [47] extends [4] and achieves slight but robust improvements.
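To illustrate what transparent attention computes, the PyTorch sketch below builds the weighted combination of encoder layer outputs that the decoder would attend to. It is a simplified reading of [4] (a single set of weights, whereas [4] learns one combination per decoder layer), and the class and parameter names are ours.

```python
import torch
import torch.nn as nn

class TransparentCombination(nn.Module):
    """Softmax-weighted combination of all encoder layer outputs (simplified from [4])."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One learnable scalar per encoder layer; the softmax keeps the combination convex.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq_len, d_model) tensors, one per encoder layer
        weights = torch.softmax(self.layer_logits, dim=0)           # (num_layers,)
        stacked = torch.stack(layer_outputs, dim=0)                 # (num_layers, batch, seq, d)
        return (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)     # (batch, seq, d)
```

The decoder's cross-attention reads this combination instead of the top encoder layer alone, which shortens the gradient path to the lower encoder layers and is what makes deeper encoders trainable in [4].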

Robustness Machine learning systems can be brittle: small changes to the input can lead to dramatic failures of deep learning models [41, 18]. For NMT, [5] studied robustness to lexicographical errors. They found both character and BPE models to be very sensitive to such errors, with severe degradation in performance (the out-of-the-box robustness of character models was, however, slightly better than that of BPE models). They proposed two techniques to improve the robustness of NMT models: structure-invariant word representations and training on noisy texts. These techniques are sufficient to make a character-based model simultaneously robust to multiple kinds of noise. [21] and [43] report similar findings, namely that training on a balanced diet of synthetic noise can dramatically improve robustness to synthetic noise. While [43] leverage the noise distribution of the test set, [21] does not. Dealing with noisy data for NMT can also be seen as a domain adaptation problem [24]: the main discrepancy lies between the distributions of the training and test data, also known as domain shift [37]. Many different approaches have been studied to train with multiple domains: [28] and [2] include data from the target domain in the training set directly without any modifications, [23] introduce domain tags to differentiate between domains, and finally, [52] and [9] use a topic model to add topic information about the domain during training. [6] describes the winning entry of the WMT'19 robustness challenge.

3 Representation units for Transformer

3.1 Character vs BPE models

We experimented on two language directions: German-English (DE-EN) and English-German (EN-DE). For DE-EN, we consider two settings: high resource and low resource. For the high-resource setting, we concatenate the commoncrawl [38] and Europarl [25] corpora, using the WMT 2015 news translation test set as our validation set and WMT 2016 as the test set. For the low-resource setting, we used the IWSLT 2014 corpus [16], consisting of transcriptions and translations of TED talks (https://www.ted.com/talks), with the official train, valid and test splits. For EN-DE, we used the same setup as the low-resource DE-EN setting, in the opposite direction. The IWSLT14 dataset is much smaller than the WMT corpus originally used by [44]; therefore, for the low-resource setting we use a modified version of the Transformer base architecture with approximately 50M parameters, compared to 65M for Transformer base. For the high-resource setting we use the unmodified Transformer base architecture.

The training details are as follows. Training is done on 4 GPUs with a maximum batch size of 4 000 tokens per GPU. We train for 150 and 60 epochs in the low- and high-resource settings respectively, saving a checkpoint after every epoch and averaging the 3 best checkpoints according to their perplexity on a validation set. In the low-resource setting, we test all 6 combinations of dropout and learning rate values and, using the best combination, train 5 models with different random seeds. For the high-resource setting, we tune dropout and fix the maximum learning rate; due to the significantly larger computational requirements of this dataset, we only train one model.

The average performance and standard deviation (over 5 models) on the test set are shown in Table 1. The following conclusions can be drawn from the numbers:

Vocab Size   DE-EN (low)   EN-DE (low)   DE-EN (high)
Char         33.7 ± 0.1    26.7 ± 0.1    36.3
1 000        34.0 ± 0.2    26.8 ± 0.1    –
2 000        34.4 ± 0.2    27.1 ± 0.1    –
5 000        35.0 ± 0.0    27.4 ± 0.1    36.2
10 000       34.6 ± 0.2    27.6 ± 0.1    –
20 000       30.5 ± 0.2    25.3 ± 0.1    –
30 000       28.3 ± 0.1    24.3 ± 0.1    37.2
40 000       27.1 ± 0.1    23.7 ± 0.2    –
50 000       26.2 ± 0.3    23.1 ± 0.2    –
Table 1: Impact of BPE vocabulary size on BLEU (mean ± standard deviation over 5 runs in the low-resource settings; single run for DE-EN high).
  1. Vocabulary size matters for low resource. The impact of vocabulary size is significant in the low-resource setting: BLEU scores differ by over 8 points for DE-EN and by close to 5 points for EN-DE. In the high-resource setting, the effect of vocabulary size is minimal over a large range.

  2. The optimal BPE size is small for low resource. The optimal vocabulary size is 5 000 for DE-EN and 10 000 for EN-DE. In the high-resource setting, 30K is optimal, corroborating the standard choice.

  3. Character-level models are competitive. Although the character-level models do not beat the best BPE models, they are surprisingly competitive without any modification of the architecture.

3.2 Noisy vs clean

We introduce four types of character-level synthetic noise, plus a combined condition, each applied with an associated noise probability p. We respect word boundaries by only applying noise within a word. A small sketch of this noising procedure follows the list.

  1. delete. Randomly delete a character, except for punctuation or spaces.

  2. insert. Insert a random character.

  3. replace. Replace a character with a random one.

  4. switch. Switch the positions of two consecutive characters. We do not apply this to the first and last characters of a word.

  5. all. With probability p, introduce one of the noises listed above.
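A minimal sketch of how such synthetic noise can be injected is shown below. The paper does not specify the exact sampling details (e.g., the character inventory used for insert and replace), so the helper below, its names and its alphabet are illustrative assumptions.

```python
import random
import string

ALPHABET = string.ascii_lowercase + "äöüß"  # illustrative character inventory

def noise_word(word: str, kind: str, p: float) -> str:
    """Apply character-level noise inside a single word, position by position."""
    chars = list(word)
    out = []
    for i, c in enumerate(chars):
        if random.random() >= p or not c.isalpha():
            out.append(c)                              # keep the character unchanged
            continue
        if kind == "delete":
            continue                                   # drop the character
        if kind == "insert":
            out.extend([c, random.choice(ALPHABET)])   # insert a random character after c
        elif kind == "replace":
            out.append(random.choice(ALPHABET))        # replace c with a random character
        elif kind == "switch" and 0 < i < len(chars) - 1:
            out.append(chars[i + 1])                   # swap c with the following character
            chars[i + 1] = c
        else:
            out.append(c)
    return "".join(out)

def noise_sentence(sentence: str, kind: str, p: float) -> str:
    # Word boundaries are respected: noise is applied within each word only.
    return " ".join(noise_word(w, kind, p) for w in sentence.split())

print(noise_sentence("die Familie wohnt in Berlin", "switch", 0.1))
```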

For DE-EN, we also experiment with natural noise. We follow [5] and use their dataset of naturally occurring noise in German (https://github.com/ybisk/charNMT-noise/blob/master/noise/de.natural). It combines two projects: the RWSE Wikipedia Revision Dataset [51] and the MERLIN corpus of language learners [48]. These corpora were created to measure spelling difficulty and consist of word lists in which a correct German word is associated with a list of common mistakes. For example, the word "Familie" can be replaced in our natural test set by "Famielie", "Fammilie", etc.

For each noise type, we create ten different noisy versions of the test set with increasing noise probabilities (the ranges differ for synthetic and natural noise). Note that the natural noise test set does not have all of its tokens transformed: a majority of words have no naturally occurring spelling error in the dataset of [5]. We then compute the BLEU score on the noisy test data for each noise probability p and for models trained with different vocabulary sizes. A representative plot is shown in Fig. 1 for character insertion, where each line corresponds to one vocabulary size.

Figure 1: Degradation of translation quality with increasing noise (character insertion). The slope of each curve (its sensitivity) is smaller for the character-level model, showing that it is more robust here.

We calculate the BLEU scores on noisy test sets with different noise probabilities and, for each data series, fit a linear regression:

    BLEU(p) = s · p + b    (1)

where p is the noise probability, the slope s is the "sensitivity" of the NMT system to that type of noise and b is the intercept. A slope s closer to 0 means that the system is more robust to that kind of noise, while s = −1 indicates that for each additional percentage point of noise the system loses 1 BLEU point. These slopes are plotted against the vocabulary size in Fig. 2 (a minimal sketch of the sensitivity fit is given after the list of observations below). We conclude from this:

(a) DE-EN (low)
(b) EN-DE (low)
(c) DE-EN (high)
Figure 2: Noise sensitivity vs vocab size for models trained on clean data. Character-level models are shown with zero vocabulary size. Sensitivity values closer to zero mean that the model is more robust to that kind of noise.
Figure 3: BLEU scores for DE-EN models trained and tested on different noise conditions. The first (orange) column refers to training on clean data and testing on noised data; the second (blue) to training and testing on matched noise (which is the same condition for the clean group).
Figure 4: BLEU scores for EN-DE models trained and tested on different noise conditions. The first (orange) column refers to training on clean data and testing on noised data; the second (blue) to training and testing on matched noise (which is the same condition for the clean group).
  1. Degradation with noise. Out of the box, both BPE and character level models are very sensitive to lexicographic noise. BPE models lose as much as 2 BLEU points for each percentage increase in noise, whereas character level models lose as much as 1.5 BLEU.

  2. Behaviour of different noises. BPE models are roughly equally sensitive to all kinds of synthetic noise. Character-level models are more sensitive to certain kinds of noise than others: they are relatively robust to switch, approximately equally robust to delete and insert, and least robust to replace. We hypothesize that this could be because switch only changes the positional encodings locally while the content embeddings remain intact, whereas replace preserves the positional encodings but changes the content embeddings. Sensitivity to natural noise is much smaller than to synthetic noise overall, probably because increasing the noise level has no effect on words that are not listed in the dataset of [5].

  3. Character-level models are more robust. For each kind of noise, character-level models are less sensitive than all of the BPE models. They are particularly robust to switch, where they are more than twice as robust as the best-performing BPE models. Though character-level models start out on a worse footing than the best BPE models, after applying only 1-2% of synthetic noise to the test set, character-level models perform better.
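The sensitivity values above come from a straightforward least-squares fit of Eq. (1). A minimal numpy sketch is given below; the BLEU numbers in it are made up purely for illustration and are not results from the paper.

```python
import numpy as np

# Noise probabilities (in %) and the corresponding BLEU scores of one model on the
# noised test sets; these values are illustrative only.
noise_pct = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
bleu = np.array([34.1, 32.3, 30.8, 28.9, 27.4, 25.6, 24.1, 22.2, 20.9, 19.0, 17.6])

# Fit BLEU(p) = s * p + b; the slope s is the "sensitivity" of Eq. (1).
s, b = np.polyfit(noise_pct, bleu, deg=1)
print(f"sensitivity s = {s:.2f} BLEU per % of noise, intercept b = {b:.2f}")
```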

We also experiment with a simple method to robustify models against noise: we introduce the aforementioned noise types into the training data as well, and then test on both clean and noisy data. In consideration of time and computational resources, we choose two representative BPE vocabularies – 5 000 as a small vocabulary and 30 000 as a large one – and also train character-level models on noisy data. We only consider synthetic character-level noise, with a fixed noise probability, in the low-resource setting. The results are shown in Fig. 3 and 4 for DE-EN and EN-DE respectively; each group of two bars represents training on clean vs. matched noise, for 3 different vocabulary sizes and 6 different noise types. The full results of these experiments are in Appendix B. The following conclusions are apparent.

  1. Adding noise helps. Training on a similar type of noisy data improves performance for all vocabularies. (We also observed that certain kinds of noise improve robustness to other noises as well; results with unmatched train and test conditions are not reported here.)

  2. BPE is as robust as character. By training on similar kinds of noise in the training data, we are able to robustify BPE models to the same level as character level models without sacrificing too much performance on the clean test set.

  3. Effect on clean data. We observed (see Tables 6 and 7 in the Appendix) that for small vocabularies (character-level and BPE 5 000), training with noise in the training data had a small detrimental effect when testing on clean data. However, in the case of BPE 30 000, training on noisy data significantly boosted performance (e.g., an improvement of 6 BLEU for DE-EN and 1.7 for EN-DE when training with delete and testing on clean data). We hypothesize that the increased diversity of tokens during training (due to the presence of noise) acts as a regularizer, boosting performance on the test set.

3.3 In-domain vs out-of-domain

We test the low- and high-resource models on the following in-domain and out-of-domain datasets:

  1. newstest 2016. News text from the WMT 2016 news translation task.

  2. WMT Biomedical. Medline abstracts from the WMT 2018 biomedical translation task.

  3. WMT-IT. Hardware and software troubleshooting answers from the WMT 2016 IT domain translation task.

  4. Europarl. The first 3 000 sentences from the Europarl corpus [25] (proceedings of the European Parliament).

  5. commoncrawl. The first 3 000 sentences from the commoncrawl parallel text corpus [38].

                        DE-EN (low)        EN-DE (low)        DE-EN (high)
Dataset        # Sents  % Unseen  PPL      % Unseen  PPL      % Unseen  PPL
IWSLT14        6 750    4.4       583      2.2       282      2.0       740
WMT-IT         2 000    14.4      2 540    13.2      2 322    5.8       996
WMT-Bio.       321      20.0      5 540    12.1      3 035    9.2       3 404
newstest 2016  2 999    12.7      2 712    9.0       1 659    4.5       1 703
Europarl       3 000    9.0       1 765    4.4       771      0.0       10
commoncrawl    3 000    17.8      5 024    12.6      2 711    0.0       9
avg            3 011    13.0      3 022    8.9       1 797    3.6       1 144
Table 2: Similarity metrics between test sets and training sets.

We provide two similarity metrics between the training and test sets in Table 2. "% Unseen" is the percentage of words in the test set that are not present in the training corpus. "PPL" is the perplexity of the test set under a language model trained on the training data; we used the KenLM toolkit (https://github.com/kpu/kenlm) with Kneser-Ney smoothing [35] and a context size of 4.
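A minimal sketch of how these two similarity metrics can be computed is shown below. It assumes the kenlm Python bindings are installed and that an order-4 ARPA model has already been trained on the training corpus; all file names are placeholders.

```python
import kenlm  # assumes the kenlm Python module is installed

def pct_unseen(train_path: str, test_path: str) -> float:
    """Percentage of test-set word tokens never seen in the training corpus."""
    with open(train_path, encoding="utf-8") as f:
        train_vocab = {w for line in f for w in line.split()}
    test_tokens = [w for line in open(test_path, encoding="utf-8") for w in line.split()]
    unseen = sum(1 for w in test_tokens if w not in train_vocab)
    return 100.0 * unseen / len(test_tokens)

def perplexity(arpa_path: str, test_path: str) -> float:
    """Perplexity of the test set under a KenLM n-gram model."""
    model = kenlm.Model(arpa_path)
    log10_prob, n_tokens = 0.0, 0
    for line in open(test_path, encoding="utf-8"):
        sent = line.strip()
        log10_prob += model.score(sent, bos=True, eos=True)  # total log10 probability
        n_tokens += len(sent.split()) + 1                     # +1 for the </s> token
    return 10.0 ** (-log10_prob / n_tokens)

print(pct_unseen("train.de", "newstest2016.de"))
print(perplexity("train.de.4gram.arpa", "newstest2016.de"))
```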

We show results in Figure 5 and conclude the following:

  1. DE-EN low resource. Character-level models are better on all out-of-domain datasets except Europarl. Recall from Table 2 that Europarl also has the lowest proportion of unseen words. This suggests that, in this low-resource setting, character-level models outperform BPE when evaluated on data sufficiently different from the training domain.

  2. DE-EN high resource. Character level models are now only better when testing on the WMT-Biomedical test set. We see from Table 2 that it also has the largest proportion of unseen words. For all other test sets, BPE 30 000 leads to the best BLEU scores.

  3. EN-DE low resource. We see similar behaviour on in-domain and out-of-domain data: BPE models that are good on the in-domain test set remain better on the out-of-domain test sets. A possible explanation is the lower proportion of unseen words compared to the German direction, so the test words have been seen more frequently in the training corpus.

(a) DE-EN (low)
(b) EN-DE (low)
(c) DE-EN (high)
Figure 5: BLEU scores for different vocabularies on test sets from different domains.

3.4 Deeper character-based Transformers

For other architectures, training deeper models had a very positive impact on character-based translation [10]; however, no similar study has been reported for the Transformer. Due to computational constraints, we experiment only on the DE-EN language pair, in the low- and high-resource settings, and train only one model per configuration.

Figure 6: Degradation observed with the standard Transformer architecture when going from 10 to 12 layers.

3.4.1 Low resource

We train models with 6 to 16 encoder layers for character level and BPE 5 000, the best-performing vocabulary size in our preliminary experiments. We fix the learning rate and tune dropout. First, we do not perform any modification of the Transformer architecture; in particular, this means that layer normalization takes place after each sub-layer (post-normalization). To train deeper models, following [47], we instead place the layer normalization step before each sub-layer (pre-normalization), and also experiment with transparent attention [4]. A small sketch of the two normalization variants is given below.
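The difference between the two layer-normalization placements fits in a few lines. The PyTorch sketch below is ours and only illustrates the residual/normalization order; it is not the full implementation used for the experiments.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Wrap a sub-layer (self-attention or feed-forward) with residual + LayerNorm.

    post-norm (standard Transformer): LayerNorm(x + Sublayer(x))
    pre-norm  (used for deep models): x + Sublayer(LayerNorm(x))
    """

    def __init__(self, d_model: int, dropout: float = 0.1, pre_norm: bool = True):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.pre_norm = pre_norm

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        if self.pre_norm:
            # Normalizing before the sub-layer keeps the residual path identity-like,
            # which is what allows encoders with 30+ layers to be trained.
            return x + self.dropout(sublayer(self.norm(x)))
        return self.norm(x + self.dropout(sublayer(x)))
```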

In contrast to [10], we see a degradation of performance with increasing depth under post-normalization (Figure 6 illustrates this for the standard Transformer architecture when going from 10 to 12 layers), but the simple trick of switching the order of layer normalization is sufficient to train models with up to 32 encoder layers. We therefore report only results with pre-normalization in Table 3. While adding transparent attention is beneficial at almost all depths for the character-level models, it gives mixed results for the BPE 5 000 model. Overall, by training deeper models we are able to marginally improve performance for both vocabularies: character-level models improve by 1 BLEU point (from 33.7 to 34.7), while the gain for BPE 5 000 is a more modest 0.4 BLEU points (from 35.0 to 35.4). We are able to narrow, but not close, the gap between character and BPE.

           Char             5 000
enc   PreN   PreN+T    PreN   PreN+T
6     33.4   33.2      34.6   34.6
12    33.8   34.5      34.8   34.8
16    33.5   34.5      35.2   34.9
20    34.4   34.7      35.3   34.9
24    34.1   34.7      35.1   35.4
28    34.4   34.3      35.0   35.0
32    34.1   34.5      34.7   35.2
Table 3: Results (BLEU) for the low-resource setting. "PreN+T" refers to an architecture with layer normalization before each sub-layer and transparent attention.

3.4.2 High resource

In light of the aforementioned experiments, we no longer train models with post-layer normalization and restrict ourselves to pre-layer normalization and transparent attention. Here, we also experiment with the BPE 30 000 vocabulary. The results are shown in Table 4. Again we see an improvement of 1-2 BLEU points when going beyond 6 encoder layers. Transparent attention helps consistently for character-level models but barely changes anything for the two BPE models. Furthermore, with increased depth, BPE 5 000 and 30 000 perform similarly, in contrast to the shallow models, where there is a 1 BLEU difference. However, character-level models are still slightly worse than the BPE models, with a maximum score of 37.7 versus 38.0 for the BPE models.

           Char             5 000            30 000
enc   PreN   PreN+T    PreN   PreN+T    PreN   PreN+T
6     36.3   36.5      36.2   36.4      37.2   36.9
12    36.8   37.5      37.1   36.9      37.4   37.5
18    37.3   37.4      37.6   37.7      37.9   37.8
24    37.7   37.7      37.6   37.6      37.8   37.8
32    37.2   37.4      37.9   38.0      38.0   37.9
Table 4: Results (BLEU) for the high-resource setting.

4 Discussion and recommendations

Vocabulary Size. We observed that in the low resource setting, BPE vocab size can be a very important parameter to tune, having a large impact on BLEU. However, the effect vanishes for the high resource setting, where performance is similar for a large range of vocabulary sizes. Character level models also tend to be competitive with BPE.

Lexicographical noise. When trained on clean data, character-based models are more robust to natural and synthetic lexicographical noise than BPE-based models (these results confirm a trend already observed in [5]); however, the trend fades away when a similar kind of noise is also introduced into the training data. Surprisingly, we observed that noise in the training data can act as a regularizer for the large BPE vocabularies (breaking up large tokens into smaller ones) and improve results on clean inputs.

Domain shift. For DE-EN in the low-resource setting, character-based models give better results on 4 of the 5 out-of-domain datasets: character-level models outperform BPE when evaluated on data sufficiently different from the training domain, while no significant differences are observed for EN-DE. In DE-EN (high resource), character-level models are only better on the WMT-Biomedical test set, which has the largest proportion of unseen words.

Deep models. In contrast to Cherry et al. (2018) [10], we see a degradation of performance with increasing depth without any modification of the Transformer architecture. We can train deeper and more efficient character-based Transformers by switching the order in which layer normalization is applied; doing so, we can train models with up to 32 encoder layers. We are able to narrow, but not close, the gap between character and BPE. These tricks may also hold for other use cases where longer input sequences are needed (for instance, document-level NMT).

5 Conclusion

In this work, we have studied the characteristics of different representation units in NMT including character-level models and BPE models with different vocabulary sizes. We observed that different representations can have very different behaviours with distinct advantages and disadvantages. In the future, we would like to investigate methods to combine different representations in order to get the best of all worlds.

References

  • [1] R. Al-Rfou, D. Choe, N. Constant, M. Guo, and L. Jones (2018) Character-level language modeling with deeper self-attention. CoRR abs/1808.04444. External Links: Link, 1808.04444 Cited by: §2.
  • [2] A. Axelrod, X. He, and J. Gao (2011) Domain adaptation via pseudo in-domain data selection. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, Stroudsburg, PA, USA, pp. 355–362. External Links: ISBN 978-1-937284-11-4, Link Cited by: §2.
  • [3] M. Ballesteros, C. Dyer, and N. A. Smith (2015) Improved transition-based parsing by modeling characters instead of words with lstms. CoRR abs/1508.00657. External Links: Link, 1508.00657 Cited by: §2.
  • [4] A. Bapna, M. X. Chen, O. Firat, Y. Cao, and Y. Wu (2018) Training deeper neural machine translation models with transparent attention. CoRR abs/1808.07561. External Links: Link, 1808.07561 Cited by: §2, §3.4.1.
  • [5] Y. Belinkov and Y. Bisk (2017) Synthetic and natural noise both break neural machine translation. CoRR abs/1711.02173. External Links: Link, 1711.02173 Cited by: §2, item 2, §3.2, §3.2, §4.
  • [6] A. Bérard, I. Calapodescu, and C. Roux (2019) Naver labs europe’s systems for the wmt19 machine translation robustness task. arXiv preprint arXiv:1907.06488. Cited by: §2.
  • [7] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. Cited by: §2.
  • [8] J. Bradbury, S. Merity, C. Xiong, and R. Socher (2016) Quasi-recurrent neural networks. CoRR abs/1611.01576. External Links: Link, 1611.01576 Cited by: §1.
  • [9] W. Chen, E. Matusov, S. Khadivi, and J. Peter (2016) Guided alignment training for topic-aware neural machine translation. CoRR abs/1607.01628. External Links: Link, 1607.01628 Cited by: §2.
  • [10] C. Cherry, G. Foster, A. Bapna, O. Firat, and W. Macherey (2018) Revisiting character-based neural machine translation with capacity and compression. CoRR abs/1808.09943. External Links: Link, 1808.09943 Cited by: §1, §2, §3.4.1, §3.4.
  • [11] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. CoRR abs/1705.02364. External Links: Link, 1705.02364 Cited by: §2.
  • [12] M. R. Costa-jussà and J. A. R. Fonollosa (2016) Character-based neural machine translation. CoRR abs/1603.00810. External Links: Link, 1603.00810 Cited by: §2.
  • [13] M. J. Denkowski and G. Neubig (2017) Stronger baselines for trustable results in neural machine translation. CoRR abs/1706.09733. External Links: Link, 1706.09733 Cited by: §2.
  • [14] S. Ding, A. Renduchintala, and K. Duh (2019) A call for prudent choice of subword merge operations. CoRR abs/1905.10453. External Links: Link, 1905.10453 Cited by: §2.
  • [15] N. Durrani and P. Nakov (2018) What is in a translation unit? comparing character and subword representations beyond translation. External Links: Link Cited by: §1.
  • [16] M. Federico, S. Stücker, and F. Yvon (2014-12) International workshop on spoken language translation. In Proceedings of the International Workshop on Spoken Language Translation, Lake Tahoe, CA, USA. Cited by: §3.1.
  • [17] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Y. W. Teh and M. Titterington (Eds.), Proceedings of Machine Learning Research, Vol. 9, Chia Laguna Resort, Sardinia, Italy, pp. 249–256. External Links: Link Cited by: §2.
  • [18] I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §2.
  • [20] B. Heinzerling and M. Strube (2017) BPEmb: tokenization-free pre-trained subword embeddings in 275 languages. CoRR abs/1710.02187. External Links: Link, 1710.02187 Cited by: §1.
  • [21] V. Karpukhin, O. Levy, J. Eisenstein, and M. Ghazvininejad (2019) Training on synthetic noise improves robustness to natural noise in machine translation. CoRR abs/1902.01509. External Links: Link, 1902.01509 Cited by: §2.
  • [22] T. Kenter, L. Jones, and D. Hewlett (Eds.) (2018) Byte-level machine reading across morphologically varied languages. External Links: Link Cited by: §2.
  • [23] C. Kobus, J. M. Crego, and J. Senellart (2016) Domain control for neural machine translation. CoRR abs/1612.06140. External Links: Link, 1612.06140 Cited by: §2.
  • [24] P. Koehn and R. Knowles (2017-08) Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver, pp. 28–39. External Links: Link, Document Cited by: §2.
  • [25] P. Koehn (2004-11) EuroParl: a parallel corpus for statistical machine translation. 5, pp. . Cited by: item 4, §3.1.
  • [26] J. Kreutzer and A. Sokolov (2018) Learning to segment inputs for NMT favors character-level processing. CoRR abs/1810.01480. External Links: Link, 1810.01480 Cited by: §2.
  • [27] J. Lee, K. Cho, and T. Hofmann (2016) Fully character-level neural machine translation without explicit segmentation. CoRR abs/1610.03017. External Links: Link, 1610.03017 Cited by: §2.
  • [28] M. Li, Y. Zhao, D. Zhang, and M. Zhou (2010-08) Adaptive development data selection for log-linear model in statistical machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China, pp. 662–670. External Links: Link Cited by: §2.
  • [29] W. Ling, I. Trancoso, C. Dyer, and A. W. Black (2015) Character-based neural machine translation. CoRR abs/1511.04586. External Links: Link, 1511.04586 Cited by: §2.
  • [30] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 3111–3119. Cited by: §2.
  • [31] T. Mikolov, I. Sutskever, A. Deoras, H. Le, S. Kombrink, and J. Cernocký (2011) SUBWORD language modeling with neural networks. Cited by: §2.
  • [32] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, USA, pp. 807–814. External Links: ISBN 978-1-60558-907-7, Link Cited by: §2.
  • [33] T. Nakazawa and I. Goto (2017-11) Proceedings of the 4th workshop on Asian translation (WAT2017). Asian Federation of Natural Language Processing, Taipei, Taiwan. External Links: Link Cited by: §2.
  • [34] G. Neubig, T. Watanabe, S. Mori, and T. Kawahara (2013-06) Substring-based machine translation. Machine Translation 27 (2), pp. 139–166. External Links: ISSN 0922-6567, Link, Document Cited by: §2.
  • [35] H. Ney, U. Essen, and R. Kneser (1994) On structuring probabilistic dependencies in stochastic language modelling. Computer Speech and Language 8, pp. 1–38. Cited by: §3.3.
  • [36] J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In In EMNLP, Cited by: §2.
  • [37] J. Quionero-Candela, M. Sugiyama, A. Schwaighofer, and N. D. Lawrence (2009) Dataset shift in machine learning. The MIT Press. External Links: ISBN 0262170051, 9780262170055 Cited by: §2.
  • [38] J. R. Smith, H. Saint-Amand, M. Plamada, P. Koehn, C. Callison-Burch, and A. Lopez (2013-07) Dirt cheap web-scale parallel text from the common crawl. Vol. 1, pp. . External Links: Document Cited by: item 5, §3.1.
  • [39] R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. CoRR abs/1508.07909. External Links: Link, 1508.07909 Cited by: §2.
  • [40] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. CoRR abs/1505.00387. External Links: Link, 1505.00387 Cited by: §2.
  • [41] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • [42] M. Tan, B. Xiang, and B. Zhou (2015) LSTM-based deep learning models for non-factoid answer selection. CoRR abs/1511.04108. External Links: Link, 1511.04108 Cited by: §2.
  • [43] Vaibhav, S. Singh, C. Stewart, and G. Neubig (2019) Improving robustness of machine translation with synthetic noise. CoRR abs/1902.09508. External Links: Link, 1902.09508 Cited by: §2.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762. External Links: Link, 1706.03762 Cited by: §1, §3.1.
  • [45] D. Vilar, Jan-T. Peter, and H. Ney (2007) Can we translate letters?. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT ’07, Stroudsburg, PA, USA, pp. 33–39. External Links: Link Cited by: §2.
  • [46] P. Wang, Y. Qian, F. K. Soong, L. He, and H. Zhao (2015) Part-of-speech tagging with bidirectional long short-term memory recurrent neural network. CoRR abs/1510.06168. External Links: Link, 1510.06168 Cited by: §2.
  • [47] Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and L. S. Chao (2019) Learning deep transformer models for machine translation. CoRR abs/1906.01787. External Links: Link, 1906.01787 Cited by: §2, §3.4.1.
  • [48] K. Wisniewski, K. Schöne, L. Nicolas, C. Vettori, A. Boyd, D. Meurers, A. Abel, and J. Hana (2013-10) MERLIN: an online trilingual learner corpus empirically grounding the european reference levels in authentic learner data. pp. . Cited by: §3.2.
  • [49] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. External Links: Link, 1609.08144 Cited by: §2.
  • [50] I. Yamada, H. Shindo, H. Takeda, and Y. Takefuji (2017) Learning distributed representations of texts and entities from knowledge base. CoRR abs/1705.02494. External Links: Link, 1705.02494 Cited by: §2.
  • [51] T. Zesch (2012-04) Measuring contextual fitness using error contexts extracted from the Wikipedia revision history. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, pp. 529–538. External Links: Link Cited by: §3.2.
  • [52] J. Zhang, L. Li, A. Way, and Q. Liu (2016-12) Topic-informed neural machine translation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 1807–1817. External Links: Link Cited by: §2.

Appendix A Training Details

Hyperparameter                        Transformer Base   Our Version
Encoder embedding dimension           512                512
Encoder feed-forward dimension        2048               1024
Encoder layers                        6                  6
Encoder attention heads               8                  4
Decoder embedding dimension           512                512
Decoder feed-forward dimension        2048               1024
Decoder layers                        6                  6
Decoder attention heads               8                  4
Share encoder-decoder embeddings      X
Table 5: Hyperparameter settings for the modified Transformer architecture.

The training details for the low-resource setting are as follows. Training is done on 4 GPUs with a maximum batch size of 4 000 tokens (per GPU). We train for 150 epochs, saving a checkpoint after every epoch, and average the 3 best checkpoints according to their perplexity on a validation set. For each setting, we test all 6 combinations of dropout and learning rate values and, using the best combination, train 5 models with different random seeds.

The training details for the high-resource setting are as follows. Training is done on 4 GPUs with a maximum batch size of 3 500 tokens (per GPU). We train for 60 epochs, saving a checkpoint after every epoch, and average the 3 best checkpoints according to their perplexity on a validation set. For each setting, we test 3 dropout values and fix the maximum learning rate. Due to the significantly larger computational requirements of this dataset, we only train one model per configuration.
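Checkpoint averaging as described above can be done by averaging the saved parameter tensors directly. The PyTorch sketch below is a minimal version of this step; the file names are placeholders and the helper assumes each checkpoint was saved as a plain state dict of a model with the same architecture.

```python
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of checkpoints saved with torch.save(model.state_dict(), path)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# Usage: average the 3 checkpoints with the best validation perplexity, then evaluate.
# model.load_state_dict(average_checkpoints(["ckpt12.pt", "ckpt17.pt", "ckpt19.pt"]))
```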

Appendix B Robustness to Noise

(a) delete
(b) insert
(c) replace
(d) switch
(e) all
(f) natural
Figure 7: Degradation of translation quality with lexicographical noise for model trained on DE-EN language pair in the low resource setting. Character level model is shown in light green (best viewed in color).
(a) delete
(b) insert
(c) replace
(d) switch
(e) all
Figure 8: Degradation of translation quality with lexicographical noise for model trained on EN-DE language pair in the low resource setting. Character level model is shown in light green (best viewed in color).
(a) delete
(b) insert
(c) replace
(d) switch
(e) all
(f) natural
Figure 9: Degradation of translation quality with lexicographical noise for model trained on DE-EN language pair in the high resource setting. Character level model is shown in light green (best viewed in color).

We include here the full tables of training and testing on clean and noised datasets. A summary of those tables can be found in Figure 3 and 4 in the main part of the paper.

train \test clean all delete insert replace switch avg
clean 34.1 27.6 27.9 27.7 25.4 30.6 28.9
all 32.7 33.1 30.4 32.1 28.1 32.4 31.5
delete 33.0 28.1 30.7 25.9 24.6 30.6 28.8
insert 32.7 28.8 25.5 32.4 25.7 30.2 29.2
replace 32.9 29.7 28.2 30.3 31.1 30.2 30.4
switch 33.1 28.4 26.8 28.0 24.8 32.9 29.0
Character
train \test clean all delete insert replace switch avg
clean 35.0 23.5 24.9 23.5 22.7 25.0 25.8
all 34.4 32.9 32.0 33.4 31.6 33.6 33.0
delete 34.3 29.6 32.7 27.3 25.5 31.1 30.1
insert 34.4 30.9 29.1 33.8 30.4 30.8 31.6
replace 34.0 30.7 29.2 32.1 31.9 29.9 31.3
switch 34.1 29.3 29.2 28.0 25.2 33.7 29.9
5,000
train \test clean all delete insert replace switch avg
clean 28.2 17.3 18.2 17.0 17.9 19.0 19.6
all 34.0 32.0 31.5 32.3 30.2 32.9 32.2
delete 34.2 28.1 32.5 25.5 24.2 29.3 29.0
insert 34.2 30.4 28.8 32.9 29.6 29.7 30.9
replace 33.8 30.4 28.8 31.5 30.9 29.6 30.8
switch 34.2 28.0 28.2 25.1 23.2 33.7 28.7
30,000
Table 6: BLEU scores for DE-EN models trained and tested on different noises.
train \test clean all delete insert replace switch avg
clean 26.9 21.2 21.3 21.4 19.3 22.5 22.1
all 25.7 25.3 24.1 26.2 24.2 25.5 25.2
delete 25.3 22.0 24.9 20.2 19.8 23.2 22.6
insert 25.9 22.8 20.7 26.5 21.1 22.7 23.3
replace 26.0 23.5 21.9 24.3 25.3 22.8 24.0
switch 25.6 22.1 21.9 21.5 19.5 25.9 22.8
Character
train \test clean all delete insert replace switch avg
clean 27.4 17.6 18.0 17.5 17.2 17.4 19.2
all 25.9 25.4 24.7 26.0 24.9 25.5 25.4
delete 26.1 22.1 25.5 20.1 19.8 22.9 22.8
insert 26.2 23.7 21.9 26.6 23.4 22.6 24.1
replace 26.0 24.1 22.1 25.5 25.4 22.5 24.3
switch 26.3 22.4 22.6 20.9 19.9 26.2 23.1
5,000
train \test clean all delete insert replace switch avg
clean 24.5 14.9 15.4 14.3 14.5 14.8 16.4
all 26.2 25.2 25.1 25.7 24.6 25.7 25.4
delete 26.2 21.6 25.6 19.7 19.3 22.1 22.4
insert 26.3 23.8 21.9 26.2 23.7 22.4 24.1
replace 26.2 23.8 22.1 24.9 25.2 22.4 24.1
switch 26.6 21.7 21.5 19.5 19.0 26.4 22.5
30,000
Table 7: BLEU scores for EN-DE models trained and tested on different noises.