While achieving state-of-the-art results, Neural Machine Translation (NMT) [Sutskever et al.2014, Bahdanau et al.2015, Luong et al.2015, Vaswani et al.2017] systems operate under a common constraint: they can only generate a closed set of symbols. Systems with large vocabularies are hard to fit into GPU memory for training, as the word embedding is generally the most parameter-dense component of the NMT architecture. For that reason, subword methods, such as Byte-Pair Encoding (BPE) [Sennrich et al.2016], are widely used for building NMT systems. The general idea of these methods is to exploit the pre-defined vocabulary space optimally by performing a minimal number of word segmentations on the training set.
However, little existing literature carefully examines best practices for applying subword methods. As hyper-parameter search is expensive, there is a tendency to simply reuse existing recipes. This is especially true for the number of BPE merge operations, even though this configuration directly controls the granularity of the segmentation of the training corpus and thus has a direct influence on final system performance. Prior to this work, Denkowski and Neubig (2017) recommended 32k BPE merge operations in their work on trustable baselines for NMT, while Cherry et al. (2018) contradicted that study by showing that character-based models outperform 32k BPE. Both of these studies are based on LSTM-based architectures [Sutskever et al.2014, Bahdanau et al.2015, Luong et al.2015]. To the best of our knowledge, no work has looked into the same problem extensively for the Transformer architecture. (For reference, the original Transformer paper by Vaswani et al. (2017) used a number of BPE merge operations that resulted in a 37k joint vocabulary.)
In this paper, we aim to provide guidance for this hyper-parameter choice by examining the interaction between MT system performance and the choice of BPE merge operations in the low-resource setting. We conjecture that lower-resource systems are more prone to the performance variance introduced by this choice, and that the effect may vary with the choice of model architecture and language. To verify this, we conduct experiments with 5 different architecture setups on 4 language pairs from the IWSLT 2016 dataset. In general, we discover that there is no typical optimal choice of merge operations for LSTM-based architectures, but for Transformer architectures the optimal choice lies between 0 and 4k, and systems using the traditional 32k merge operations can lose as much as 4 BLEU points compared to the optimal choice.
2 Related Work
Currently, the most common subword methods are BPE [Sennrich et al.2016], wordpiece [Wu et al.2016] and subword regularization [Kudo2018]. Subword regularization introduces a Bayesian sampling method that incorporates more segmentation variety into the training corpus, thereby improving a system's ability to handle segmentation ambiguity. However, the effect of this method has not been thoroughly tested. In this work we focus on the BPE/wordpiece method. Because the two methods are very similar, throughout the rest of the paper we refer to the BPE/wordpiece method as the BPE method unless otherwise specified.
To the best of our knowledge, no prior work systematically reports findings for a wide range of systems covering different architectures and both translation directions of multiple language pairs. While some work has conducted experiments with different BPE settings, these are generally very limited in the range of configurations explored. For example, Sennrich et al. (2016), the original paper that proposed the BPE method, compared system performance using 60k separate BPE and 90k joint BPE. They found 90k to work better and used that setting for their subsequent winning WMT 2017 news translation shared task submission [Sennrich et al.2017]. Wu et al. (2016), on the other hand, found 8k–32k merge operations to achieve optimal BLEU scores for the wordpiece method. Denkowski and Neubig (2017) explored several hyper-parameter settings, including the number of BPE merge operations, to establish strong baselines for NMT on LSTM-based architectures. While Denkowski and Neubig (2017) showed that BPE models are clearly better than word-level models, their experiments with 16k and 32k BPE configurations did not show much difference. They therefore recommended “32K as a generally effective vocabulary size and 16K as a contrastive condition when building systems on less than 1 million parallel sentences”. However, while studying deep character-based LSTM translation models, Cherry et al. (2018) also ran experiments for BPE configurations between 0 and 32k, and found that system performance deteriorates as the number of BPE merge operations increases. Recently, Renduchintala et al. (2018) also showed that it is important to tune the number of BPE merge operations, finding no typical optimal BPE configuration for their LSTM-based architecture when experimenting over several language pairs in a low-resource setting. It should be noted that the results of the above studies contradict each other, and there is still no clear consensus on best practices for BPE application. Moreover, all the work surveyed above was done with LSTM-based architectures. To this day, we are not aware of any work that has explored the interaction of BPE with the Transformer architecture.
To give readers a better picture of current practice, we gathered all 44 papers accepted to the research track of the Conference on Machine Translation (WMT) in 2017 and 2018. We count different configurations used in a single paper as separate data points. Hence, after removing 8 papers for which BPE is irrelevant, we still obtain 42 data points, shown in Figure 1. It first came to our attention that 32k and 90k are the most commonly chosen numbers of BPE merge operations in these papers, followed by 30k, 40k and 16k. However, upon closer examination, we realized that most papers using 90k were following the configuration of Sennrich et al. (2017), the winning NMT system in the WMT news translation shared task, although this setup became less popular in 2018. On the other hand, although we are unable to identify a clear trend-setter, 32k and nearby values such as 30k and 40k have remained a common choice. This survey supports our initial claim that we as a community have not yet systematically investigated the full range of BPE merge operations used in our experiments.
3 Analysis Setup
Our goal is to compare the impact of different numbers of BPE merge operations across multiple language pairs and multiple NMT architectures. We experiment with the following BPE merge operation setups: 0 (character-level), 0.5k, 1k, 2k, 4k, 8k, 16k, and 32k, on both translation directions of 4 language pairs and 5 architectures. Additionally, we include 6 more language pairs (with 2 architectures) to study the interaction between linguistic attributes and BPE merge operations.
Our experiments are conducted with all the data from the IWSLT 2016 shared task, covering translation of English (en) from and into Arabic (ar), Czech (cs), French (fr) and German (de). For each language pair, we concatenate all the datasets marked as dev as our development set and those marked as test as our test set. To increase language coverage, we also conducted extra experiments with 6 more language pairs from the 58-language-to-English TED corpus [Qi et al.2018]. These extra language pairs also translate either into or out of English, covering Brazilian Portuguese (pt), Hebrew (he), Russian (ru), Turkish (tr), Polish (pl) and Hungarian (hu). All data are tokenized and truecased using the accompanying scripts from the Moses decoder [Koehn et al.2007] before training and applying BPE models. (Data processing scripts will be released soon.)
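For concreteness, the following is a minimal sketch of this preprocessing step using sacremoses, a Python port of the Moses scripts (we used the original Moses perl scripts in our experiments; file names here are placeholders):

```python
# Hedged preprocessing sketch with sacremoses (a Python port of the Moses
# tokenizer/truecaser scripts); file names are placeholders.
from sacremoses import MosesTokenizer, MosesTruecaser

mt = MosesTokenizer(lang="en")
mtc = MosesTruecaser()

# Tokenize the training corpus, then train the truecasing model on it.
with open("train.en", encoding="utf-8") as f:
    tokenized = [mt.tokenize(line, return_str=True) for line in f]
mtc.train([line.split() for line in tokenized])

# Write the tokenized, truecased corpus used for BPE training/application.
with open("train.tok.tc.en", "w", encoding="utf-8") as out:
    for line in tokenized:
        out.write(mtc.truecase(line, return_str=True) + "\n")
```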
We use subword-nmt (https://pypi.org/project/subword-nmt/0.3.5/) to train and apply BPE to our data. Unless otherwise specified, all of our BPE models are trained on the concatenation of the source and target training corpora, i.e. the joint BPE of Sennrich et al. (2016). We use SacreBLEU [Post2018] to compute BLEU scores. The signature is BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.2.12.
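As an illustration, a minimal sketch of this pipeline (with placeholder file names, and 8k merges standing in for one point of our sweep) might look as follows:

```python
# Sketch: learn joint BPE codes, segment a line, and score with SacreBLEU.
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE
import sacrebleu

# Learn joint BPE on the concatenated source+target training text.
with codecs.open("train.cat.src-tgt", encoding="utf-8") as infile, \
        codecs.open("bpe.codes", "w", encoding="utf-8") as codes_out:
    learn_bpe(infile, codes_out, num_symbols=8000)

# Apply the learned codes to tokenized, truecased input.
with codecs.open("bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("this is a tokenized , truecased sentence"))

# Compute BLEU on detokenized output (mixed case, tok.13a by default).
bleu = sacrebleu.corpus_bleu(["the system output"], [["the reference"]])
print(bleu.score)
```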
We build our NMT systems with fairseq [Ott et al.2019]. We use two pre-configured architectures in fairseq for our study, namely lstm-wiseman-iwslt-de-en (referred to as tiny-lstm) and transformer-iwslt-de-en (referred to as deep-transformer), the model architectures tuned for fairseq's benchmark systems trained on IWSLT 2014 German-English data. However, we find (as can be seen from Table 1) that the number of parameters in tiny-lstm is an order of magnitude lower than in deep-transformer, mainly because the former has a single-layer uni-directional encoder and a single-layer decoder, while the latter has a six-layer encoder and decoder. For a fairer comparison we include a deep-lstm architecture with additional encoder and decoder layers, which roughly matches the number of parameters in deep-transformer. To study the effect of BPE on relatively smaller architectures, we also include shallow-transformer and shallow-lstm architectures, both with fewer encoder and decoder layers. The shallow-lstm also uses bidirectional LSTM layers in the encoder. These two architectures roughly match each other in terms of number of parameters. With these architectures, we believe we have covered a wide range of common choices of NMT architectures, especially in low-resource settings.
Following the experimental settings of Vaswani et al. (2017), we apply label smoothing of 0.1 for all of our Transformer experiments. We use the Adam optimizer [Kingma and Ba2014] for all experiments. For Transformer experiments, we use the learning rate scheduling of Vaswani et al. (2017), including the inverse square root learning rate scheduler with 4000 warmup updates and the corresponding initial warmup learning rate. For most LSTM experiments, we use a learning rate of 0.001 from the start and halve the learning rate every time the loss fails to improve on the development set. However, we find that for the deep-lstm architecture such a schedule tends to be unstable, much like training a Transformer without the warmup schedule. Applying the same warmup schedule as in the Transformer experiments works for most deep-lstm runs, except for the de-en experiments at BPE sizes 16k and 32k, for which we have to apply 8000 warmup updates.
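For clarity, the schedule used for the Transformer runs can be sketched as below; the peak and initial learning rates are illustrative placeholders rather than our exact configuration:

```python
# Inverse-square-root schedule with linear warmup (Vaswani et al. 2017 style).
# peak_lr and warmup_init_lr are illustrative, not our exact values.
def inverse_sqrt_lr(step, warmup_updates=4000, warmup_init_lr=1e-7, peak_lr=5e-4):
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr up to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # Decay proportionally to 1/sqrt(step), continuous at the end of warmup.
    return peak_lr * (warmup_updates / step) ** 0.5

# The LSTM runs instead start at lr = 0.001 and halve on dev-loss plateaus.
def halve_on_plateau(lr, dev_loss_improved):
    return lr if dev_loss_improved else lr * 0.5
```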
4.1 Analysis 1: Architectures
Table 2 shows BLEU scores for Transformer systems with BPE merge operations ranging from 0 to 32k. The Transformer experiments show a clear trend: large BPE settings such as 16k–32k are not optimal in low-resource settings. Regardless of the translation direction, the best BLEU scores for Transformer-based architectures fall somewhere in the 0–4k range, with little variation among the settings in that range; beyond it, however, performance generally drops drastically as the number of merge operations increases.
It should also be noted that the difference between the best and the worst performance is around 3 BLEU points (see Table 2), larger than the improvements claimed in many machine translation papers.
Table 3 shows BLEU scores for LSTM-based architectures trained with BPE merge operations ranging from 0 to 32k. Among the three architectures, shallow-lstm shows the least variation with respect to the choice of merge operations. For tiny-lstm, we observe a drastic performance drop between BPE merge operations 0/500 or 500/1k, but aside from these two settings, the variation is of similar scale to shallow-lstm. For deep-lstm, the variation is even larger than for the Transformer architectures, and compared to tiny-lstm and shallow-lstm, the optimal BPE configuration shifts toward smaller numbers of merge operations. However, we also notice that the overall absolute BLEU scores of deep-lstm are lower than those of shallow-lstm despite its using more parameters. We conjecture that the larger variation and lower BLEU scores of the deep-lstm experiments are largely due to overfitting on the small training data. Despite this effect, moving from tiny to deep models, we observe a trend that deeper models tend to make better use of smaller BPE sizes. In general, we conclude that, unlike the Transformer architecture, there is no typical optimal BPE configuration for the LSTM architectures. Because of this noisiness, we urge that future work using LSTM-based baselines tune the BPE configuration over a wide range on a development set to the extent possible, in order to ensure fair comparison.
4.2 Analysis 2: Joint vs Separate BPE
Another question not extensively explored in the existing literature is whether joint BPE is definitively the better way to apply BPE. The alternative, referred to here as separate BPE, is to build separate models for the source and target sides of the parallel corpus. Sennrich et al. (2016) conducted experiments with both joint and separate BPE, but those experiments used different BPE sizes, and not much analysis was devoted to the separate BPE model. Huck et al. (2017) is the only other work we are aware of that used separate BPE models. They mentioned that their joint BPE vocabulary of 59,500 yielded a German vocabulary twice as large as the English one, an undesirable characteristic for their study.
Before comparing system performance, we would like to understand systematically how the resulting vocabularies differ when BPE is applied jointly versus separately. Table 4 shows the two most representative cases for this comparison, namely the Arabic-English and French-English language pairs. These two pairs are representative because for Arabic-English the scripts of the two languages are completely different, while the French and English scripts differ only slightly. For the Arabic-English pair, the Arabic vocabulary is always roughly twice the size of the English vocabulary. Upon examination, we see that roughly half of the Arabic vocabulary consists of English words and subwords, scattered over about 2% of the lines on the Arabic side of the training corpus. (These English tokens are generally English names, URLs or other untranslated concepts or acronyms.) Hence, for most sentence pairs in the training data, the effective Arabic and English vocabularies under the joint BPE model are still roughly the same size. On the other hand, because of extensive subword vocabulary sharing, at lower BPE sizes the vocabulary sizes for French and English are always roughly equal to the number of BPE merge operations, regardless of separate or joint BPE. However, this equality starts to diverge as more BPE merge operations are performed, because the vocabulary differences between French and English start to play out in this regime. Unlike for Arabic-English, it is hard to predict the resulting vocabulary size from the number of merge operations used, because it is hard to know how many of the resulting subwords will be shared between the two languages.
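The comparison above can be reproduced with a simple vocabulary count over the segmented training data; the sketch below (file names are placeholders) counts per-side and shared subword types:

```python
# Sketch: compare the vocabularies induced by a BPE segmentation.
def vocab(path):
    """Collect the set of distinct subword types in a BPE-segmented file."""
    types = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            types.update(line.split())
    return types

ar, en = vocab("train.bpe.ar"), vocab("train.bpe.en")
print("ar:", len(ar), "en:", len(en), "shared:", len(ar & en))
```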
Table 5 shows our experimental results with separate BPE and our base architectures. (We only ran experiments at 2k, 8k and 32k to save computation time.) Among the configurations we experimented with, the difference between the best separate BPE performance and the best joint BPE performance appears minimal. On the other hand, while the worst BPE configuration remains the same for separate BPE models, we often see even worse performance for the Transformer at 32k separate BPE. We think this is a continuation of the trend observed in our main results, as the vocabulary tends to be even larger under separate BPE than under joint BPE.
Given the negligible difference in model performance, we think it may not be necessary to sweep BPE merge operations for both joint and separate settings. It may suffice to pick the setting that makes the most sense for the task at hand and focus the hyper-parameter search within that setting.
4.3 Analysis 3: Languages
We are interested in which properties of a language have the most impact on the variance of BLEU scores across BPE configurations. From our main experiments, we can already see a fairly consistent trend for the deep-transformer architecture: 0.5k and 32k merge operations roughly correspond to the best and worst BPE configurations, respectively. To add more data points, we assume 0.5k and 32k are always the best and worst configurations and build systems with these two configurations for both translation directions of 6 more language pairs, translating English into and out of Brazilian Portuguese (pt), Hebrew (he), Russian (ru), Turkish (tr), Polish (pl) and Hungarian (hu). Table 6 shows the results for these 6 language pairs. Our observations for the original 4 language pairs generalize well to the extra 6, and we observe a performance drop of similar magnitude when moving from 0.5k to 32k.
To gain insight into this question, we conduct a linear regression analysis using linguistic features of the 10 language pairs as independent variables and the BLEU score difference between the 0.5k and 32k merge operation settings as the dependent variable. (Note that for the language pairs in our main results, these may not necessarily be the best or worst systems, but the difference is minimal.) The linguistic features of interest are described as follows:
Type/Token Ratio: Taken from Bentz et al. (2016), this is the ratio between the number of token types and the number of tokens in the training corpus, ranging in (0, 1]. It is computed separately for the source and target languages (see the sketch after this list).
Alignment Ratio: Also taken from Bentz et al. (2016), this is the relative difference between the number of many-to-one alignments and one-to-many alignments in the training corpus. We followed the same alignment settings as Renduchintala et al. (2018). This feature is computed jointly for each parallel training corpus.
Morphological Type: Taken from Gerz et al. (2018), who assigned each language a morphological type from the following categories: isolating, fusional, introflexive and agglutinative. Since our language choices avoid languages of the isolating type, we end up using 6 binary features, namely is_src_fusional, is_src_introflexive, is_src_agglutinative, and the same three for the target side.
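As an illustration of the corpus-level features, the type/token ratio can be computed as below (a minimal sketch; the alignment ratio additionally requires word alignments, which we omit here):

```python
# Type/token ratio of a tokenized corpus: |distinct tokens| / |tokens|.
def type_token_ratio(path):
    types, tokens = set(), 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = line.split()
            tokens += len(words)
            types.update(words)
    return len(types) / tokens  # lies in (0, 1]
```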
The features were re-normalized to the [0, 1] range with min-max normalization. Our linear regression analysis was conducted with the Ordinary Least Squares (OLS) model in the Python statsmodels package (https://pypi.org/project/statsmodels/0.9.0/).
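A minimal sketch of this analysis is shown below; the feature values are random placeholders rather than our measurements, and only a subset of the features is included:

```python
# Sketch of the OLS regression with min-max normalized features.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20                                    # 10 language pairs x 2 directions
X = pd.DataFrame({                        # placeholder feature values
    "ttr_src": rng.uniform(0.05, 0.3, n),
    "ttr_tgt": rng.uniform(0.05, 0.3, n),
    "align_ratio": rng.uniform(-0.5, 0.5, n),
    "is_src_agglutinative": rng.integers(0, 2, n),
    "is_tgt_fusional": rng.integers(0, 2, n),
})
y = rng.uniform(0.0, 4.0, n)              # BLEU(0.5k) - BLEU(32k)

# Min-max normalize each feature to [0, 1], then fit OLS with an intercept.
X = (X - X.min()) / (X.max() - X.min())
result = sm.OLS(y, sm.add_constant(X)).fit()
print(result.summary())                   # coefficients and p-values
```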
The regression results are twofold. Surprisingly, we do not see any strong correlation between the type/token ratio or alignment ratio and the variance in BLEU across BPE settings. On the other hand, the regression indicates that having an agglutinative language on the source side or a fusional language on the target side increases this variance. While we have seen significant variance across BPE settings in all Transformer experiments, we think future work should be especially cautious with systems that translate out of an agglutinative language or into a fusional language (note that English is classified as fusional in this typology).
We conducted a systematic exploration over various numbers of BPE merge operations to understand their interaction with system performance. We carried out this investigation over different NMT architectures, including LSTM-based encoder-decoder and Transformer models, and over 10 language pairs in both translation directions. We leave the study of the effect of BPE in high-resource settings and on more language pairs, especially morphologically isolating languages, for future work. Subword regularization could also be studied in this manner.
Based on the findings, we make the following recommendations for selecting BPE merge operations in the future:
For Transformer-based architectures, we recommend that the sweep be concentrated in the 0–4k range.
For LSTM-based architectures, we find no typical optimal number of BPE merge operations and therefore urge future work to sweep over the full 0–32k range to the extent possible.
We find no significant performance difference between joint BPE and separate BPE and therefore recommend that the BPE sweep be conducted with either of these settings.
Furthermore, we strongly urge that the aforementioned checks be conducted when translating into fusional languages (such as English or French) or when translating from agglutinative languages (such as Turkish).
We hope that future work can use the experiments presented here to guide their choices of BPE and wordpiece configurations, and that readers of low-resource NMT papers exercise appropriate skepticism when the BPE configuration used in the experiments appears sub-optimal.
- [Bahdanau et al.2015] Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- [Bentz et al.2016] Bentz, Christian, Tatyana Ruzsics, Alexander Koplenig, and Tanja Samardzic. 2016. A comparison between morphological complexity measures: typological data vs. language corpora. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC), pages 142–153.
- [Cherry et al.2018] Cherry, Colin, George Foster, Ankur Bapna, Orhan Firat, and Wolfgang Macherey. 2018. Revisiting character-based neural machine translation with capacity and compression. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 4295–4305.
- [Denkowski and Neubig2017] Denkowski, Michael J. and Graham Neubig. 2017. Stronger baselines for trustable results in neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, NMT@ACL 2017, Vancouver, Canada, August 4, 2017, pages 18–27.
- [Gerz et al.2018] Gerz, Daniela, Ivan Vulić, Edoardo Maria Ponti, Roi Reichart, and Anna Korhonen. 2018. On the relation between linguistic typology and (limitations of) multilingual language modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium, October-November. Association for Computational Linguistics.
- [Huck et al.2017] Huck, Matthias, Simon Riess, and Alexander M. Fraser. 2017. Target-side word segmentation strategies for neural machine translation. In Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, pages 56–67.
- [Kingma and Ba2014] Kingma, Diederik P and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Koehn et al.2007] Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.
- [Kudo2018] Kudo, Taku. 2018. Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 66–75.
- [Luong et al.2015] Luong, Thang, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1412–1421.
- [Ott et al.2019] Ott, Myle, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038.
- [Post2018] Post, Matt. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October. Association for Computational Linguistics.
- [Qi et al.2018] Qi, Ye, Devendra Singh Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 529–535.
- [Renduchintala et al.2018] Renduchintala, Adithya, Pamela Shapiro, Kevin Duh, and Philipp Koehn. 2018. Character-aware decoder for neural machine translation. CoRR, abs/1809.02223.
- [Sennrich et al.2016] Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
- [Sennrich et al.2017] Sennrich, Rico, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017. The university of edinburgh’s neural MT systems for WMT17. In Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, pages 389–399.
- [Sutskever et al.2014] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
- [Vaswani et al.2017] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.
- [Wu et al.2016] Wu, Yonghui, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.