Despite the advances of neural machine translation (NMT), building effective translation systems for lower-resourced and morphologically rich languages remains a challenging process. The lack of large training data sets tends to lead to vocabulary sparsity, a problem exacerbated by the combinatorial explosion of permissible surface forms commonly encountered in morphologically rich languages. Current NMT systems typically operate at the level of subwords. Most commonly, these systems achieve vocabulary reduction by decomposing tokens into character sequences constructed by maximizing an information-theoretic compression criterion. The most widely used subword segmentation method is byte pair encoding (BPE), originally invented in the data compression literature by gage1994new, and introduced to the MT community by sennrich-etal-2016-neural. Another approach to open vocabulary NMT has been to compose characters or character n-grams to form word representations (ataman2018compositional; ling2015character).
As BPE has become mainstream, the question of whether segmenting words in a linguistically-informed fashion provides a benefit remains open. Intuitively, the translation task may be easier when using subwords that contain maximal linguistic signal, as opposed to heuristically derived units based on data compression. The greatest benefit may come in low-resource settings, where the training data is small and biases toward morphological structure may lead to more reusable units. We seek to address this question by exploring the usefulness of linguistically-motivated subword segmentation methods in NMT, as measured against a BPE baseline. Specifically, we investigate the effectiveness of the morphology-based segmentation algorithms of ataman2017linguistically and lignos2010learning as alternatives to BPE at the word or sentence level and find that they do not lead to reliable improvements under our experimental conditions. We perform our evaluation using both BLEU (papineni-etal-2002-bleu) and CHRF3 (popovic-2015-chrf). In our low-resource NMT setting, all these methods provide comparable results.
The contribution of this work is that it provides insights into the performance of these segmentation methods using a thorough experimental paradigm in a highly replicable environment. We evaluate without the many possible confounds related to back-translation and other processes used in state-of-the-art NMT systems, focusing on the performance of a straightforward Transformer-based system. To analyze the performance differences between the various segmentation strategies, we utilize a Bayesian linear model as well as nonparametric hypothesis tests.
|Translation task||Split||Sentences||Tokens (EN)||Tokens (non-EN)|
|KK-EN||Train (120k)||124,770||379,546||319,484|
|KK-EN||Train (220k)||222,424||1,717,414||1,365,605|
2 Related work
Attempts to create unsupervised, morphologically-aware segmentations have often been derived from the Morfessor family of morphological segmentation tools (virpioja2013morfessor). In addition to extensions of Morfessor, such as Cognate Morfessor (gronroos-etal-2018-cognate), ataman2017linguistically and ataman-federico-2018-evaluation introduced the LMVR model, derived from Morfessor FlatCat (gronroos-etal-2014-morfessor), and applied it to NMT tasks on Arabic, Czech, German, Italian, Turkish and English, noting that LMVR outperforms a BPE baseline in CHRF3 and BLEU. Contrary to their results, however, toral-etal-2019-neural find that using LMVR yielded mixed results: on a Kazakh-English translation task the authors observed marginal BLEU improvements over BPE, whereas for English-Kazakh, the authors reported LMVR to perform marginally worse than BPE in terms of CHRF3. There have also been efforts to combine BPE with linguistically motivated approaches. For instance, huck-etal-2017-target propose to combine BPE with various linguistic heuristics such as prefix, suffix, and compound splitting. The authors work with English-German and German-English tasks, and observe performance improvements of approximately 0.5 BLEU compared to a BPE-only baseline. As another example, weller-di-marco-fraser-2020-modeling combine BPE with a full morphological analysis on the source and target sides of an English-German translation task, and report performance improvements exceeding 1 BLEU point over a BPE-only baseline. Finally, even though sennrich-etal-2016-neural originally only used the NMT training set to train their segmentation model, others have recently found benefit in adding monolingual data to the process. 
In particular, scherrer-etal-2020-university used both SentencePiece and Morfessor as segmentation models on an Upper Sorbian–German translation task and found a monotonic increase in BLEU when the segmentation model was trained with additional data, while at the same time keeping the NMT training data constant.
|Original||The nation slowly started being centralized and during|
|SentencePiece||_the _n ation _sl ow ly _start ed _being _cent ral ized _and _d ur ing|
|Subword-NMT||the n ation s low ly star ted being cen tr ali z ed and d ur ing|
|LMVR||the nation s +low +ly st +ar +ted be +ing c +ent +ral +ized and d +ur +ing|
|MORSEL||the nation s low +ly start +ed being cen tr ali z +ed and du r +ing|
To investigate the effect of subword segmentation algorithms on NMT performance, we train translation models using the Transformer architecture of vaswani2017attention. We base our work on two recent datasets: FLoRes (guzman-etal-2019-flores), and select languages from the WMT 2019 Shared Task on News Translation (barrault-etal-2019-findings). Corpus statistics for all corpora can be found in Table 1. The FLoRes dataset consists of two language pairs, English-Nepali and English-Sinhala. To add another lower-resourced language, we use the Kazakh-English translation data from WMT19. In terms of morphological typology, both Nepali and Sinhala are agglutinative languages (prasain2011computational; priyanga-etal-2017-sinhala), as is Kazakh (kessikbayeva-cicekli-2014-rule). We conduct two sets of experiments on Kazakh to investigate how the amount of training data influences our results: first, we train only on the WikiTitles and News Commentary corpora (train120k), followed by another set of experiments (train220k) where we include the web crawl corpus prepared by Bagdat Myrzakhmetov of Nazarbayev University. We also conducted experiments with Gujarati data from WMT19, but BLEU scores were too low to allow for meaningful analysis.
For our models, we generally follow the architecture and hyperparameter choices of the FLoRes Transformer baseline, except for setting clip_norm to 0.1 and enabling FP16 training. Despite the widespread use of auxiliary techniques such as back-translation, we deliberately refrain from employing such techniques in this work. This is done to best isolate the effect of varying the subword segmentation algorithm, and to avoid the complexity of disentangling it from the effect of other factors. It should be noted, however, that such techniques were highly prevalent among systems submitted to the KK-EN WMT19 News Translation Shared Task: 64% used back-translation, 61% used ensembling, and 57% employed extensive corpus filtering (barrault-etal-2019-findings).
3.1 Subword segmentation algorithms
Below we describe our hyperparameter settings for the various subword segmentation algorithms. Sinhala and Nepali are tokenized using the Indic NLP tokenizer (kunchukuttan2020indicnlp), whereas for English and Kazakh we use the Moses tokenizer (koehn-etal-2007-moses). Example segmentations from actual data can be seen in Table 2. The segmentation methods we evaluate learn their subword vocabularies from frequency distributions of tokenized text. The exception to this is SentencePiece, whose subword units are learned from sentences, including whitespace. In the case of English and Kazakh, these sentences are untokenized whereas for Nepali and Sinhala, preprocessing with the Indic NLP tokenizer is applied following the approach of guzman-etal-2019-flores.
3.1.1 Subword-NMT and SentencePiece
As our baseline subword segmentation algorithm, we use the BPE implementation from Subword-NMT (https://github.com/rsennrich/subword-nmt). Throughout our experiments we use a joint vocabulary of the source and target and set the number of requested symbols to 5,000. For SentencePiece, we use the default BPE implementation (https://github.com/google/sentencepiece) with a joint vocabulary size of 5,000 words. These choices are motivated by the general observation by sennrich-zhang-2019-revisiting that lowering BPE size improves translation quality in ultra-low resource conditions, and the specific value of 5,000 was previously used by guzman-etal-2019-flores. The same small vocabulary size has been used elsewhere in the low-resource NMT literature, for instance by roest2020morphological while training NMT systems for Inuktitut. We also conducted a hyperparameter sweep for 2,500, 5,000, 7,500 and 10,000 merge operations, but noticed no improvement over the choice of 5,000 motivated by prior work.
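To make the notion of "merge operations" concrete, the following is a minimal sketch of BPE learning and application on a toy word-frequency table. It is not the Subword-NMT or SentencePiece implementation: the function names, the toy counts, and the lexicographic tie-breaking rule are all assumptions made for illustration, and end-of-word markers are omitted for brevity.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a word -> frequency mapping."""
    vocab = Counter()
    for word, freq in word_freqs.items():
        vocab[tuple(word)] += freq  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken lexicographically (an assumption
        # of this sketch) so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment `word` by replaying the learned merges in order, using '@@' markers."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    pieces = [s + "@@" for s in symbols[:-1]] + list(symbols[-1:])
    return " ".join(pieces)


# Hypothetical toy counts, not taken from our training data.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
toy_merges = learn_bpe(toy_counts, num_merges=2)
print(toy_merges)                       # [('s', 't'), ('e', 'st')]
print(apply_bpe("lowest", toy_merges))  # l@@ o@@ w@@ est
```

In this sketch, the number of requested symbols corresponds to `num_merges`: a smaller value leaves more words split into short, frequently reused pieces.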
For LMVR (ataman2017linguistically), we utilize slightly modified versions of the sample scripts from the author’s GitHub repository (https://github.com/d-ataman/lmvr). Our main modification is tuning the corpusweight hyperparameter in the Morfessor Baseline (virpioja2013morfessor) model used to seed the LMVR model. Tuning is performed by maximizing the F1 score for segmenting the English side of the training data, using the English word lists from the Morpho Challenge 2010 shared task (kurimo2010proceedings) as gold standard segmentations. After tuning the Morfessor Baseline model, we train a separate LMVR model for each language in a language pair using a vocabulary size parameter of 2,500 per language.
MORSEL (lignos2010learning) provides linguistically-motivated unsupervised morphological analysis that has been shown to work effectively on small datasets (chan2010investigating). While it provides derivations of morphologically complex forms via a combination of stems and affix rules, we modified it to provide a segmentation and then postprocessed its output to apply BPE to the stems to yield a limited-size vocabulary. For example, on the English side of the NE-EN training data, MORSEL analyzes the word algebraic as resulting from the stem algebra being combined with the suffix rule +ic. A BPE model is trained on all of the stems in MORSEL’s analysis, and when that is applied to the stem, it is segmented as al@@ ge@@ br@@ a. The stem and suffix are combined using a special plus character to denote suffixation, so the final segmentation is al@@ ge@@ br@@ a +ic. Tuning is performed as with LMVR, using the English word lists from the Morpho Challenge 2010 shared task (kurimo2010proceedings) as a reference. We adjust the number of BPE units learned from the stems to keep the total per-language vocabulary below 2,500.
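The postprocessing step described above, combining a BPE-segmented stem with MORSEL's suffix rules, can be sketched as follows. This is an illustrative sketch rather than our actual postprocessing script: the function name is hypothetical, and the stem pieces are assumed to have been produced already by a BPE model trained on the stems.

```python
def combine_stem_and_suffixes(stem_pieces, suffix_rules):
    """Join BPE pieces of a stem with '@@' markers, then append '+'-prefixed suffixes.

    `stem_pieces` is the BPE segmentation of the stem (assumed given);
    `suffix_rules` are suffixes in MORSEL-style notation, e.g. '+ic'.
    """
    # All stem pieces except the last carry the '@@' continuation marker.
    marked_stem = [piece + "@@" for piece in stem_pieces[:-1]] + list(stem_pieces[-1:])
    return " ".join(marked_stem + list(suffix_rules))


# The 'algebraic' example from the text: stem 'algebra' plus suffix rule '+ic'.
segmented = combine_stem_and_suffixes(["al", "ge", "br", "a"], ["+ic"])
print(segmented)  # al@@ ge@@ br@@ a +ic
```

Keeping the suffix marker (`+`) distinct from the BPE continuation marker (`@@`) lets the two kinds of boundaries be told apart in the segmented text.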
4 Results and analysis
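|Translation task||Segmentation method||BLEU||CHRF3|
|EN-KK (train120k)||LMVR||1.00 ± 0.12||21.98 ± 0.41|
|EN-KK (train120k)||MORSEL||0.94 ± 0.11||21.24 ± 0.89|
|EN-KK (train120k)||SentencePiece||1.04 ± 0.09||21.48 ± 0.47|
|EN-KK (train120k)||Subword-NMT||1.32 ± 0.08||22.12 ± 0.28|
|EN-KK (train220k)||LMVR||1.82 ± 0.13||22.74 ± 0.84|
|EN-KK (train220k)||MORSEL||2.06 ± 0.11||22.88 ± 0.40|
|EN-KK (train220k)||SentencePiece||2.18 ± 0.08||22.78 ± 0.43|
|EN-KK (train220k)||Subword-NMT||1.94 ± 0.22||22.62 ± 0.88|
|KK-EN (train120k)||LMVR||1.70 ± 0.07||23.72 ± 0.44|
|KK-EN (train120k)||MORSEL||2.62 ± 0.08||26.26 ± 0.36|
|KK-EN (train120k)||SentencePiece||2.34 ± 0.21||24.64 ± 0.81|
|KK-EN (train120k)||Subword-NMT||3.14 ± 0.18||25.92 ± 0.54|
|KK-EN (train220k)||LMVR||9.42 ± 0.26||33.88 ± 0.76|
|KK-EN (train220k)||MORSEL||10.44 ± 0.48||34.58 ± 0.88|
|KK-EN (train220k)||SentencePiece||10.02 ± 0.29||33.50 ± 0.54|
|KK-EN (train220k)||Subword-NMT||10.68 ± 0.34||35.52 ± 0.41|
|EN-NE||LMVR||4.32 ± 0.04||31.00 ± 0.29|
|EN-NE||MORSEL||4.38 ± 0.16||31.28 ± 0.47|
|EN-NE||SentencePiece||4.58 ± 0.15||31.36 ± 0.35|
|EN-NE||Subword-NMT||4.42 ± 0.16||30.96 ± 0.34|
|NE-EN||LMVR||7.84 ± 0.11||34.10 ± 0.16|
|NE-EN||MORSEL||5.30 ± 0.30||28.18 ± 0.97|
|NE-EN||SentencePiece||8.42 ± 0.23||34.40 ± 0.73|
|NE-EN||Subword-NMT||8.46 ± 0.15||34.18 ± 0.13|
|EN-SI||LMVR||1.44 ± 0.32||28.22 ± 0.30|
|EN-SI||MORSEL||1.12 ± 0.13||27.44 ± 0.34|
|EN-SI||SentencePiece||1.08 ± 0.31||27.56 ± 0.43|
|EN-SI||Subword-NMT||0.88 ± 0.13||26.78 ± 0.51|
|SI-EN||LMVR||7.24 ± 0.22||32.16 ± 0.63|
|SI-EN||MORSEL||7.78 ± 0.16||34.32 ± 0.30|
|SI-EN||SentencePiece||7.52 ± 0.08||33.58 ± 0.43|
|SI-EN||Subword-NMT||7.76 ± 0.25||34.38 ± 0.38|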
Table 3: Mean and standard deviation of BLEU and CHRF3 across translation tasks and segmentation methods. Underlined values represent the highest mean scores. Bolded values are not significantly different from the highest score, as determined by Dunn’s test.
Our experimental results can be seen in Table 3. All BLEU scores were computed using sacrebleu, and all CHRF3 scores using nltk. Each row consists of the mean and standard deviation computed across 5 random seeds for each configuration. We also plot the raw results in Figure 1. Table 4 gives counts for the number of times each segmentation approach was the top-performing one or statistically indistinguishable from it. Table 7 in the appendix gives p-values for all comparisons performed.
Overall, based on Tables 3 and 4, no segmentation method seems to emerge as the clear winner across translation tasks, although BPE applied at the token (Subword-NMT) or sentence (SentencePiece) level performs well consistently. Subword-NMT or SentencePiece perform best in 12 out of 16 cases (counting BLEU and CHRF3 for each translation task), while morphology-based methods rank best in 4 out of 16 cases. In particular, we note that morphology-based methods seem to achieve or tie the best BLEU performance for translation tasks involving SI, and best CHRF3 performance for KK-EN with smaller training data (train120k) as well as EN-SI. However, when using LMVR, we fail to find the significant gains in BLEU compared to BPE reported by ataman2017linguistically.
Comparing our results to guzman-etal-2019-flores, we note that the scores are similar, although not directly comparable as we report lowercased BLEU scores. (We lowercased all data in preprocessing because MORSEL and Morfessor, from which LMVR is derived, are designed to operate on lowercase inputs.) guzman-etal-2019-flores report EN-NE/NE-EN baseline BLEU scores of 4.3 and 7.6 using a single random seed, which are in line with our results in Table 3. For EN-SI/SI-EN, the authors report 1.2 and 7.2 BLEU, which likewise matches our findings. Even though our scores are low overall, they are in the range to be expected given this approach, data size, and languages.
In order to compare our results to WMT19 participant systems, it is only meaningful to compare our system to baseline systems due to the widespread use of auxiliary training techniques, such as back-translation. For instance, casas-etal-2019-talp report baseline NMT scores of 2.32 on KK-EN and 1.42 on EN-KK, which are in line with our MORSEL and SentencePiece results on KK-EN, and Subword-NMT results on EN-KK in the train120k condition.
4.1 Modeling BLEU and CHRF3
|Comparison||BLEU||CHRF3|
|SentencePiece - Subword-NMT||-0.05 ± 0.08||-0.07 ± 0.20|
|MORSEL - Subword-NMT||-0.12 ± 0.07||0.02 ± 0.18|
|LMVR - Subword-NMT||-0.26 ± 0.06||-0.19 ± 0.21|
Based on Figure 1 and Tables 3 and 4, the BLEU and CHRF3 scores vary with both the translation task and segmentation method. Intuitively, the scores seem to cluster around a certain range for each translation task, and are perturbed slightly depending on the choice of segmentation method. To better disentangle the influence of these factors, we fit a Bayesian linear model to the experimental data, treating the final BLEU/CHRF3 score as a sum of a “translation task effect”, a “segmentation method effect”, and a translation task-specific noise term. (Appendix A gives details of our model, and Table 6 gives the point estimates of the posterior mean and standard deviation for the translation task and segmentation method effects.) The translation task effect and noise terms are estimated for each of the eight translation tasks (e.g. SI-EN and EN-SI are estimated separately), and the segmentation method effect is estimated for each of the four segmentation methods using results from all translation tasks. To explicitly compare SentencePiece, LMVR and MORSEL to the Subword-NMT baseline, we also model the pairwise differences between each method’s segmentation method effect and that of Subword-NMT. The posterior inferences for these quantities can be seen in Table 5 and are plotted in the appendix.
For BLEU, the difference for LMVR is several standard deviations below 0, suggesting that it performs worse than the Subword-NMT baseline when accounting for all translation tasks. Similarly, MORSEL is almost 2 standard deviations away from 0, though its posterior interval does cover 0. In both cases, the effect size is small, with means of -0.12 and -0.26 BLEU points for MORSEL and LMVR, respectively. The reliability of this difference also disappears for LMVR under the CHRF3 model, where no segmentation method’s posterior mean is several standard deviations away from 0. We hypothesize that this greater discrimination among methods when using BLEU may originate from the differences between how BLEU and CHRF3 operate. Since CHRF3 is a character-level metric, it is less prone than BLEU to penalizing a given translation due to subword outputs that are almost correct. For instance, consider an output of do@@ gs (detokenized as dogs) with dog as the reference; while CHRF3 awards credit for this as a partial match, BLEU treats it as entirely incorrect. This further underscores our observation that segmentation methods perform inconsistently across experimental conditions.
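The dogs/dog contrast can be illustrated with a small sketch. The character-level score below is a simplified chrF-style F-score (averaged character n-gram precision and recall, combined with beta = 3). It is not the nltk implementation used in our experiments, and the function names are hypothetical. The word-level function computes clipped unigram precision, the word-matching component on which BLEU is built.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams of order n (empty if the text is too short)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def char_fscore(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified chrF-style score: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders that one side is too short to support
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def word_unigram_precision(hypothesis, reference):
    """Clipped word-level unigram precision over whitespace tokens."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    total = sum(hyp.values())
    return sum((hyp & ref).values()) / total if total else 0.0


print(word_unigram_precision("dogs", "dog"))  # 0.0
print(round(char_fscore("dogs", "dog"), 4))   # 0.9465
```

Under these simplifying assumptions, the almost-correct output receives no word-level credit but a high character-level score, which is consistent with the weaker discrimination among methods that we observe under CHRF3.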
5 Conclusion and future work
Contrary to our hypothesis about the usefulness of morphology-aware segmentation, we see no consistent advantage, and possibly a small disadvantage, to using LMVR or MORSEL in this resource-constrained setting. By and large, our experiments and modeling show that no segmentation approach consistently achieves the best BLEU/CHRF3 across all translation tasks. BPE remains a good default segmentation strategy, but it is possible that LMVR, MORSEL, or similar systems may show larger performance advantages for languages with specific morphological structures. Consequently, we believe further work is needed to better understand when morphology-aware methods are most effective and to develop methods that provide a consistent advantage over BPE. One such avenue of future work would be to broaden our analysis to more languages, including languages that are higher-resourced but morphologically rich as well as ones that are lower-resourced but morphologically poor. Ortega2021, which we encountered during preparation of the final version of this paper, began to address these questions by comparing Morfessor with BPE and their own BPE variant on Finnish, Quechua and Spanish. An alternative approach which we intend to pursue in future work is experimenting with supervised morphological segmenters or analyzers that can be efficiently developed even in lower-resourced settings. Incorporating such “gold standard” segmentations may make it clearer whether the unsupervised morphological segmenters are capturing linguistically-relevant structure. Finally, there is the question of whether BPE can approximate a general representation for a language instead of converging on a corpus-specific set of subwords. To test this, one can add monolingual data and train the BPE segmentation on that larger data set. Ideally the new, “enriched” segmentations would depend less on the specific vocabulary of the training corpus.
As noted above, scherrer-etal-2020-university observed this approach to be helpful in terms of BLEU. However, it remains unknown why the subwords derived from a larger corpus perform better, and whether better identification of morphological structure could be responsible. We hope that this work and these ideas will catalyze further research, and that efficient methods for translating to and from lower-resourced languages can be developed as a result.
Appendix A Bayesian Linear Model Details
Mathematically, our model expresses each BLEU/CHRF3 score as the sum of a “translation task effect” and a “segmentation method effect,” with noise governed by a translation task-specific variance term. To initialize our Bayesian linear model from Equation 1, we set separate priors on the translation task effect and the segmentation method effect for the BLEU model and for the CHRF3 model. The priors are the same regardless of translation task or segmentation method. For our noise terms, we use the same prior in all models. Our rationale for these priors is that the translation task effect should place most of its probability mass within the observed range of BLEU/CHRF3, whereas the segmentation method effect should, a priori, take on positive and negative values with equal probability, reflecting a lack of prior information. All models are fit using PyMC3, with MCMC posterior inference performed using the No-U-Turn Sampler.
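The additive structure described above can be written compactly as follows. The symbol names ($y$, $\eta$, $\tau$, $\sigma$, $\delta$) and the Gaussian form of the noise are assumptions made here for illustration, and are not necessarily the notation or likelihood of Equation 1. Here $t$ indexes the eight translation tasks, $s$ the four segmentation methods, and $i$ the random seeds.

```latex
% Score for task t, segmentation method s, and random seed i:
% task effect + method effect + task-specific noise (Gaussian form assumed).
y_{t,s,i} = \eta_t + \tau_s + \varepsilon_{t,s,i},
\qquad
\varepsilon_{t,s,i} \sim \mathcal{N}\!\left(0, \sigma_t^{2}\right)

% Pairwise difference of each method from the Subword-NMT baseline:
\delta_s = \tau_s - \tau_{\text{Subword-NMT}}
```

Under this reading, the pairwise differences $\delta_s$ are the quantities whose posterior means and standard deviations are compared against 0.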
|Segmentation method effect||Posterior mean ± SD (BLEU)||Posterior mean ± SD (CHRF3)|
|LMVR||-0.09 ± 0.47||0.41 ± 0.50|
|MORSEL||0.05 ± 0.47||0.63 ± 0.50|
|SentencePiece||0.12 ± 0.47||0.53 ± 0.50|
|Subword-NMT||0.17 ± 0.47||0.60 ± 0.50|
|SentencePiece - Subword-NMT||-0.05 ± 0.08||-0.07 ± 0.20|
|LMVR - Subword-NMT||-0.26 ± 0.06||-0.19 ± 0.21|
|MORSEL - Subword-NMT||-0.12 ± 0.07||0.02 ± 0.18|
|Translation task effect||Posterior mean ± SD (BLEU)||Posterior mean ± SD (CHRF3)|
|EN-KK (train120k)||1.01 ± 0.47||21.16 ± 0.52|
|EN-KK (train220k)||1.94 ± 0.47||22.21 ± 0.51|
|EN-NE||4.36 ± 0.47||30.60 ± 0.50|
|EN-SI||1.07 ± 0.47||26.95 ± 0.52|
|KK-EN (train120k)||2.39 ± 0.48||24.58 ± 0.56|
|KK-EN (train220k)||10.07 ± 0.48||33.81 ± 0.54|
|NE-EN||7.41 ± 0.56||32.02 ± 0.82|
|SI-EN||7.51 ± 0.47||33.05 ± 0.54|
All posterior means for the translation task effects are close to the average BLEU/CHRF3 scores per translation task observed in Table 3, and fall between 1.01 and 10.07 for the BLEU model, and 21.16 and 33.81 for the CHRF3 model. In contrast, the posterior means for the segmentation method effects are universally small: -0.09, 0.05, 0.12, and 0.17 for LMVR, MORSEL, SentencePiece and Subword-NMT, respectively, with a posterior standard deviation of 0.47. The segmentation method effects under the CHRF3 model exhibit a similar pattern: 0.41, 0.63, 0.53, 0.60, with a posterior standard deviation of 0.50. Compared to the posterior standard deviation, as well as the translation task effects, the segmentation method effects are practically 0. This, in conjunction with our analysis using Dunn’s test, suggests that there is not a segmentation method that consistently works best across translation tasks. Figures 2 and 3 show posterior predictive distributions for the BLEU and CHRF3 models. Figure 4 shows the posterior distribution of pairwise differences between each of the other segmentation methods and Subword-NMT.
|Language pair||Segmentation method||p-value (BLEU)||p-value (CHRF3)|