Unsupervised machine translation is of particular significance for low-resource language pairs. In contrast to traditional machine translation, it does not rely on large amounts of parallel data. When parallel data is scarce, both neural machine translation (NMT) and phrase-based machine translation (PBMT) systems can be trained using large monolingual corpora (Artetxe et al., 2018b, c; Lample et al., 2018).
Our translation systems submitted to WMT19 were created in several steps. Following the strategy of Artetxe et al. (2018b), we first train monolingual phrase embeddings and map them to the cross-lingual space. Second, we use the mapped embeddings to initialize the phrase table of the PBMT system, which is first tuned and later refined with back-translation. We then translate the Czech monolingual corpus with the PBMT system to produce several synthetic parallel German-Czech corpora. Finally, we train a supervised NMT system on a filtered synthetic data set, where we exclude sentences tagged as “not Czech”, shuffle the word order and handle mistranslated named entities. The training pipeline is illustrated in Figure 1.
The structure of this paper is the following. The existing approaches used to build our system are described in Section 2. The data for this shared task is described in Section 3. Section 4 gives details on phrase embeddings. Section 5 describes the phrase-based model and how it was used to create synthetic corpora. Section 6 proceeds to the neural model trained on the synthetic data. Section 7 introduces the benchmarks to compare our systems with supervised NMT and Section 8 reports the results of the experiments. Finally, Section 9 summarizes and concludes the paper.
Unsupervised machine translation has recently been explored by Artetxe et al. (2018c), Artetxe et al. (2018b) and Lample et al. (2018). They propose unsupervised training techniques for both the PBMT model and the NMT model, as well as a combination of the two, in order to extract the necessary translation information from monolingual data. For the PBMT model (Lample et al., 2018; Artetxe et al., 2018b), the phrase table is initialized with an n-gram mapping learned without supervision. For the NMT model (Lample et al., 2018; Artetxe et al., 2018c), the system is designed to have a shared encoder and is trained iteratively on a synthetic parallel corpus which is created on-the-fly by adding noise to the monolingual text (to learn a language model by de-noising) and by adding a synthetic source side created by back-translation (to learn a translation model by translating from a noised source).
The key ingredient for the functioning of the above-mentioned systems is the initial transfer from a monolingual space to a cross-lingual space without using any parallel data. Zhang et al. (2017) and Conneau et al. (2018) have inferred a bilingual dictionary in an unsupervised way by aligning monolingual embedding spaces through adversarial training. Artetxe et al. (2018a) propose an alternative method of mapping monolingual embeddings to a shared space by exploiting their structural similarity and iteratively improving the mapping through self-learning.
In line with the rules of the WMT19 unsupervised shared task, we trained our models on the NewsCrawl (http://data.statmt.org/news-crawl/) corpus of newspaper articles collected over the period of 2007 to 2018.
We tokenized and truecased the text using standard Moses scripts. Sentences with fewer than 3 or more than 80 tokens were removed. The resulting monolingual corpora used for training the unsupervised PBMT system consisted of 70M Czech sentences and 267M German sentences.
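The length filter described above can be sketched as follows. This is only an illustration under the assumption that sentences are already tokenized and whitespace-separated; the function names are hypothetical and are not taken from the Moses scripts.

```python
def keep_sentence(tokens, min_len=3, max_len=80):
    """Return True if the sentence has between min_len and max_len tokens (inclusive)."""
    return min_len <= len(tokens) <= max_len


def filter_corpus(lines, min_len=3, max_len=80):
    """Yield the whitespace-tokenized lines whose token count is within the bounds."""
    for line in lines:
        if keep_sentence(line.split(), min_len, max_len):
            yield line
```

With the default bounds, a 2-token or an 81-token sentence is dropped, while sentences of exactly 3 or 80 tokens are kept.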
We performed further filtering of the Czech corpus before the NMT training stage. Since there are a lot of Slovak sentences in the Czech NewsCrawl corpus, we used the language tagger langid.py (Lui and Baldwin, 2012) to tag all sentences and remove the ones which were not tagged as Czech. After cleaning the corpus, the resulting Czech training set comprises 62M sentences.
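A minimal sketch of this language filter is given below. The classifier is passed in as a callable so that the example stays self-contained; in practice one could pass `langid.classify`, which returns a `(language_code, score)` pair. The function name `filter_by_language` is hypothetical.

```python
def filter_by_language(sentences, classify, keep_lang="cs"):
    """Keep only sentences whose predicted language code equals keep_lang.

    classify: any callable mapping a string to a (language_code, score) pair,
    for example langid.classify from langid.py.
    """
    return [s for s in sentences if classify(s)[0] == keep_lang]
```

Keeping the classifier as a parameter also makes it easy to swap in a different language identifier without touching the filtering logic.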
Since a small amount of parallel data was allowed for tuning the unsupervised system, we used newstest2013 for development of the PBMT system. Finally, we used newstest2012 to select the best PBMT model and newstest2010 as the validation set for the NMT model.
4 Phrase Embeddings
The first step towards unsupervised machine translation is to train monolingual n-gram embeddings and infer a bilingual dictionary by learning a mapping between the two embedding spaces. The resulting mapped embeddings allow us to derive the initial phrase table for the PBMT model.
We first train phrase embeddings (up to trigrams) independently in the two languages. Following Artetxe et al. (2018b), we use an extension of the word2vec skip-gram model with negative sampling (Mikolov et al., 2013) to train phrase embeddings. We use a window size of 5, embedding size of 300, 10 negative samples, 5 iterations and no subsampling. We restrict the vocabulary to the most frequent 200,000 unigrams, 400,000 bigrams and 400,000 trigrams.
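The vocabulary restriction can be illustrated by a simple frequency count. The sketch below is not the phrase-embedding training itself; it only shows, under the assumption of pre-tokenized input, how the most frequent n-grams of each order could be selected. The function name is hypothetical.

```python
from collections import Counter


def ngram_vocabulary(sentences, limits=(200_000, 400_000, 400_000)):
    """Return the most frequent n-grams for n = 1..len(limits).

    sentences: iterable of token lists.
    limits[n-1]: maximum number of n-grams kept for order n.
    """
    counts = [Counter() for _ in limits]
    for tokens in sentences:
        for n, counter in enumerate(counts, start=1):
            for i in range(len(tokens) - n + 1):
                counter[tuple(tokens[i:i + n])] += 1
    return [
        [ngram for ngram, _ in counter.most_common(limit)]
        for counter, limit in zip(counts, limits)
    ]
```

On a corpus of the size reported here, counting all trigrams in memory would be costly, so a real implementation would likely prune rare n-grams during counting.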
4.2 Output: Unsupervised Phrase Table
The output of this processing stage is the unsupervised phrase table which is filled with source and target n-grams. For the sake of a reasonable phrase table size, only the 100 nearest neighbors are kept as translation candidates for each source phrase. The phrase translation probabilities are derived from a softmax function over the cosine similarities of their respective mapped embeddings (Artetxe et al., 2018a).
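The scoring step can be sketched in a few lines. The code below is an illustration only: the `temperature` parameter and the choice to normalize over the retained candidates (rather than the full target vocabulary) are assumptions, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translation_candidates(src_vec, tgt_vecs, k=100, temperature=1.0):
    """Return [(target_phrase, probability)] for the k nearest target phrases.

    tgt_vecs: dict mapping a target phrase to its mapped embedding.
    Probabilities are a softmax over cosine similarities of the kept candidates.
    """
    scored = sorted(
        ((cosine(src_vec, vec), phrase) for phrase, vec in tgt_vecs.items()),
        reverse=True,
    )[:k]
    weights = [math.exp(sim / temperature) for sim, _ in scored]
    total = sum(weights)
    return [(phrase, w / total) for (_, phrase), w in zip(scored, weights)]
```

A lower temperature would concentrate the probability mass on the closest neighbors, while a higher one flattens the distribution.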
5 PBMT Model
We followed the Monoses (https://github.com/artetxem/monoses) pipeline of Artetxe et al. (2018b) for our unsupervised phrase-based system. The initial translation model is estimated based on the unsupervised phrase table induced from the mapped embeddings and the language model is estimated on the monolingual data. The reordering model is not used in the first step. The initial model is tuned and later iteratively refined by back-translation (Sennrich et al., 2016).
The models are estimated using Moses (Koehn et al., 2007), with KenLM (Heafield, 2011) for 5-gram language modelling and fast_align (Dyer et al., 2013) for alignments. The feature weights of the log-linear model are tuned using Minimum Error Rate Training.
The back-translation process is illustrated in Figure 2. Both de→cs and cs→de systems are needed at this step. The de→cs system is used to translate a portion of the German monolingual corpus to Czech and create a synthetic parallel data set, which is then used to train the cs→de system, and the procedure continues the other way around. We note that we do not make use of the initial model for cs→de. Once the synthetic parallel data set is created, the problem turns into a supervised one and we can use standard PBMT features, including the standard phrase table extraction procedure and the reordering model estimated on the aligned data sets.
Since back-translation is computationally demanding, we experimented with using a synthetic data set of 2 and 4 million sentences for back-translation rather than translating the whole monolingual corpus.
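The alternating procedure can be summarized by the following sketch. The `train` and `translate` callables are hypothetical placeholders for the actual PBMT training and decoding steps, and what counts as one "round" here (one update of each direction) is an assumption made for illustration.

```python
def iterative_back_translation(initial_de_cs, mono_de, mono_cs, train, translate,
                               rounds=3, subset_size=2_000_000):
    """Alternate between the two translation directions.

    train(src_sentences, tgt_sentences) -> model translating src into tgt
    translate(model, sentences) -> list of translated sentences
    Only the de->cs direction starts from an initial (unsupervised) model.
    """
    de_cs = initial_de_cs
    cs_de = None
    for _ in range(rounds):
        # Translate a portion of the German corpus into Czech; the synthetic
        # Czech is the source side and the authentic German the target side.
        de_part = mono_de[:subset_size]
        synthetic_cs = translate(de_cs, de_part)
        cs_de = train(synthetic_cs, de_part)

        # Continue the other way around with the Czech monolingual corpus.
        cs_part = mono_cs[:subset_size]
        synthetic_de = translate(cs_de, cs_part)
        de_cs = train(synthetic_de, cs_part)
    return de_cs, cs_de
```

In every training call the target side is authentic text, so errors of the current model only affect the source side of the synthetic corpus.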
|Iteration No.||Authentic Dev Set||Synthetic Dev Set|
5.2 Output: PBMT Systems (cs→de)
We evaluated various PBMT models to select the best candidate to translate the whole monolingual corpus from Czech to German. The translation quality was measured on newstest2012.
We experimented with tuning the model both on an authentic parallel development set (3K sentence pairs) and a synthetic back-translated development set (10K sentence pairs). In the first scenario, possibly as a result of a smaller development set, the model started diverging after the first round of back-translation. In the second scenario, the best result is achieved after two and three rounds of back-translation for the cs→de and de→cs model, respectively (see the results in Table 1).
We selected the cs→de model with the highest BLEU of 14.22 for creating the synthetic corpus for the initial training of the NMT system. This PBMT model was tuned on a synthetic development set with two rounds of back-translation.
However, after reviewing the translations and despite the BLEU results, we also kept the cs→de model with a BLEU score of 12.06 which was tuned on authentic parallel data. Its translations were superior, especially in terms of word order.
5.3 Output: Synthetic Corpora
The training data sets for our NMT models were created by translating the full target monolingual corpus (filtered as described in Section 3) from Czech to German using the best performing cs→de PBMT models. Due to time constraints, we kept improving our PBMT models while already training the NMT model on the synthetic data. As a result, the final NMT model used synthetic data sets of increasing quality in four training stages.
5.3.1 Frequent Errors in Synthetic Corpora
We read through the translations to detect further error patterns which are not easily detectable by BLEU but have a significant impact on human evaluation. We noticed three such patterns:
- wrong word order (e.g. in contrast to the Czech word order, verbs in subordinate clauses and verbs following a modal verb are at the end of a sentence in German);
- non-translated Czech words on the synthetic German side of the corpus (e.g. a German synthetic phrase auf písčitém Küste where the Czech word písčitém (sandy) remains non-translated);
- randomly mistranslated named entities (NEs) (e.g. king Ludvik translated as king Harold or Brno translated as Kraluv Dvur).
5.3.2 Heuristics to Improve Synthetic Corpora
In order to reduce the detrimental effects of the above errors, we created several variations of the synthetic corpora. Here we summarize the final versions of the corpora that served in the subsequent NMT training:
The PBMT-Unsupervised-bestBLEU model was used for creating the data set for the initial training of the model. All submitted systems were trained on this initial training set.
This time we translated the Czech corpus by the PBMT-Unsupervised-wordOrder model. We cleaned the German side of the synthetic corpus by removing the Czech words which the PBMT model failed to translate and only copied. We identified words with Czech diacritics and replaced them on the German side with the unk token.
Before we removed the non-translated words from the synthetic corpus, the NMT model frequently saw the same Czech words in both the source and the target during training and learned to copy these words. As a result, the final Czech translations also often included German words directly copied from the source. After fine-tuning on the cleaned corpus, the models rarely copy German words during the translation to Czech.
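The cleaning step can be approximated by a character-based check, as in the sketch below. The exact character set and the spelling of the placeholder token are assumptions; note that none of the listed characters is a German umlaut, so ordinary German words such as Küste are left untouched.

```python
# Characters with Czech diacritics (assumed set; lower and upper case).
CZECH_DIACRITICS = set("áčďéěíňóřšťúůýžÁČĎÉĚÍŇÓŘŠŤÚŮÝŽ")


def mask_untranslated(german_tokens, unk="unk"):
    """Replace tokens containing Czech diacritics with the unk placeholder."""
    return [unk if CZECH_DIACRITICS & set(tok) else tok for tok in german_tokens]
```

Applied to the example phrase auf písčitém Küste from Section 5.3.1, the middle token is replaced while the two German words are preserved.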
The SynthCorpus-noCzech corpus was further treated to improve the word order in the synthetic corpus. We shuffled words in the synthetic German sentences within a 5-word window and mixed the reordered sentences into the original ones. We essentially doubled the size of the training corpus by first reordering odd-indexed sentences while keeping even-indexed sentences intact and then vice versa.
The motivation for this augmentation was to support the NMT system in learning to handle word reordering less strictly, essentially to improve its word order denoising capability. Ideally, the model should learn that German word order need not be strictly followed when translating to Czech. This feature is easy to observe in authentic parallel texts but the synthetic corpora are too monotone. We are aware that a 5-word window is not sufficient to illustrate the reordering necessary for German verbs but we did not want to introduce overly language-specific components to our technique.
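One possible reading of this augmentation is sketched below. How the 5-word window is applied is not specified in the text; the sketch assumes consecutive non-overlapping chunks of five words and 0-based sentence indices, and both function names are hypothetical.

```python
import random


def shuffle_in_windows(tokens, window=5, rng=random):
    """Shuffle tokens inside consecutive non-overlapping windows."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]  # slicing copies, input is untouched
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


def augment_with_reordering(pairs, window=5, seed=0):
    """Double the corpus: pass 1 reorders odd-indexed German sentences,
    pass 2 reorders even-indexed ones; the Czech side is never changed.

    pairs: list of (german_tokens, czech_tokens).
    """
    rng = random.Random(seed)
    augmented = []
    for parity in (1, 0):
        for index, (german, czech) in enumerate(pairs):
            if index % 2 == parity:
                german = shuffle_in_windows(german, window, rng)
            augmented.append((german, czech))
    return augmented
```

Under this reading every sentence appears once with its original German word order and once with a locally shuffled one, which is consistent with the doubling of the corpus size.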
The SynthCorpus-noCzech-reordered was further treated to alleviate the problem of mistranslated NEs present in the data.
NEs were identified in the monolingual Czech corpus by the NE recognition tagger NameTag (http://ufal.mff.cuni.cz/nametag; Straková et al., 2014). The model was trained on the training portion of the Czech Named Entity Corpus 2.0 (http://ufal.mff.cuni.cz/cnec/cnec2.0), which uses a detailed two-level named entity hierarchy. We then used automatic word alignments (fast_align) between the Czech side and the synthetic German side of the corpus and checked the German counterparts of automatically-identified Czech NEs. If the German counterpart was close enough (Levenshtein distance of at most 3) to the Czech original, we trusted the translation. In other cases, we either copied the NE from the source or we used unk on the German side, preventing the subsequent NMT system from learning a mistranslation. Instead, the unk should never match any input and the NMT system should be forced to fall back to its standard handling of unknown words. Ideally, this would be to copy the word, but since there is no copy mechanism in our NMT setups, the more probable solution of the system would be to somehow circumvent or avoid the NE in the target altogether.
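The trust criterion can be sketched as follows. The Levenshtein distance is the standard dynamic-programming formulation; the `copy_source` flag is a hypothetical stand-in for the per-type treatment of named entities, and the placeholder spelling is an assumption.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def treat_named_entity(czech_ne, german_counterpart, copy_source,
                       max_distance=3, unk="unk"):
    """Return the token to place on the synthetic German side."""
    if levenshtein(czech_ne, german_counterpart) <= max_distance:
        return german_counterpart  # close enough: trust the translation
    return czech_ne if copy_source else unk
```

For the error examples from Section 5.3.1, both Ludvik/Harold and Brno/Kraluv Dvur are far above the threshold and would therefore be replaced.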
Named entity types and their treatment are listed in Table 2. Mistranslated NEs were treated in two stages: first while improving the synthetic corpora and then during post-processing, as described in Section 6.2.
|Named Entity Type||Pre-treatment||Post-treatment|
|Numbers in addresses||copied||copied|
6 NMT Model
6.1 Model and Training
We use the Transformer architecture of Vaswani et al. (2017) implemented in the Marian framework (Junczys-Dowmunt et al., 2018) to train an NMT model on the synthetic corpus produced by the PBMT model. The model setup, training and decoding hyperparameters are identical to the CUNI Marian systems in the English-to-Czech news translation task in WMT19 (Popel et al., 2019), but in this case, due to smaller and noisier training data, we set the dropout between Transformer layers to 0.3. We use 8 Quadro P5000 GPUs with 16GB memory.
During post-processing of the translated Czech test set, we always adjusted quotation marks to suit Czech standards. Some systems were subject to further post-processing as indicated in the following section.
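A quotation mark adjustment of this kind can be sketched as below. The text does not say which input marks were converted; the sketch assumes straight double quotes that alternate between opening and closing, and maps them to the Czech low opening mark (U+201E) and high closing mark (U+201C).

```python
CZECH_OPEN = "\u201e"   # low double quotation mark
CZECH_CLOSE = "\u201c"  # closing mark used in Czech typography


def czech_quotes(text):
    """Replace straight double quotes with alternating Czech quotation marks."""
    out = []
    opening = True
    for ch in text:
        if ch == '"':
            out.append(CZECH_OPEN if opening else CZECH_CLOSE)
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)
```

Unbalanced quotes in the input would shift the alternation, so a production version would need additional checks.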
6.3 Output: NMT Systems
Our resulting systems share the same architecture and training parameters but they emerged from different stages of the training process as illustrated in Figure 1. The entire training process included training the system on the initial training corpus, fine-tuning on other corpora and final post-processing.
This system was trained on the initial synthetic data set SynthCorpus-Initial until convergence. We used early stopping after 100 non-improvements on validation cross-entropy, with a validation step of 1 000. The training finished after 3 days and 11 hours at 249 000 steps. Then we selected the checkpoint with the highest bleu-detok, which was at 211 000 steps, in epoch 3.
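The stopping and selection logic can be illustrated independently of the training framework. The sketch below is an assumption-laden illustration rather than the Marian implementation: it stops after a given number of consecutive validations without a new best cross-entropy and then picks the checkpoint with the highest BLEU. The function names are hypothetical.

```python
def early_stopping_index(valid_losses, patience=100):
    """Return the index of the validation at which training stops, or None.

    Training stops after `patience` consecutive validations without a new
    best (lowest) validation loss.
    """
    best = float("inf")
    stalled = 0
    for i, loss in enumerate(valid_losses):
        if loss < best:
            best, stalled = loss, 0
        else:
            stalled += 1
            if stalled >= patience:
                return i
    return None


def best_checkpoint(bleu_by_step):
    """Return the training step whose checkpoint has the highest BLEU."""
    return max(bleu_by_step, key=bleu_by_step.get)
```

Separating the two criteria reflects the description above: cross-entropy decides when to stop, while BLEU decides which checkpoint to keep.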
No further fine-tuning was performed. This system was not submitted to WMT19.
This system was fine-tuned on the SynthCorpus-noCzech corpus for 4 hours, when it reached a maximum, and for another 4 hours on SynthCorpus-noCzech-reordered.
This system is the result of an additional 4 hours of fine-tuning of the CUNI-Unsupervised system on the SynthCorpus-noCzech-reordered-NER corpus. Although the effect of this fine-tuning on the final translation might not be significant in terms of BLEU points, the problem of mistranslated named entities is perceived strongly by human evaluators and warrants an improvement.
The translations produced by CUNI-Unsupervised-NER were post-processed to tackle the remaining problem with named entities. We first trained GIZA++ Och and Ney (2003) alignments on 30K sentences. We used NameTag to tag NEs in Czech sentences and using the alignments, we copied personal names, geographical names and numbers from the German source to the Czech target.
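The copying step can be sketched as follows, assuming that word alignments are available as (source index, target index) pairs and that the NE tagger yields the target positions to be replaced. All names and the example sentence pair are hypothetical.

```python
def copy_named_entities(source_tokens, target_tokens, alignment, ne_positions):
    """Overwrite tagged target tokens with their aligned source tokens.

    alignment: iterable of (source_index, target_index) pairs.
    ne_positions: target indices tagged as names or numbers to be copied.
    Target tokens without an alignment link are left unchanged.
    """
    target_to_source = {}
    for src_i, tgt_i in alignment:
        target_to_source.setdefault(tgt_i, src_i)  # keep the first link only
    output = list(target_tokens)
    for tgt_i in ne_positions:
        if tgt_i in target_to_source:
            output[tgt_i] = source_tokens[target_to_source[tgt_i]]
    return output
```

Because the replacement relies entirely on the alignment links, a wrong link copies a wrong word, which matches the error types reported in Section 8.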
We translated the test set by two models and combined the results. We used NameTag to tag Czech sentences with named entities and translated the tagged sentences by CUNI-Unsupervised-NER. The sentences with no NEs were translated by the CUNI-Unsupervised system.
|Original||Der Lyriker Werner Söllner ist IM Walter.|
|Reference||Básník Werner Söllner je tajný agent Walter.|
|CUNI-Unsupervised||Prozaik Filip Bubeníček je agentem StB Josefem.|
|CUNI-Unsupervised-NER||Prozaik Filip Söllner je agentem StB Ladislavem Bártou.|
|CUNI-Unsupervised-NER-post||Prozaik Werner Söllner je agentem StB Walter.|
|Winning Systems||Sentences with NEs||Sentences with no NEs|
|Winning Systems||Sentences with NEs||Sentences with no NEs|
For comparison, we created an NMT system using the same model architecture as above but training it in a supervised way on the German-Czech parallel corpus from Europarl (Koehn, 2005) and OpenSubtitles2016 (Tiedemann, 2012), after some cleanup pre-processing and character normalization provided by Macháček (2018). As far as we know, these are the only publicly available parallel data for this language pair. They consist of 8.8M sentence pairs, with 89M and 78M tokens on the German and the Czech side, respectively. The system Benchmark-Supervised was trained from scratch for 8 days until convergence.
Our other comparison system, Benchmark-TransferEN, was first trained as an English-to-Czech NMT system (see CUNI Transformer Marian for the English-to-Czech news translation task in WMT19 by Popel et al. (2019)) and then fine-tuned for 6 days on the SynthCorpus-noCzech-reordered-NER. The vocabulary remained unchanged; it was trained on the English-Czech training corpus. This simple and effective transfer learning approach was suggested by Kocmi and Bojar (2018).
The scores of the systems on newstest2019 are reported in Table 3.
8 Final Evaluation
Table 5 summarizes the improvement we gained by introducing a special named entity treatment. We manually evaluated three systems, CUNI-Unsupervised, CUNI-Unsupervised-NER and CUNI-Unsupervised-NER-post, on a stratified subset of the validation data set created by randomly selecting 100 sentences with NEs and 100 sentences without NEs. The results are presented in two steps. The first table shows that fine-tuning the system CUNI-Unsupervised-NER on a synthetic corpus with amended NEs proved beneficial in 52% of tested sentences which included NEs and it did not harm in 20% of sentences. When comparing the two systems on sentences with no NEs, their performance is very similar.
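The construction of the stratified evaluation subset can be sketched as below. The `has_ne` predicate is a hypothetical stand-in for the NE tagger, and the fixed seed is an assumption added only to make the sketch reproducible.

```python
import random


def stratified_sample(sentences, has_ne, per_stratum=100, seed=0):
    """Randomly draw per_stratum sentences with NEs and per_stratum without.

    Raises ValueError if a stratum contains fewer than per_stratum sentences.
    """
    rng = random.Random(seed)
    with_ne = [s for s in sentences if has_ne(s)]
    without_ne = [s for s in sentences if not has_ne(s)]
    return rng.sample(with_ne, per_stratum), rng.sample(without_ne, per_stratum)
```

Sampling the two strata separately guarantees equal group sizes regardless of how frequent named entities are in the validation set.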
Furthermore, adjusting NEs during post-processing proved useful in 18% of sentences with NEs and it did not harm in 68% of sentences. Post-processing introduced two types of errors: copying German geographical names into Czech sentences (e.g. translating Norway as Norwegen instead of Norsko) and replacing a Czech named entity with a word which does not correspond to it due to wrong alignments (e.g. translating Miss Japan as Miss Miss). On the other hand, when alignments were correct, the post-processing was able to fix remaining mismatches in named entities. See Table 4 for a sample translation.
This paper contributes to recent research attempts at unsupervised machine translation. We tested the approach of Artetxe et al. (2018b) on a different language pair and faced new challenges for this type of translation caused by the dissimilar nature of the two languages (e.g. different word order, unrelated grammar rules).
We identified several patterns where the initial translation models systematically failed and we focused on alleviating such issues during fine-tuning of the system and final post-processing. The most severe type of translation error, in our opinion, was a large number of randomly mistranslated named entities which left a significant impact on the perceived translation quality. We focused on alleviating this problem both during fine-tuning of the NMT system and during the post-processing stage. While our treatment is far from perfect, we believe that an omitted named entity or a non-translated named entity causes less harm than a random name used instead.
While the performance of our systems still lags behind the supervised benchmark, it is impressive that the translations reach their quality without ever seeing an authentic parallel corpus.
This study was supported in part by the grants SVV 260 453, 1050119 of the Charles University Grant Agency, 18-24210S of the Czech Science Foundation and the EU grant H2020-ICT-2018-2-825460 (ELITR).
This work has been using language resources and tools stored and distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (LM2015071).
- Artetxe et al. (2018a) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: Volume 1, Long Papers.
- Artetxe et al. (2018b) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018b. Unsupervised statistical machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Artetxe et al. (2018c) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018c. Unsupervised neural machine translation. In Proceedings of the Sixth International Conference on Learning Representations.
- Conneau et al. (2018) Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. In International Conference on Learning Representations.
- Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Heafield (2011) Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation.
- Junczys-Dowmunt et al. (2018) Marcin Junczys-Dowmunt, Roman Grundkiewicz, Tomasz Dwojak, Hieu Hoang, Kenneth Heafield, Tom Neckermann, Frank Seide, Ulrich Germann, Alham Fikri Aji, Nikolay Bogoychev, André F. T. Martins, and Alexandra Birch. 2018. Marian: Fast neural machine translation in C++. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
- Kocmi and Bojar (2018) Tom Kocmi and Ondřej Bojar. 2018. Trivial transfer learning for low-resource neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers.
- Koehn (2005) Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the tenth Machine Translation Summit.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions.
- Lample et al. (2018) Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Lui and Baldwin (2012) Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
- Macháček (2018) Dominik Macháček. 2018. Enriching Neural MT through Multi-Task Training. Master’s thesis, Institute of Formal and Applied Linguistics, Charles University.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26. Curran Associates, Inc.
- Och and Ney (2003) Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1).
- Popel et al. (2019) Martin Popel, Dominik Macháček, Michal Auersperger, Ondřej Bojar, and Pavel Pecina. 2019. English-Czech systems in WMT19: Document-level transformer. In Proceedings of the Fourth Conference on Machine Translation: Volume 2, Shared Task Papers.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: Volume 1, Long Papers.
- Stanojević and Sima’an (2014) Miloš Stanojević and Khalil Sima’an. 2014. Fitting sentence level translation evaluation with many dense features. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
- Straková et al. (2014) Jana Straková, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations.
- Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the 8th International Conference on Language Resources and Evaluation.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30. Curran Associates, Inc.
- Wang et al. (2016) Weiyue Wang, Jan-Thorsten Peter, Hendrik Rosendahl, and Hermann Ney. 2016. CharacTER: Translation edit rate on character level. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers.
- Zhang et al. (2017) Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Volume 1, Long Papers.