Building a Neural Machine Translation System Using Only Synthetic Parallel Data

04/02/2017 · Jaehong Park, et al. · Seoul National University

Recent works have shown that synthetic parallel data automatically generated by translation models can be effective for addressing various issues in neural machine translation (NMT). In this study, we build NMT systems using only synthetic parallel data. As an efficient alternative to real parallel data, we also present a new type of synthetic parallel corpus. The proposed pseudo parallel data are distinct from previous works in that ground truth and synthetic examples are mixed on both sides of sentence pairs. Experiments on Czech-German and French-German translation tasks demonstrate the efficacy of the proposed pseudo parallel corpus, which shows not only enhanced results for bidirectional translation tasks but also substantial improvement with the aid of a ground truth parallel corpus.

1 Introduction

Given the data-driven nature of neural machine translation (NMT), the limited availability of source-to-target bilingual sentence pairs has been one of the major obstacles in building competitive NMT systems. Recently, pseudo parallel data, which refer to synthetic bilingual sentence pairs automatically generated by existing translation models, have shown promising results in addressing data scarcity in NMT. Many studies have found that pseudo parallel data combined with a real bilingual parallel corpus significantly enhance the quality of NMT models Sennrich et al. (2015a); Zhang and Zong (2016b); Cheng et al. (2016b). In addition, synthesized parallel data have played vital roles in many NMT problems such as domain adaptation Sennrich et al. (2015a), zero-resource NMT Firat et al. (2016b), and the rare word problem Zhang and Zong (2016a).

Inspired by their efficacy, we attempt to train NMT models using only synthetic parallel data. To the best of our knowledge, building NMT systems with only pseudo parallel data has yet to be studied. Through this work, we explore the viability of synthetic parallel data as an effective alternative to a real parallel corpus. The active use of synthetic data in NMT is particularly significant in low-resource environments where ground truth parallel corpora are very limited or not established. Even in recent approaches such as zero-shot NMT Johnson et al. (2016) and pivot-based NMT Cheng et al. (2016a), where direct source-to-target bilingual data are not required, a direct parallel corpus brings substantial improvements in translation quality, and pseudo parallel data can be employed in its place.

Previously suggested synthetic data, however, have several drawbacks that limit their reliability as an alternative to a real parallel corpus. As illustrated in Figure 1, existing pseudo parallel corpora can be classified into two groups: source-originated and target-originated. The common property between them is that ground truth examples exist only on a single side (source or target) of the pseudo sentence pairs, while the other side is composed of synthetic sentences only. This bias of synthetic examples toward one side of the sentence pairs may lead to an imbalance in the quality of the learned NMT models when the given pseudo parallel corpus is exploited in bidirectional translation tasks (e.g., French→German and German→French). In addition, the reliability of the synthetic parallel data is heavily influenced by the single translation model from which the synthetic examples originate. Low-quality synthetic sentences generated by that translation model would prevent NMT models from learning solid parameters.

To overcome these shortcomings, we propose a novel synthetic parallel corpus called PSEUDOmix. In contrast to previous works, PSEUDOmix includes both synthetic and real sentences on both sides of its sentence pairs. In practice, it can be readily built by mixing source- and target-originated pseudo parallel corpora for a given translation task. Experiments on several language pairs demonstrate that the proposed PSEUDOmix exhibits properties that make it a reliable alternative to real-world parallel data. In detail, we make the following contributions:

  1. PSEUDOmix yields more balanced translation quality than existing pseudo parallel corpora in bidirectional translation tasks. For each task, it outperforms both source- and target-originated data when their performance gap is within a certain range.

  2. When fine-tuned using real parallel data, the model trained with PSEUDOmix outperforms other fine-tuned models trained with source-originated and target-originated synthetic parallel data, indicating substantial improvement in translation quality.

2 Neural Machine Translation

Given a source sentence $x = (x_1, \ldots, x_{T_x})$ and its corresponding target sentence $y = (y_1, \ldots, y_{T_y})$, NMT aims to model the conditional probability $p(y|x)$ with a single large neural network. To parameterize the conditional distribution, recent studies on NMT employ the encoder-decoder architecture Kalchbrenner and Blunsom (2013); Cho et al. (2014b); Sutskever et al. (2014). Thereafter, the attention mechanism Bahdanau et al. (2014); Luong et al. (2015) was introduced and successfully addressed the quality degradation of NMT when dealing with long input sentences Cho et al. (2014a).

In this study, we use the attentional NMT architecture proposed by Bahdanau et al. (2014). In their work, the encoder, which is a bidirectional recurrent neural network, reads the source sentence and generates a sequence of source representations $h = (h_1, \ldots, h_{T_x})$. The decoder, which is another recurrent neural network, produces the target sentence one symbol at a time. The log conditional probability can thus be decomposed as follows:

$$\log p(y|x) = \sum_{t=1}^{T_y} \log p(y_t \mid y_{<t}, x) \quad (1)$$

where $y_{<t} = (y_1, \ldots, y_{t-1})$. As described in Equation (2), the conditional distribution of $y_t$ is modeled as a function $g$ of the previously predicted output $y_{t-1}$, the hidden state of the decoder $s_t$, and the context vector $c_t$:

$$p(y_t \mid y_{<t}, x) = g(y_{t-1}, s_t, c_t) \quad (2)$$

The context vector $c_t$ is used to determine the relevant part of the source sentence when predicting $y_t$. It is computed as the weighted sum of the source representations $h_j$. Each weight $\alpha_{tj}$ for $h_j$ implies the probability of the target symbol $y_t$ being aligned to the source symbol $x_j$:

$$c_t = \sum_{j=1}^{T_x} \alpha_{tj} h_j \quad (3)$$

Given a sentence-aligned parallel corpus of size $N$, the entire parameter set $\theta$ of the NMT model is jointly trained to maximize the conditional probabilities of all sentence pairs $\{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$:

$$\theta^{*} = \underset{\theta}{\operatorname{arg\,max}} \sum_{n=1}^{N} \log p\left(y^{(n)} \mid x^{(n)}\right) \quad (4)$$

where $\theta^{*}$ is the optimal parameter set.
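To make Equations (1)-(3) concrete, the following NumPy sketch computes the attention weights and the context vector for a single decoding step. It is only an illustration of the additive attention described above; the parameter names (W_a, U_a, v_a) and the dimensions are our own assumptions, not the authors' implementation.

```python
import numpy as np

def attention_context(s_prev, H, W_a, U_a, v_a):
    """Sketch of Equations (2)-(3): alignment weights alpha over the source
    representations H (T_x x d_h) given the previous decoder state s_prev,
    followed by the context vector c_t as their weighted sum."""
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a   # additive alignment scores, shape (T_x,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # softmax over source positions
    c_t = alpha @ H                                  # weighted sum of source representations
    return c_t, alpha

# Toy usage with random parameters (dimensions are illustrative only).
rng = np.random.default_rng(0)
T_x, d_h, d_s, d_a = 7, 16, 16, 12
H, s_prev = rng.standard_normal((T_x, d_h)), rng.standard_normal(d_s)
W_a, U_a, v_a = rng.standard_normal((d_s, d_a)), rng.standard_normal((d_h, d_a)), rng.standard_normal(d_a)
c_t, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
```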

Figure 1: The process of building each pseudo parallel corpus group for French→German translation. * indicates the synthetic sentences generated by translation models. Each of the source-originated and target-originated synthetic parallel data can be made from French or German monolingual corpora. They can also be built from parallel corpora including English, which is the pivot language.

3 Related Work

In statistical machine translation (SMT), synthetic bilingual data have been primarily proposed as a means to exploit monolingual corpora. By applying a self-training scheme, the pseudo parallel data were obtained by automatically translating the source-side monolingual corpora Ueffing et al. (2007); Wu et al. (2008). In a similar but reverse way, the target-side monolingual corpora were also employed to build the synthetic parallel data Bertoldi and Federico (2009); Lambert et al. (2011). The primary goal of these works was to adapt trained SMT models to other domains using relatively abundant in-domain monolingual data.

Inspired by the successful application in SMT, there have been efforts to exploit synthetic parallel data to improve NMT systems. Source-side Zhang and Zong (2016b), target-side Sennrich et al. (2015a), and both sides Cheng et al. (2016b) of the monolingual data have been used to build synthetic parallel corpora. In these works, the pseudo parallel data combined with a real training corpus significantly enhanced the translation quality of NMT. In Sennrich et al. (2015a), domain adaptation of NMT was achieved by fine-tuning trained NMT models using a synthetic parallel corpus. Firat et al. (2016b) attempted to build NMT systems without any direct source-to-target parallel corpus. In their work, the pseudo parallel corpus was employed in fine-tuning the target-specific attention mechanism of trained multi-way multilingual NMT models Firat et al. (2016a), which enabled zero-resource NMT between the source and target languages. Lastly, synthetic sentence pairs have been utilized to enrich training examples containing rare or unknown translation lexicons Zhang and Zong (2016a).

4 Synthetic Parallel Data as an Alternative to Real Parallel Corpus

4.1 Motivation

As described in the previous section, synthetic parallel data have been widely used to boost the performance of NMT. In this work, we further extend their application by training NMT models with synthetic data only. In language pairs or domains where source-to-target real parallel corpora are very scarce or not yet established, a model trained with synthetic parallel data can serve as an effective baseline. Once an additional ground truth parallel corpus is established, the trained model can be improved by retraining or fine-tuning with the real parallel data.

4.2 Limits of the Previous Approaches

For a given translation task, we classify the existing pseudo parallel data into the following groups:

  (a) Source-originated: The source sentences are drawn from a real corpus, and the associated target sentences are synthetic. The corpus can be formed by automatically translating a source-side monolingual corpus into the target language Zhang and Zong (2016a, b). It can also be built from source-pivot bilingual data by introducing a pivot language. In this case, a pivot-to-target translation model is employed to translate the pivot-language side into the target language. The generated target sentences, paired with the original source sentences, form a pseudo parallel corpus.

  (b) Target-originated: The target sentences are drawn from a real corpus, and the associated source sentences are synthetic. The corpus can be formed by back-translating a target-side monolingual corpus into the source language Sennrich et al. (2015a). Similar to the source-originated case, it can be built from a pivot-target bilingual corpus using a pivot-to-source translation model Firat et al. (2016b).

The process of building each synthetic parallel corpus is illustrated in Figure 1. As shown in the figure, the previous studies on pseudo parallel data share a common property: synthetic and ground truth sentences are confined to a single side of the sentence pairs. When the synthetic parallel data are the only or the major resource used to train NMT, this may severely limit the usefulness of the given pseudo parallel corpus. For instance, as will be demonstrated in our experiments, synthetic data showing relatively high quality in one translation task (e.g., French→German) can produce poor results in the translation task of the reverse direction (German→French).
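As a rough illustration of how the two corpus groups in Figure 1 can be assembled from a pivot-aligned corpus, the sketch below pairs real sentences on one side with pivot-translated sentences on the other. The translate_* functions are placeholders for pre-trained pivot translation models and are purely illustrative.

```python
def build_pseudo_corpora(fr_en_de_triples, translate_en_to_de, translate_en_to_fr):
    """Sketch only: build source-originated (Fr-De*) and target-originated (Fr*-De)
    pseudo corpora for Fr->De from (fr, en, de) triples sharing an English pivot."""
    source_originated = []  # real French, synthetic German
    target_originated = []  # synthetic French, real German
    for fr, en, de in fr_en_de_triples:
        source_originated.append((fr, translate_en_to_de(en)))
        target_originated.append((translate_en_to_fr(en), de))
    return source_originated, target_originated
```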

Another drawback of employing synthetic parallel data in training NMT is that the capacity of the synthetic parallel corpus is inherently bounded by the mother translation model from which the synthetic sentences originate. Depending on the quality of the mother model, ill-formed or inaccurate synthetic examples may be generated, which would degrade the reliability of the resulting synthetic parallel data. In a previous study, Zhang and Zong (2016b) bypassed this issue by freezing the decoder parameters while training on the minibatches of pseudo bilingual pairs made from a source-language monolingual corpus. This scheme, however, cannot be applied to our scenario, as the decoder network would then remain untrained during the entire training process.

4.3 Proposed Mixing Approach

To overcome the limitations of the previously suggested pseudo parallel data, we propose a new type of synthetic parallel corpus called PSEUDOmix. Our approach is quite straightforward: for a given translation task, we first build both source-originated and target-originated pseudo parallel data. PSEUDOmix can then be readily built by mixing them together. The overall process of building PSEUDOmix for the French→German translation task is illustrated in Figure 1.

By mixing source- and target-originated pseudo parallel data, the resulting corpus includes both real and synthetic examples on both sides of its sentence pairs, which is the most distinctive feature of PSEUDOmix. Through the mixing approach, we attempt to lower the overall discrepancy in quality between the source and target sides of the synthetic sentence pairs, thus enhancing their reliability as a parallel resource. In the following section, we evaluate the actual benefits of this mixed composition.

Corpus              Size     Avg len (Fr)   Avg len (De)
Europarl Fr-En-De   1.78M    26.00          23.16
Fr-De*              1.45M    25.56          22.98
Fr*-De              1.45M    25.32          23.46
PSEUDOmix           1.45M    25.47          23.26

Table 1: Statistics of the parallel corpora for the Fr↔De translation tasks. The notation * denotes the synthetic part of the parallel corpus.
Corpus        Fr→De                            De→Fr
              nt2011    nt2012    nt2013       nt2011    nt2012    nt2013
Fr-De*        13.30     13.81     14.89        18.78     19.01     20.32
Fr*-De        13.81     14.52     15.20        18.46     18.73     19.82
PSEUDOmix     13.90     14.50     15.57        18.81     19.33     20.41

Table 2: Translation results (BLEU) for the Fr↔De experiments on the newstest 2011/2012/2013 sets (nt2011/nt2012/nt2013). The notation * denotes the synthetic part of the parallel corpus. The highest BLEU for each set is bold-faced.

5 Experiments: Effects of Mixing Real and Synthetic Sentences

In this section, we analyze the effects of the mixed composition of the synthetic parallel data. Mixing pseudo parallel corpora derived from different sources, however, inevitably introduces diversity, which affects the capacity of the resulting corpus. We isolate this factor by building both the source- and target-originated synthetic corpora from an identical source-to-target real parallel corpus. Our experiments are performed on French (Fr) ↔ German (De) translation tasks. Throughout the remainder of the paper, we use the notation * to denote the synthetic part of the pseudo sentence pairs.

5.1 Data Preparation

By choosing English (En) as the pivot language, we perform pivot alignments for identical English segments on the Europarl Fr-En and En-De parallel corpora Koehn (2005), constructing a multi-parallel corpus of Fr-En-De. Each of the Fr*-De and Fr-De* pseudo parallel corpora is then established from the multi-parallel data by applying the pivot language-based translation described in the previous section. For automatic translation, we utilize a pre-trained, publicly released NMT model for En→De (http://data.statmt.org/rsennrich/wmt16_systems) and train another NMT model for En→Fr using the WMT'15 En-Fr parallel corpus Bojar et al. (2015). A beam of size 5 is used to generate the synthetic sentences. Lastly, to match the size of the training data, PSEUDOmix is established by randomly sampling half of each of the Fr*-De and Fr-De* corpora and mixing them together.
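A minimal sketch of this final mixing step is given below, assuming uniform sampling without replacement of half of each pseudo corpus; the exact sampling procedure is not specified beyond "randomly sampling half".

```python
import random

def build_pseudomix(source_originated, target_originated, seed=0):
    """Sample half of each pseudo parallel corpus and combine them so that
    PSEUDOmix matches the size of the individual corpora (a sketch)."""
    rng = random.Random(seed)
    half_src = rng.sample(source_originated, len(source_originated) // 2)
    half_tgt = rng.sample(target_originated, len(target_originated) // 2)
    mixed = half_src + half_tgt
    rng.shuffle(mixed)
    return mixed
```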

5.2 Data Preprocessing

Each training corpus is tokenized using the tokenization script in Moses Koehn et al. (2007). We represent every sentence as a sequence of subword units learned from byte-pair encoding Sennrich et al. (2015b). We remove empty lines and all the sentences of length over 50 subword units. For a fair comparison, all cleaned synthetic parallel data have equal sizes. The summary of the final parallel corpora is presented in Table 1.
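The cleaning step can be pictured with the short filter below, applied after tokenization and BPE segmentation. We assume a sentence pair is dropped whenever either side is empty or exceeds 50 subword units; the text above does not state how one-sided violations are handled.

```python
def clean_parallel(pairs, max_len=50):
    """Drop empty lines and over-long sentence pairs (a sketch of the cleaning step).
    `pairs` contains (src, tgt) strings that are already tokenized and BPE-segmented."""
    cleaned = []
    for src, tgt in pairs:
        src_units, tgt_units = src.split(), tgt.split()
        if not src_units or not tgt_units:
            continue                       # remove empty lines
        if len(src_units) > max_len or len(tgt_units) > max_len:
            continue                       # remove sentences over 50 subword units
        cleaned.append((src, tgt))
    return cleaned
```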

5.3 Training and Evaluation

All networks have 1024 hidden units and 500-dimensional embeddings. The vocabulary size is limited to 30K for each language. Each model is trained for 10 epochs using stochastic gradient descent with Adam Kingma and Ba (2014). The minibatch size is 80, and the training set is reshuffled between every epoch. The norm of the gradient is clipped not to exceed 1.0 Pascanu et al. (2013). The same learning rate is used in every case.
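The training loop can be summarized by the hedged PyTorch-style sketch below: Adam updates over minibatches of 80 pairs with the gradient norm clipped at 1.0. The model interface (a callable returning the batch loss) is our assumption, and no specific learning rate is assumed here.

```python
import torch

def train_one_epoch(model, batches, optimizer, clip_norm=1.0):
    """One epoch of NMT training as described above (sketch): Adam steps over
    reshuffled minibatches, with the gradient norm clipped not to exceed 1.0.
    `model(batch)` is assumed to return the batch negative log-likelihood."""
    model.train()
    for batch in batches:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        optimizer.step()
```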

We use the newstest 2012 set as a development set and the newstest 2011 and newstest 2013 sets as test sets. At test time, beam search is used to approximately find the most likely translation. We use a beam of size 12 and normalize probabilities by the length of the candidate sentences. The evaluation metric is case-sensitive tokenized BLEU Papineni et al. (2002) computed with the multi-bleu.perl script from Moses. For each case, we present the average BLEU of three different models trained from scratch.
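Length normalization of the beam candidates can be read as ranking finished hypotheses by their average token log-probability, as in the small sketch below (our interpretation of "normalize probabilities by the length of the candidate sentences").

```python
def length_normalized_score(token_log_probs):
    """Score a finished beam hypothesis by its per-token log-probability (sketch)."""
    return sum(token_log_probs) / len(token_log_probs)

# Example: the longer hypothesis wins once scores are normalized by length.
short_hyp = [-0.2, -0.6, -0.4]
long_hyp = [-0.3, -0.3, -0.3, -0.3]
best = max([short_hyp, long_hyp], key=length_normalized_score)
```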

Corpus              Fr→De                          De→Fr
                    nt2011   nt2012   nt2013       nt2011   nt2012   nt2013
(a) Fr*-De (K=3)    13.76    14.43    15.18        -        -        -
(b) Fr*-De (K=5)    13.78    14.49    15.23        17.76    18.63    19.73
(a) + (b)           13.74    14.38    15.27        -        -        -
(c) Fr-De* (K=3)    -        -        -            18.44    18.70    20.32
(d) Fr-De* (K=5)    13.36    14.08    15.28        18.18    18.76    20.13
(c) + (d)           -        -        -            18.06    18.63    20.21
(b) + (d)           13.93    14.27    15.53        18.52    19.04    20.33

Table 3: Translation results (BLEU) for the Fr↔De experiments on the newstest 2011/2012/2013 sets (nt2011/nt2012/nt2013). K denotes the beam size used to generate the corresponding synthetic parallel data. The highest BLEU for each set is bold-faced.

5.4 Results and Analysis

5.4.1 A Comparison between Pivot-based Approach and Back-translation

Before choosing the pivot language-based method for data synthesis, we conduct a preliminary experiment comparing the pivot-based approach with direct back-translation. The model used for direct back-translation was trained with the ground truth Europarl Fr-De data made from the multi-parallel corpus presented in Table 1. On the newstest 2012/2013 sets, the synthetic corpus generated using the pivot approach showed higher BLEU (19.11 / 20.45) than its back-translation counterpart (18.23 / 19.81) when used to train a De→Fr NMT model. Although the back-translation method has been effective in many studies Sennrich et al. (2015a, 2016), its applicability becomes restricted in low-resource cases, which are our major concern, owing to the poor quality of a back-translation model built from a limited source-to-target parallel corpus. Instead, one can utilize abundant pivot-to-target parallel corpora by using a rich-resource language as the pivot language. This consequently improves the quality of the baseline translation models used for generating synthetic corpora.

5.4.2 Effects of Mixing Source- and Target-originated Synthetic Data

From Table 2, we find that the bias of the synthetic examples in pseudo parallel corpora leads to imbalanced quality in the bidirectional translation tasks. Given that the source- or target-originated classification of a specific synthetic corpus is reversed depending on the direction of translation, the overall results imply that the target-originated corpus for each translation task outperforms the source-originated data. The preference for target-originated synthetic data over source-originated counterparts was formerly investigated in SMT by Lambert et al. (2011). In NMT, it can be explained by the degradation in the quality of the source-originated data owing to the erroneous target-side language model formed by the synthetic target sentences. In contrast, we observe that PSEUDOmix not only produces balanced results for both the Fr→De and De→Fr translation tasks but also shows the best or competitive translation quality for each task.

We note that mixing two different synthetic corpora leads to improved BLEU rather than an intermediate value. To investigate the cause of the improvement in PSEUDOmix, we build an additional target-originated synthetic corpus for each direction of the Fr↔De translation task with a beam of size 3. As shown in Table 3, for the De→Fr task, the new target-originated corpus (c) shows higher BLEU than the source-originated corpus (b) by itself. The improvement in BLEU, however, occurs only when the source- and target-originated synthetic parallel data are mixed (b+d), not when two target-originated synthetic corpora are mixed (c+d). The same phenomenon is observed in the Fr→De case as well. The results suggest that real and synthetic sentences mixed on both sides of the sentence pairs enhance the capability of a synthetic parallel corpus. We conjecture that ground truth examples in both the encoder and decoder networks not only compensate for the erroneous language model learned from synthetic sentences but also reinforce patterns of use latent in the pseudo sentences.

Corpus        Fr→De                De→Fr
              NMT      SMT         NMT      SMT
Fr-De*        14.89    11.65       20.32    17.46
Fr*-De        15.20    12.06       19.82    17.38
PSEUDOmix     15.57    12.19       20.41    17.79

Table 4: Translation results (BLEU) for the Fr↔De experiments evaluated on the newstest 2013 set.

5.4.3 A Comparison with Phrase-based Statistical Machine Translation

We also evaluate the effects of the proposed mixing strategy in phrase-based statistical machine translation Koehn et al. (2003). We use Moses Koehn et al. (2007) with its baseline configuration for training. A 5-gram Kneser-Ney model is used as the language model. Table 4 shows the translation results of the phrase-based statistical machine translation (PBSMT) systems. In all experiments, NMT shows higher BLEU (by 2.44-3.38) than the PBSMT setting. We speculate that the deep architecture of NMT provides robustness to the noise in the synthetic examples. It is also notable that the proposed PSEUDOmix outperforms the other synthetic corpora in PBSMT. These results clearly show that the benefit of the mixed composition in synthetic sentence pairs extends beyond a specific machine translation framework.

(a) Cs↔De
Corpus           Size    Avg len (Cs)   Avg len (De)
Europarl+NC11    0.6M    23.54          25.49
Cs-De*           3.5M    25.33          26.01
Cs*-De           3.5M    23.31          25.37
PSEUDOmix        3.5M    24.39          25.72

(b) Fr↔De
Corpus           Size    Avg len (Fr)   Avg len (De)
Europarl+NC11    1.8M    26.18          24.08
Fr-De*           3.7M    26.67          23.71
Fr*-De           3.7M    25.42          24.90
PSEUDOmix        3.7M    26.01          24.33

Table 5: Statistics of the training parallel corpora for the large-scale Cs↔De and Fr↔De translation tasks.

6 Experiments: Large-scale Application

The experiments presented in the previous section verify the potential of PSEUDOmix as an efficient alternative to real parallel data. The conditions in the previous case, however, are somewhat artificial, as we deliberately matched the sources of all pseudo parallel corpora. In this section, we move on to more practical, large-scale applications of synthetic parallel data. Experiments are conducted on Czech (Cs) ↔ German (De) and French (Fr) ↔ German (De) translation tasks.

6.1 Application Scenarios

We analyze the efficacy of the proposed mixing approach in the following application scenarios:

  (a) Pseudo Only: This setting trains NMT models using only synthetic parallel data, without any ground truth parallel corpus.

  (b) Real Fine-tuning: Once the training of an NMT model is completed in the Pseudo Only manner, the model is fine-tuned using only a ground truth parallel corpus.

The suggested scenarios reflect low-resource situations in building NMT systems. In the Real Fine-tuning, we fine-tune the best model of the Pseudo Only scenario evaluated on the development set.

6.2 Data Preparation

We use the parallel corpora from the shared translation tasks of WMT'15 and WMT'16 Bojar et al. (2016). Using the same pivot-based technique as in the previous task, the Cs-De* and Fr-De* corpora are built from the WMT'15 Cs-En and Fr-En parallel data, respectively. For Cs*-De and Fr*-De, the WMT'16 En-De parallel data are employed. We again use pre-trained NMT models for En→Cs, En→De, and En→Fr to generate the synthetic sentences. A beam of size 1 is used for fast decoding.
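A beam of size 1 amounts to greedy decoding, sketched below; `step_fn` stands in for one decoder step of the pre-trained pivot models and is purely illustrative.

```python
def greedy_decode(step_fn, init_state, bos_id, eos_id, max_len=100):
    """Beam size 1 reduces to greedy decoding: pick the most probable token at
    each step until the end-of-sentence symbol is produced (a sketch).
    `step_fn(state, token)` is assumed to return (vocab log-probs, next state)."""
    tokens, state, prev = [], init_state, bos_id
    for _ in range(max_len):
        log_probs, state = step_fn(state, prev)
        prev = max(range(len(log_probs)), key=log_probs.__getitem__)  # argmax token
        if prev == eos_id:
            break
        tokens.append(prev)
    return tokens
```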

For the Real Fine-tuning scenario, we use real parallel corpora from the Europarl and News Commentary11 (NC11) datasets. These direct parallel corpora are obtained from OPUS Tiedemann (2012). The size of each set of ground truth and synthetic parallel data is presented in Table 5. Given that the training corpora for widely studied language pairs amount to several million lines, the Cs-De language pair (0.6M) reasonably represents a low-resource situation. On the other hand, the Fr-De language pair (1.8M) is considered relatively resource-rich in our experiments. The details of the preprocessing are identical to those in the previous case.

6.3 Training and Evaluation

We use the same experimental settings that we used for the previous case except for the Real Fine-tuning scenario. In the fine-tuning step, we use the learning rate of , which produced better results. Embeddings are fixed throughout the fine-tuning steps. For evaluation, we use the same development and test sets used in the previous task.
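The Real Fine-tuning setup can be sketched as follows, assuming a PyTorch-style model that exposes its source and target embedding modules (the attribute names are ours): embeddings are frozen and only the remaining parameters receive Adam updates at the fine-tuning learning rate.

```python
import torch

def setup_real_finetuning(model, finetune_lr):
    """Freeze embeddings and fine-tune the remaining parameters (a sketch).
    `src_embedding` / `tgt_embedding` are assumed attribute names."""
    for emb in (model.src_embedding, model.tgt_embedding):
        for p in emb.parameters():
            p.requires_grad = False       # embeddings stay fixed during fine-tuning
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=finetune_lr)
```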

6.4 Results and Analysis

6.4.1 A Comparison with Real Parallel Data

Table 6 shows the results of the Pseudo Only scenario on the Cs↔De and Fr↔De tasks. For the baseline comparison, we also present the translation quality of the NMT models trained with the ground truth Europarl+NC11 parallel corpora (a). In Cs↔De, the Pseudo Only scenario outperforms the real parallel corpus by 3.86-4.43 BLEU on the newstest 2013 set. Even for the Fr↔De case, where the size of the real parallel corpus is relatively large, the best BLEU of the pseudo parallel corpora is higher than that of the real parallel corpus by 1.30 (Fr→De) and 0.49 (De→Fr). We list the results on the newstest 2011 and newstest 2012 sets in the appendix. From these results, we conclude that large-scale synthetic parallel data can serve as an effective alternative to real parallel corpora, particularly in low-resource language pairs.

(a) Cs↔De
Baseline                          Cs→De            De→Cs
(a) Europarl+NC11                 14.96            12.36
(b)  + pivot back-trans corpus    18.98 (+4.02)    16.76 (+4.40)

                    Cs→De                              De→Cs
Synthetic Corpus    Pseudo Only   Real Fine-tuning     Pseudo Only   Real Fine-tuning
Cs-De*              16.87         18.82 (+1.95)        15.29         16.50 (+1.21)
Cs*-De              18.62         19.02 (+0.40)        16.51         16.96 (+0.45)
PSEUDOmix           18.82         19.35 (+0.53)        16.79         17.47 (+0.68)

(b) Fr↔De
Baseline                          Fr→De            De→Fr
(a) Europarl+NC11                 17.68            22.39
(b)  + pivot back-trans corpus    19.27 (+1.59)    24.32 (+1.93)

                    Fr→De                              De→Fr
Synthetic Corpus    Pseudo Only   Real Fine-tuning     Pseudo Only   Real Fine-tuning
Fr-De*              17.57         19.22 (+1.65)        22.88         24.30 (+1.42)
Fr*-De              18.55         19.59 (+1.04)        19.87         24.61 (+4.74)
PSEUDOmix           18.98         19.85 (+0.87)        22.71         24.70 (+1.99)

Table 6: Translation results (BLEU) for the Pseudo Only and Real Fine-tuning scenarios evaluated on the newstest 2013 set. For the results of the Real Fine-tuning, the values in parentheses are improvements in BLEU compared to the Pseudo Only setting. The highest BLEU for each translation task is bold-faced.
Figure 2: Translation results for the De→Fr task on the newstest 2013 set with respect to the quality of the mother model for the source-originated Fr*-De data. The quality of the mother model is evaluated on the En-Fr newstest 2012 set.

6.4.2 Results from the Pseudo Only Scenario

As shown in Table 6, the model learned from the Cs*-De corpus outperforms the model trained with the Cs-De* corpus in every case. This result differs slightly from the previous experiments, where the target-originated synthetic corpus for each translation task reported better results than the source-originated data. It arises from the diversity in the sources of the pseudo parallel corpora, which vary in their suitability for the given test set. Table 6 also shows that mixing the Cs*-De corpus with the lower-quality Cs-De* corpus brings improvements in the resulting PSEUDOmix, which shows the highest BLEU in both directions of the Cs↔De translation task. In addition, PSEUDOmix again shows much more balanced performance in the Fr↔De translations compared to the other synthetic parallel corpora.

While the mixing strategy compensates for most of the gap between the Fr-De* and the Fr*-De data (3.01 → 0.17) in the De→Fr case, the resulting PSEUDOmix still shows lower BLEU than the target-originated Fr-De* corpus. We thus enhance the quality of the synthetic examples in the source-originated Fr*-De data by further training its mother translation model (En→Fr). As illustrated in Figure 2, with the target-originated Fr-De* corpus fixed, the quality of the models trained with the source-originated Fr*-De data and PSEUDOmix increases in proportion to the quality of the mother model for the Fr*-De corpus. Eventually, PSEUDOmix shows the highest BLEU, outperforming both the Fr*-De and Fr-De* data. The results indicate that the benefit of the proposed mixing approach becomes much more evident when the quality gap between the source- and target-originated synthetic data is within a certain range.

6.4.3 Results from the Real Fine-tuning Scenario

As presented in Table 6, we observe that fine-tuning with ground truth parallel data brings substantial improvements in the translation quality of all NMT models. Among all fine-tuned models, the one trained with PSEUDOmix shows the best performance in every experiment. This is particularly encouraging for the De→Fr case, where PSEUDOmix reported lower BLEU than the Fr-De* data before fine-tuning. Even in the cases where PSEUDOmix shows results comparable to the other synthetic corpora in the Pseudo Only scenario, it achieves larger improvements in translation quality when fine-tuned with the real parallel data. These results clearly demonstrate the strengths of the proposed PSEUDOmix: competitive translation quality by itself and relatively higher potential for improvement through refinement with ground truth parallel corpora.

In Table 6 (b), we also present the performance of NMT models learned from the ground truth Europarl+NC11 data merged with the target-originated synthetic parallel corpus for each task. This is identical in spirit to the method of Sennrich et al. (2015a), which employs back-translation for data synthesis. Instead of direct back-translation, we used pivot-based back-translation, as we verified the strength of pivot-based data synthesis in low-resource environments. Although the ground truth data are used only for the refinement, the Real Fine-tuning scheme applied to PSEUDOmix shows better translation quality than the models trained with the merged corpus (b). Even the Real Fine-tuning results on the target-originated corpus are comparable to training with the merged corpus from scratch. The overall results support the efficacy of the proposed two-step method in practical applications: the Pseudo Only step introduces a useful prior on the NMT parameters, and the Real Fine-tuning step reorganizes the pre-trained NMT parameters using in-domain parallel data.

7 Conclusion

In this work, we have constructed NMT systems using only synthetic parallel data. For this purpose, we propose a novel pseudo parallel corpus called PSEUDOmix, in which synthetic and ground truth examples are mixed on both sides of sentence pairs. Experiments show that the proposed PSEUDOmix not only yields enhanced results for bidirectional translation but also reports substantial improvement when fine-tuned with ground truth parallel data. Our work is significant in that it provides a thorough investigation of the use of synthetic parallel corpora in low-resource NMT environments. Without any adjustment, the proposed method can also be extended to other learning areas where parallel samples are employed. For future work, we plan to explore robust data sampling methods that would maximize the quality of the mixed synthetic parallel data.

References