Neural Machine Translation (NMT) has produced remarkable results given large-scale parallel corpora. However, for low-resource languages and domain-specific translation tasks, the available parallel corpora are small, and translation performance degrades considerably (Koehn and Knowles, 2017). The study of NMT under low-resource conditions therefore has high practical value.
In this paper, we propose a corpus augmentation method that segments the long sentences of a corpus into partial sentences, back-translates them, and generates pseudo-parallel sentence pairs. The enlarged corpus can improve translation performance: in experiments on the Japanese–Chinese scientific paper excerpt corpus (ASPEC-JC) as a low-resource corpus, the translation results surpassed the baseline in both the Japanese-Chinese and Chinese-Japanese directions.
The main contributions of this paper are as follows. We demonstrate that the translation performance of NMT systems can be improved by mixing generated pseudo-parallel sentence pairs into the training data, using no monolingual data and without changing the neural network architecture. This makes our approach applicable to different NMT architectures.
2 Related Work
Expanding the parallel corpus is an effective means of improving translation quality for NMT in low-resource languages. A parallel corpus can be constructed quickly using back-translation with monolingual target-side data (Sennrich et al., 2015). Sennrich et al. (2017) also showed that even simply duplicating the monolingual target data and using it as the source data was sufficient to realize some benefit. Moreover, a pseudo-parallel corpus can be constructed using the copy method, in which target-language sentences are copied as the corresponding source-language sentences (Currey et al., 2017), which illustrates that even poor translations can be beneficial. Data augmentation for low-frequency words has also proven effective (Fadaee et al., 2017).
Regarding back-translation, Gwinnup et al. (2017) implemented their NMT system by applying back-translation iteratively. Lample et al. (2018) explored the use of generated back-translated data, aided by denoising with a language model trained on the target side. Translation performance can also be improved by iterative back-translation in both high-resource and low-resource scenarios (Poncelas et al., 2018). A more refined idea of back-translation is the dual learning approach of He et al. (2016), which integrates training on parallel data and training on monolingual data via round-tripping.
3 NMT and ASPEC-JC Corpus
For this research, we follow the NMT architecture of Luong et al. (2015), implemented as a global attentional encoder–decoder neural network with Long Short-Term Memory (LSTM). We use it at the character level because, between Japanese and Chinese, character-level translation performs better than word-level translation. It is noteworthy, however, that our proposed method is not specific to this architecture.
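To make the character-level setting concrete, the following minimal sketch (our illustration, not the authors' code) shows how Japanese or Chinese text can be tokenized into characters, so no word segmenter is needed at translation time:

```python
# Character-level preprocessing sketch (hypothetical helper, not the authors'
# code): every non-space character, punctuation included, becomes one token.
def to_char_tokens(sentence: str) -> list[str]:
    return [ch for ch in sentence if not ch.isspace()]

print(to_char_tokens("機械翻訳"))  # → ['機', '械', '翻', '訳']
```

Because Japanese and Chinese are written without spaces, this trivially sidesteps word-segmentation errors, at the cost of longer sequences.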
We conducted experiments with the ASPEC-JC corpus, which was constructed by manually translating Japanese scientific papers into Chinese Nakazawa et al. (2016). ASPEC-JC comprises four parts: training data (672,315 sentence pairs), development data (2,090 sentence pairs), development-test data (2,148 sentence pairs) and test data (2,107 sentence pairs) on the assumption that they would be used for machine translation research.
We chose ASPEC-JC as a low-resource corpus: compared with other language pairs such as English-French, for which corpora usually comprise millions of parallel sentences, ASPEC-JC has only about 672k sentence pairs. We randomly extracted 300k sentence pairs from the training data for the experiments.
4 Corpus Augmentation by Long Sentence Segmentation
Sennrich et al. (2015) proposed a method to extend parallel corpora by back-translating target language sentences in monolingual corpora to obtain pseudo-source sentences; the pseudo-source sentences together with the original target sentences are then added to the parallel corpus.
Our method expands the existing parallel corpus with itself rather than with monolingual data, unlike back-translation methods that rely on monolingual data (Sennrich et al., 2015; Currey et al., 2017; Fadaee et al., 2017). Moreover, our method can be combined with other corpus augmentation methods. The augmentation process includes the following phases: 1) splitting ‘long’ parallel sentence pairs of the corpus into parallel partial sentence pairs, 2) back-translating the target partial sentences, and 3) constructing pseudo-parallel sentence pairs by combining the source sentences and the back-translated target partial sentences. To be precise, a ‘long’ sentence here means a sentence that contains more than one punctuation mark.
4.1 Generating bilingual partial sentences
The following procedure generates parallel partial sentence pairs from long parallel sentence pairs.
1) Obtain word alignment information from the tokenized Japanese-Chinese parallel sentences.
2) Split the long parallel sentences into segments at punctuation symbols such as “,”, “;”, and “:”. Figure 1 presents an example of the word alignment information and the segments of a sentence pair.
3) Obtain source-to-target segment alignments: for each source segment s and target segment t, count the words in t that correspond to words in s according to the word alignment information. The numerical values on the arrows in Figure 1 represent the rate of correspondence between the segments. We infer that t corresponds to s if the rate is greater than or equal to a fixed threshold.
4) Obtain target-to-source segment alignments by the same procedure as in step 3.
5) Concatenate multiple segments to form one-to-one relations wherever a one-to-many or many-to-many relation exists between the segments.
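The segment-alignment steps above can be sketched as follows. This is a minimal illustration under our own naming: the segment representation, the threshold value, and the grouping logic are assumptions, and only the source-to-target direction of steps 3-4 is shown.

```python
from collections import defaultdict

# Hypothetical sketch of segment alignment from word alignments (not the
# authors' code). Segments are (start, end) token-index ranges; word
# alignments are (src_idx, tgt_idx) pairs, e.g. from fast_align.
def align_segments(src_segments, tgt_segments, word_alignments, threshold=0.5):
    def seg_of(segments, idx):
        for k, (s, e) in enumerate(segments):
            if s <= idx < e:
                return k
        return None

    # Count aligned word pairs between every source/target segment pair.
    counts = defaultdict(int)
    src_totals = defaultdict(int)
    for i, j in word_alignments:
        si, tj = seg_of(src_segments, i), seg_of(tgt_segments, j)
        if si is not None and tj is not None:
            counts[(si, tj)] += 1
            src_totals[si] += 1

    # Step 3: link a target segment to a source segment if its share of the
    # source segment's aligned words meets the threshold.
    links = set()
    for (si, tj), c in counts.items():
        if c / src_totals[si] >= threshold:
            links.add((si, tj))

    # Step 5: merge linked segments into one-to-one groups, so one-to-many
    # and many-to-many relations become single concatenated pairs.
    groups, seen = [], set()
    for si, tj in links:
        if (si, tj) in seen:
            continue
        src_grp, tgt_grp = {si}, {tj}
        changed = True
        while changed:
            changed = False
            for a, b in links:
                if (a in src_grp) != (b in tgt_grp):
                    src_grp.add(a); tgt_grp.add(b); changed = True
        for a in src_grp:
            for b in tgt_grp:
                seen.add((a, b))
        groups.append((sorted(src_grp), sorted(tgt_grp)))
    return groups
```

In the full method, the same computation is repeated in the target-to-source direction (step 4) before grouping.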
In the example of Figure 2, each sentence is divided into three segments, from which two parallel partial sentence pairs are generated.
Table 2: Translation results (300k original sentence pairs).

| Method | # sentences | # back-translated | BLEU (%) JC | BLEU (%) CJ |
|---|---|---|---|---|
| Copied | 952k (JC), 952k (CJ) | 0 | 39.2 (+0.5) | 39.8 (+1.9) |
| Partial | 984k (JC), 984k (CJ) | 0 | 39.2 (+0.5) | 39.2 (+1.3) |
| Back-translation | 518k (JC), 518k (CJ) | 218k (JC), 218k (CJ) | 39.4 (+0.7) | 39.4 (+1.5) |
| Proposed | 952k (JC), 952k (CJ) | 218k (JC), 218k (CJ) | 39.5 (+0.8) | 40.1 (+2.2) |
4.2 Corpus augmentation by generated bilingual partial sentences
Using the generated parallel partial sentences, pseudo-parallel sentences are constructed according to the following procedure.
1) Back-translate the target partial sentences into the source language with a translation model built from the parallel data.
2) Create a pseudo-source sentence that differs in part from the original source sentence by replacing one part of the original with the corresponding back-translated partial sentence. For example, if a sentence is divided into three partial sentences, then three pseudo-source sentences are created.
3) Copy the target sentences corresponding to the created pseudo-source sentences to produce pseudo-parallel sentence pairs.
4) Add the generated pseudo-parallel sentence pairs to the original parallel corpus.
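The construction procedure above can be sketched as follows. The function names are our own and the back-translator is mocked for illustration; the actual method uses a trained NMT model for back-translation.

```python
# Illustrative sketch of pseudo-parallel pair construction (hypothetical
# helper, not the authors' code).
def make_pseudo_pairs(src_partials, tgt_partials, back_translate):
    """src_partials / tgt_partials: aligned partial sentences of one long pair.
    back_translate: a target-to-source translation function (a trained model).
    Returns one pseudo (source, target) pair per replaced partial sentence."""
    original_tgt = "".join(tgt_partials)  # no spaces: Japanese/Chinese text
    pairs = []
    for k, tgt_part in enumerate(tgt_partials):
        pseudo_part = back_translate(tgt_part)        # step 1: back-translate
        pseudo_src = "".join(                          # step 2: replace one part
            pseudo_part if i == k else p for i, p in enumerate(src_partials))
        pairs.append((pseudo_src, original_tgt))       # step 3: copy the target
    return pairs
```

A sentence split into n partial sentences thus yields n pseudo-parallel pairs, each sharing the unchanged target side.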
5 Evaluation and Discussion
5.1 Experiment settings
We follow the NMT architecture of Luong et al. (2015) and implement it using OpenNMT (Klein et al., 2017). The model has one layer with 512 cells; the embedding size is 512. The parameters are initialized uniformly at random. We train with plain SGD, starting with a learning rate of 1 until epoch 6 and halving it after each subsequent epoch. The maximum batch size is 100. The normalized gradient is rescaled whenever its norm exceeds 1. Because the amount of training data (300k sentence pairs as the baseline) is relatively small, the dropout probability is set to 0.5 to avoid overfitting. Decoding is performed by beam search with a beam size of 5. We segment Chinese and Japanese sentences into words using Jieba (http://github.com/fxsjy/jieba) and MeCab (http://taku910.github.io/mecab). We employed fast_align (http://github.com/clab/fast_align) to obtain word alignment information, which was symmetrized using the included atools command.
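The stated learning-rate schedule can be written out as follows; whether epoch 6 itself still uses the base rate is our reading of the text, so the boundary here is an assumption.

```python
# Sketch of the described plain-SGD decay: learning rate 1.0 through epoch 6,
# then halved for each subsequent epoch (boundary handling is our assumption).
def learning_rate(epoch: int, start_decay_at: int = 6, base_lr: float = 1.0) -> float:
    if epoch <= start_decay_at:
        return base_lr
    return base_lr * 0.5 ** (epoch - start_decay_at)

print([learning_rate(e) for e in range(5, 10)])  # → [1.0, 1.0, 0.5, 0.25, 0.125]
```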
We report as the evaluation BLEU value the average of the BLEU scores from the epoch at which validation perplexity (perplexity on the development data) stopped improving through epoch 16.
5.2 Experiment results and discussion
The translation results are presented in Table 2 (for 300k sentence pairs). “Baseline” is character-level translation with the 300k original training data. The back-translation models for corpus augmentation are constructed using the same 300k original training data. “Copied” adds duplicate copies of both the source and target sides of the training data, repeated the same number of times as in the proposed method; this experiment aims to highlight differences between generated pseudo-parallel sentence pairs and unchanged sentence pairs. “Partial” augments the corpus with the parallel partial sentences generated by the procedure in Section 4.1, without back-translating them or mixing them into pseudo-source sentences; this experiment aims to confirm that the back-translation and mixing steps (Section 4.2) are necessary. This method expands the parallel corpus from 300k to 984k sentence pairs in both directions. “Back-translation” back-translates the same data as the proposed method does (218k sentence pairs from the original training data); this experiment compares the proposed method with the back-translation method of Sennrich et al. (2015) on the same back-translated data. “# sentences” in the tables denotes the size (the number of sentence pairs) of the training data, whereas “# back-translated” denotes the number of parallel sentence pairs used for back-translation processing, i.e., the corpus augmentation, in each method.
Although the generated pseudo-source sentences contain translation errors and unnatural expressions, the BLEU scores were higher than those of “Copied”, “Partial”, and “Back-translation” in both directions, JC and CJ. These results demonstrate that the proposed method is effective for extending a small-scale parallel corpus to improve NMT performance.
The experiments described above demonstrate the effectiveness of the proposed method. Notably, our approach is based only on the original parallel data and does not require any additional monolingual data, unlike the back-translation method of Sennrich et al. (2015). Most corpus augmentation methods pair monolingual training data with automatic back-translations and then treat them as additional parallel training data. Therefore, we added comparison experiments.
We conducted a comparison experiment using 300k sentences as the original data and the remaining 372k sentences as the monolingual data.
The translation results of the comparison experiment are presented in Table 2. “+Proposed” back-translates 508k and 513k sentence pairs from the “300k+mono” setting (672k training data), so that the numbers of sentence pairs increase from 672k to 2,255k and 2,200k in the two directions. The proposed method produced higher BLEU scores than the monolingual back-translation method alone. These comparison experiments demonstrate that our proposed method can further augment data already extended by other corpus augmentation methods to yield better translation performance. In the future, we plan to combine the proposed method with other augmentation approaches, as our results suggest this may be more beneficial than back-translation alone. Salient benefits of the proposed method are that it requires no monolingual data and that, without changing the neural network architecture, it can generate more pseudo-parallel sentences; moreover, it can be combined with other augmentation methods.
In this paper, we proposed a simple but effective approach to augmenting an NMT corpus for low-resource language pairs by segmenting long sentences in the corpus, back-translating the resulting partial sentences, and generating pseudo-parallel sentence pairs. We demonstrated that this approach generates more pseudo-parallel sentences and consequently yields higher translation quality for NMT. Future studies should include more comparative experiments using other language pairs with different amounts of data.
- Currey et al. (2017) Anna Currey, Antonio Valerio Miceli Barone, and Kenneth Heafield. 2017. Copied Monolingual Data Improves Low-Resource Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, pages 148–156. Association for Computational Linguistics.
- Fadaee et al. (2017) Marzieh Fadaee, Arianna Bisazza, and Christof Monz. 2017. Data Augmentation for Low-Resource Neural Machine Translation. In Proc. 55th Annual Meeting of the Assoc. for Computational Linguistics (Volume 2: Short Papers), pages 567–573, Vancouver, Canada.
- Gwinnup et al. (2017) Jeremy Gwinnup, Timothy Anderson, Grant Erdmann, Katherine Young, Michaeel Kazi, Elizabeth Salesky, Brian Thompson, and Jonathan Taylor. 2017. The AFRL-MITLL WMT17 Systems: Old, New, Borrowed, BLEU. In Proceedings of the Second Conference on Machine Translation, pages 303–309. Association for Computational Linguistics.
- He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems 29, pages 820–828. Curran Associates, Inc.
- Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. CoRR, abs/1701.02810.
- Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. CoRR, abs/1706.03872.
- Lample et al. (2018) Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-Based & Neural Unsupervised Machine Translation. CoRR, abs/1804.07755.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proc. 2015 Conf. on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. ACL.
- Nakazawa et al. (2016) Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. ASPEC: Asian Scientific Paper Excerpt Corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France.
- Poncelas et al. (2018) Alberto Poncelas, Dimitar Shterionov, Andy Way, Gideon Maillette de Buy Wenniger, and Peyman Passban. 2018. Investigating Backtranslation in Neural Machine Translation. CoRR, abs/1804.06189.
- Sennrich et al. (2017) Rico Sennrich, Alexandra Birch, Anna Currey, Ulrich Germann, Barry Haddow, Kenneth Heafield, Antonio Valerio Miceli Barone, and Philip Williams. 2017. The University of Edinburgh’s Neural MT Systems for WMT17. CoRR, abs/1708.00726.
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving Neural Machine Translation Models with Monolingual Data. CoRR, abs/1511.06709.