Corpus Augmentation by Sentence Segmentation for Low-Resource Neural Machine Translation

05/22/2019 ∙ by Jinyi Zhang, et al. ∙ Microsoft 0

Neural Machine Translation (NMT) has been proven to achieve impressive results. The translation results of an NMT system depend strongly on the size and quality of the parallel corpora. Nevertheless, for many language pairs, no large parallel corpora exist. In this paper, we propose a corpus augmentation method that segments long sentences in a corpus, applies back-translation, and generates pseudo-parallel sentence pairs. The experiment results of Japanese-Chinese and Chinese-Japanese translation with the Japanese-Chinese scientific paper excerpt corpus (ASPEC-JC) show that the method improves translation performance.




1 Introduction

Neural Machine Translation (NMT) has produced remarkable results with large-scale parallel corpora. However, for low-resource languages and for domain-specific translation tasks, the available parallel corpus is small. Accordingly, the translation performance is reduced considerably Koehn and Knowles (2017). Therefore, the study of NMT under conditions of low-resource language corpora has high practical value.

In this paper, we propose a corpus augmentation method that segments long sentences of the corpus into partial sentences, applies back-translation, and generates pseudo-parallel sentence pairs. The larger corpus can improve translation performance. In experiments using the Japanese–Chinese scientific paper excerpt corpus (ASPEC-JC) as the low-resource corpus, the proposed method achieves better translation performance than the baseline in both the Japanese-Chinese and Chinese-Japanese directions.

The main contributions of this paper are the following. We demonstrate the ability to improve the translation performance of NMT systems by mixing generated pseudo-parallel sentence pairs into training data with no monolingual data, and without changing the neural network architecture. This capability makes our approach applicable to different NMT architectures.

2 Related Work

Expanding the size of parallel corpora is an effective means of improving the translation quality of NMT for low-resource languages. A parallel corpus can be constructed quickly using back-translation with monolingual target data Sennrich et al. (2015). One study reported by Sennrich et al. (2017) also showed that even simply duplicating the monolingual target data and using them as the source data was sufficient to realize some benefits. Moreover, a pseudo-parallel corpus can be constructed using the copy method, i.e., the target language sentences are copied as the corresponding source language sentences Currey et al. (2017), which illustrates that even poor translations can be beneficial. Data augmentation for low-frequency words has also been shown to be effective Fadaee et al. (2017).

Regarding back-translation methods, Gwinnup et al. (2017) implemented their NMT system by iteratively applying back-translation. Lample et al. (2018) explored the use of generated back-translated data, aided by denoising with a language model trained on the target side. Translation performance can also be improved by iterative back-translation in both high-resource and low-resource scenarios Poncelas et al. (2018). A more refined form of back-translation is the dual learning approach of He et al. (2016), which integrates training on parallel data and training on monolingual data via round-tripping.

3 NMT and ASPEC-JC Corpus

For this research, we follow the NMT architecture by Luong et al. (2015), which is implemented as a global attentional encoder–decoder neural network with Long Short-Term Memory (LSTM). We apply it at the character level because, between Japanese and Chinese, character-level translation performs better than word-level translation. However, it is noteworthy that our proposed method is not specific to this architecture.

We conducted experiments with the ASPEC-JC corpus, which was constructed by manually translating Japanese scientific papers into Chinese Nakazawa et al. (2016). ASPEC-JC comprises four parts: training data (672,315 sentence pairs), development data (2,090 sentence pairs), development-test data (2,148 sentence pairs) and test data (2,107 sentence pairs) on the assumption that they would be used for machine translation research.

We chose ASPEC-JC as the low-resource corpus because it is small compared with corpora for language pairs such as English-French, which usually comprise millions of parallel sentences. The ASPEC-JC corpus has only about 672k sentence pairs. We randomly extracted 300k sentence pairs from the training data for the experiments.

4 Corpus Augmentation by Long Sentence Segmentation

Sennrich et al. (2015) proposed a method to extend parallel corpora by back-translating target language sentences in monolingual corpora to obtain pseudo-source sentences; the pseudo-source sentences together with the original target sentences are then added to the parallel corpus.

Our method expands the existing parallel corpus using only the corpus itself, unlike back-translation methods that rely on monolingual data Sennrich et al. (2015) Currey et al. (2017) Fadaee et al. (2017). Moreover, our method can be combined with other corpus augmentation methods. Our augmentation process includes the following phases: 1) splitting ‘long’ parallel sentence pairs of the corpus into parallel partial sentence pairs, 2) back-translating the target partial sentences, and 3) constructing parallel sentence pairs by combining the source and the back-translated target partial sentences. To be precise, a ‘long’ sentence here means a sentence that contains more than one punctuation mark.
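The definition of a ‘long’ sentence can be sketched as a simple check on tokenized text. This is a minimal illustration, not the authors' implementation: the names `SPLIT_PUNCT` and `is_long` are hypothetical, and the set of marks that count is an assumption (the text does not say whether sentence-final marks are included), so it is exposed as a parameter.

```python
# Hypothetical default: comma, semicolon, and colon in ASCII and full-width forms.
# Which marks actually count is an assumption; pass a different set if needed.
SPLIT_PUNCT = frozenset(",;:,;:")


def is_long(tokens, punct=SPLIT_PUNCT):
    """Return True if the tokenized sentence contains more than one punctuation mark."""
    return sum(1 for tok in tokens if tok in punct) > 1


# Usage: only sentences with at least two marks qualify for segmentation.
print(is_long(["a", ",", "b", ";", "c"]))  # two marks
print(is_long(["a", ",", "b"]))            # one mark
```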

4.1 Generating bilingual partial sentences

The following procedure generates parallel partial sentence pairs from long parallel sentence pairs.

Figure 1: Example of word alignment information and sentence segments.
Figure 2: Examples of generated parallel partial sentences.
  1. Obtain the word alignment information from tokenized Japanese-Chinese parallel sentences.

  2. Split the long parallel sentences into segments at the punctuation symbols, such as “,”, “;”, “:”. Figure 1 presents an example of the word alignment information and the segments of a sentence pair.

  3. Obtain source-target segment alignments: For each source segment and each target segment, count the words in the source segment that correspond to words in the target segment according to the word alignment information. The numerical values on the arrows in Figure 1 represent the rate of the correspondence relation between the segments. We infer that a source segment corresponds to a target segment if the rate is greater than or equal to a threshold value. In this research, the threshold is set to a fixed value.

  4. Obtain target-source segment alignments: Apply the procedure of step 3 in the opposite direction.

  5. Concatenate multiple segments to form a one-to-one relation if there is a one-to-many or many-to-many relation between the segments.

In Figure 2, each sentence is divided into three segments. Thereby, two parallel partial sentences are generated.
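Steps 2 and 3 of the procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: all function names are hypothetical, the word alignment is assumed to be given as `i-j` index pairs (the format fast_align writes), the rate is assumed to be the share of a source segment's alignment links that land in a given target segment, and the threshold of 0.5 in the usage example is purely illustrative. Step 5 (concatenating segments in one-to-many relations) is not covered.

```python
from collections import Counter

# Assumed splitting marks (ASCII and full-width comma, semicolon, colon).
SPLIT_PUNCT = frozenset(",;:,;:")


def split_segments(tokens, punct=SPLIT_PUNCT):
    """Return (start, end) token ranges; each segment ends after a splitting mark."""
    segments, start = [], 0
    for i, tok in enumerate(tokens):
        if tok in punct:
            segments.append((start, i + 1))
            start = i + 1
    if start < len(tokens):
        segments.append((start, len(tokens)))
    return segments


def parse_alignment(line):
    """Parse a line of 'i-j' pairs into (source_index, target_index) tuples."""
    return [tuple(int(x) for x in pair.split("-")) for pair in line.split()]


def segment_of(index, segments):
    """Return the position of the segment that contains the given token index."""
    for k, (start, end) in enumerate(segments):
        if start <= index < end:
            return k
    raise ValueError(f"token index {index} is outside all segments")


def correspondence_rates(src_segments, tgt_segments, alignment):
    """Map (i, j) to the share of source segment i's links that fall in target segment j."""
    counts, totals = Counter(), Counter()
    for src_idx, tgt_idx in alignment:
        i = segment_of(src_idx, src_segments)
        j = segment_of(tgt_idx, tgt_segments)
        counts[(i, j)] += 1
        totals[i] += 1
    return {(i, j): c / totals[i] for (i, j), c in counts.items()}


def aligned_segment_pairs(rates, threshold):
    """Keep segment pairs whose rate is greater than or equal to the threshold."""
    return sorted(pair for pair, rate in rates.items() if rate >= threshold)


# Usage with toy tokens; the threshold value here is illustrative only.
src = ["a", "b", ",", "c", "d"]
tgt = ["x", ",", "y", "z"]
links = parse_alignment("0-0 1-0 1-2 2-1 3-2 4-3")
rates = correspondence_rates(split_segments(src), split_segments(tgt), links)
print(aligned_segment_pairs(rates, threshold=0.5))
```

Running the same functions with the source and target roles swapped (and each alignment pair reversed) gives the target-source direction of step 4.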

| Method | # sentences | # back-translated | JC BLEU (%) | CJ BLEU (%) |
|---|---|---|---|---|
| Baseline | 300k (JC), 300k (CJ) | 0 | 38.7 | 37.9 |
| Copied | 952k (JC), 952k (CJ) | 0 | 39.2 (+0.5) | 39.8 (+1.9) |
| Partial | 984k (JC), 984k (CJ) | 0 | 39.2 (+0.5) | 39.2 (+1.3) |
| Back-translation | 518k (JC), 518k (CJ) | 218k (JC), 218k (CJ) | 39.4 (+0.7) | 39.4 (+1.5) |
| Proposed | 952k (JC), 952k (CJ) | 218k (JC), 218k (CJ) | 39.5 (+0.8) | 40.1 (+2.2) |

Table 1: Experiment results of 300k training data. Translation directions are designated as Japanese→Chinese (JC) and Chinese→Japanese (CJ) in the table.

| Method | # sentences | # back-translated | JC BLEU (%) | CJ BLEU (%) |
|---|---|---|---|---|
| 300k+mono | 672k (JC), 672k (CJ) | 372k (JC), 372k (CJ) | 39.6 | 39.9 |
| +Proposed | 2,255k (JC), 2,200k (CJ) | 508k (JC), 513k (CJ) | 40.1 (+0.5) | 41.1 (+1.2) |

Table 2: Experiment results of 300k training data with 372k monolingual data. Translation directions are designated as Japanese→Chinese (JC) and Chinese→Japanese (CJ) in the table.

4.2 Corpus augmentation by generated bilingual partial sentences

Using the generated parallel partial sentences, pseudo-parallel sentences are constructed according to the following procedure.

  1. Back-translate the target partial sentences into the source language with a translation model built from the parallel data.

  2. Create a pseudo-source sentence that is partly different from the original source sentence by replacing a part of the original sentence with a partial sentence obtained through back-translation. For example, if a sentence is divided into three partial sentences, then three pseudo-source sentences will be created.

  3. Copy the target sentences corresponding to the created pseudo-source sentences to produce pseudo-parallel sentences.

  4. Add the generated pseudo-parallel sentences to the original parallel corpus.
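The construction of pseudo-parallel sentences can be sketched as below. This is a minimal sketch, not the authors' implementation: `make_pseudo_pairs` is a hypothetical name, `back_translate` stands in for whatever target-to-source translation model is used, and joining partial sentences without a separator is an assumption that suits unsegmented Japanese and Chinese text.

```python
from typing import Callable, List, Tuple


def make_pseudo_pairs(
    src_parts: List[str],
    tgt_parts: List[str],
    back_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build one pseudo-parallel pair per aligned partial sentence.

    For each position k, the k-th source part is replaced by the
    back-translation of the k-th target part; the target sentence is
    copied unchanged.
    """
    if len(src_parts) != len(tgt_parts):
        raise ValueError("source and target must have the same number of parts")
    target = "".join(tgt_parts)
    pairs = []
    for k, tgt_part in enumerate(tgt_parts):
        pseudo_parts = list(src_parts)
        pseudo_parts[k] = back_translate(tgt_part)
        pairs.append(("".join(pseudo_parts), target))
    return pairs


# Usage with a stand-in "model" that only marks the replaced part.
def fake_model(text: str) -> str:
    return "<" + text + ">"


for pseudo_src, tgt in make_pseudo_pairs(["A,", "B,", "C."], ["x,", "y,", "z."], fake_model):
    print(pseudo_src, "|||", tgt)
```

A sentence split into three partial sentences yields three pseudo-source sentences, each differing from the original source in exactly one part, which matches the example given in step 2.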

5 Evaluation and Discussion

5.1 Experiment settings

We follow the NMT architecture by Luong et al. (2015) and implement it using OpenNMT Klein et al. (2017). The model has one layer with 512 cells; the embedding size is 512. The parameters are initialized from a uniform distribution. Training uses plain SGD, starting with a learning rate of 1 until epoch 6, after which the learning rate is multiplied by 0.5 at each epoch. The maximum batch size is 100. The normalized gradient is rescaled whenever its norm exceeds 1. Because the amount of training data (300k as the baseline) is relatively small, the dropout probability is set to 0.5 to avoid overfitting. Decoding is performed by beam search with a beam size of 5. We segment the Chinese and Japanese sentences into words using Jieba and MeCab, respectively. We employed fast_align to obtain word alignment information, which was symmetrized using the included atool command.
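The learning-rate schedule and gradient rescaling described in these settings can be illustrated with a short sketch. It is not the OpenNMT implementation: the function names are hypothetical, epochs are assumed to be 1-indexed, and the first decayed epoch is assumed to be epoch 7 (the text says only that the rate is 1 "until epoch 6").

```python
import math


def learning_rate(epoch, initial=1.0, decay=0.5, last_constant_epoch=6):
    """Return the SGD learning rate for a 1-indexed epoch.

    The rate stays at `initial` through `last_constant_epoch` (assumed
    inclusive) and is multiplied by `decay` for every epoch after that.
    """
    if epoch < 1:
        raise ValueError("epochs are 1-indexed")
    return initial * decay ** max(0, epoch - last_constant_epoch)


def rescale_gradient(grad, max_norm=1.0):
    """Scale the gradient vector down so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]


# Usage: print the schedule for the first ten epochs.
for epoch in range(1, 11):
    print(epoch, learning_rate(epoch))
```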

The evaluation BLEU value is the average of the BLEU scores from the epoch at which the validation perplexity (perplexity on the development data) stopped improving through epoch 16.

5.2 Experiment results and discussion

The translation results are presented in Table 1 (for 300k sentence pairs). “Baseline” is a character-level translation model trained on the 300k original training data. The back-translation models for corpus augmentation are built from the same 300k original training data. “Copied” adds duplicate copies of both the source and target sides of the training data, as many times as the proposed method adds pseudo-parallel pairs; this experiment highlights the difference between the generated pseudo-parallel sentence pairs and unchanged sentence pairs. “Partial” augments the corpus with the parallel partial sentences generated by the procedure in Section 4.1, without back-translating them or mixing them into the original sentences; this experiment tests whether the mixing step (Section 4.2, step 2) is necessary. It expands the parallel corpus from 300k to 984k sentence pairs in both directions. “Back-translation” back-translates the same data as the proposed method (218k sentence pairs from the original training data); this experiment compares the proposed method with the back-translation method of Sennrich et al. (2015) on the same back-translated data. “# sentences” in the tables denotes the size (the number of sentence pairs) of the training data, whereas “# back-translated” denotes the number of parallel sentence pairs used for back-translation processing, i.e., the corpus augmentation, in each method.

Although the generated pseudo-source sentences contain translation errors and unnatural expressions, the BLEU scores of the proposed method were higher than those of “Copied”, “Partial” and “Back-translation” in both directions, JC and CJ. These results demonstrate that the proposed method is effective for extending a small-scale parallel corpus to improve NMT performance.

The experiments described above support the effectiveness of the proposed method. Our approach relies only on the original parallel data and requires no additional monolingual data, unlike the back-translation method of Sennrich et al. (2015). However, most corpus augmentation methods pair monolingual training data with automatic back-translations and treat the result as additional parallel training data. Therefore, we added comparison experiments.

We conducted a comparison experiment using 300k sentences as the original data and the remaining 372k sentences as the monolingual data.

The translation results of the comparison experiment are presented in Table 2. “+Proposed” back-translates 508k (JC) and 513k (CJ) sentence pairs from “300k+mono” (672k training data), so that the number of sentence pairs increases from 672k to 2,255k in the JC direction and to 2,200k in the CJ direction. The proposed method produced higher BLEU scores than the monolingual back-translation method alone. These comparison experiments demonstrate that our proposed method can further augment data already extended by other corpus augmentation methods to yield better translation performance. In the future, we plan to combine the proposed method with other augmentation approaches, because our results suggest that the combination may be more beneficial than back-translation alone. Salient benefits of the proposed method are that it requires no monolingual data and that it generates more pseudo-parallel sentences without changing the neural network architecture.

6 Conclusion

In this paper, we proposed a simple but effective approach to augment the NMT corpus for low-resource language pairs by segmenting long sentences in the corpus, using back-translation, and generating pseudo-parallel sentence pairs. We demonstrated that this approach generates more pseudo-parallel sentences and, consequently, yields higher translation quality for NMT. Future studies should include more comparative experiments using other language pairs with different amounts of data.