Beyond Error Propagation in Neural Machine Translation: Characteristics of Language Also Matter

09/01/2018 · Lijun Wu et al. · Microsoft, Peking University, Sun Yat-sen University

Neural machine translation usually adopts autoregressive models and suffers from exposure bias as well as the consequent error propagation problem. Many previous works have discussed the relationship between error propagation and the accuracy drop problem (i.e., in left-to-right decoding models, the left part of the translated sentence is often better than the right part). In this paper, we conduct a series of analyses to deeply understand this problem and reach several interesting findings. (1) The role of error propagation in accuracy drop is overstated in the literature, although it indeed contributes to the problem. (2) The characteristics of a language play a more important role: the left part of the translation result in a right-branching language (e.g., English) is more likely to be more accurate than its right part, while the right part is more accurate for a left-branching language (e.g., Japanese). Our discoveries are confirmed on different model structures, including Transformer and RNN, and in other sequence generation tasks such as text summarization.


1 Introduction

Neural machine translation (NMT) has attracted much research attention in recent years Bahdanau et al. (2014); Shen et al. (2018); Song et al. (2018); Xia et al. (2018); He et al. (2016); Wu et al. (2017, 2018). The dominant approach leverages an encoder-decoder framework Cho et al. (2014); Sutskever et al. (2014), in which the decoder generates the target tokens autoregressively, one by one from left to right, with the generation of each target token conditioned on the previously generated tokens.
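To make the decoding process concrete, here is a minimal sketch of left-to-right greedy autoregressive decoding in Python; model, encode, and step are hypothetical placeholders for an encoder-decoder NMT system, not the implementation used in this paper.

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=200):
    """Generate target tokens one by one, left to right."""
    memory = model.encode(src_tokens)   # encoder states for the source sentence
    prefix = [bos_id]
    for _ in range(max_len):
        # Each next token is conditioned on all previously *generated*
        # tokens, which is what makes decoding autoregressive.
        next_id = model.step(memory, prefix).argmax()
        if next_id == eos_id:
            break
        prefix.append(next_id)
    return prefix[1:]                   # drop the BOS symbol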

It has been observed that for an NMT model with left-to-right decoding, the words in the right part of its translation results are usually worse than those in the left part in terms of accuracy Zhang et al. (2018); Bengio et al. (2015); Ranzato et al. (2015); Hassan et al. (2018); Liu et al. (2016b, a). This phenomenon is referred to as accuracy drop in this paper. A straightforward explanation for accuracy drop is error propagation: if a word is mistakenly predicted during inference, the error propagates, and the future words conditioned on it are affected. Different methods have been proposed to address the problem of accuracy drop (Liu et al., 2016a, b; Hassan et al., 2018).

Instead of solving the problem, in this paper, we aim to deeply understand the causes of the problem. In particular, we want to answer the following two questions:

  • Is error propagation the main cause of accuracy drop?

  • Are there any other causes leading to accuracy drop?

To answer these two questions, we conduct a series of experiments to analyze the problem.

First, we train NMT models separately with left-to-right and right-to-left decoding Sennrich et al. (2016a); Liu et al. (2016b); He et al. (2017); Gao et al. (2018) on several language pairs (i.e., German to English, English to German, and English to Chinese). If error propagation were the main cause of accuracy drop, then the right part of the translations generated by right-to-left NMT models should be more accurate than the left part. However, we observe the opposite: the accuracy of the right part of the translated sentences is lower than that of the left part for both left-to-right and right-to-left models, which contradicts the error propagation explanation. This shows that error propagation alone cannot fully explain accuracy drop, and even raises the question of whether error propagation exists or matters at all.

Second, to further investigate the influence of error propagation on accuracy drop, we conduct a set of experiments with teacher forcing Williams and Zipser (1989) during inference, in which we feed the ground-truth preceding words to predict the next target word. Teacher forcing eliminates exposure bias as well as error propagation in inference. The results verify the existence of error propagation, since the later part of the translation (the right part in left-to-right decoding and the left part in right-to-left decoding) gains more accuracy improvement from teacher forcing, regardless of the decoding direction. Meanwhile, the accuracy of the right part is still lower than that of the left part even with teacher forcing, which demonstrates that there must be some other cause of accuracy drop apart from error propagation.

Third, inspired by linguistics, we find that the concept of branching Berg et al. (2011); Payne (2006) can help explain the problem. We conduct a third set of experiments to study the correlation between language branching and accuracy drop. We find that if the target language is right-branching, such as English, the accuracy of the left part is usually higher than that of the right part, for both left-to-right and right-to-left NMT models; for a left-branching target language such as Japanese, the accuracy of the left part is usually lower than that of the right part, again for both kinds of models. The intuitive explanation is that a right-branching language has a clearer (easier-to-predict) structural pattern in the left part of a sentence than in the right part, since the main subject of the sentence is usually placed in the left part. We calculate two kinds of statistics to verify this assumption: n-gram statistics (n-gram frequencies and conditional probabilities) and dependency parsing statistics. For right-branching languages, we find higher n-gram frequencies/conditional probabilities as well as more dependencies in the left part than in the right part; the opposite holds for left-branching languages.

We summarize our findings as follows.

  • Through empirical analyses, we find that the influence of error propagation is overstated in the literature, which may misguide future research. Error propagation alone cannot fully explain the accuracy drop in the left or right part of a sentence.

  • We find that the linguistic concept of branching correlates well with the accuracy drop in the left or right part of a sentence, and the corresponding analyses of n-gram and dependency parsing statistics explain this phenomenon well.

Our studies show that linguistics can be very helpful for understanding existing machine learning models and building better models for language-related tasks. We hope that our work can bring some insights to research on neural machine translation. We believe that our findings can help us design better translation models. For example, the finding on language branching suggests using left-to-right NMT models for right-branching languages such as English and right-to-left NMT models for left-branching languages such as Japanese.

2 Related Work

2.1 Exposure Bias and Error Propagation

Exposure bias and error propagation are two different concepts but are often mentioned together in the literature Bengio et al. (2015); Shen et al. (2016); Ranzato et al. (2015); Liu et al. (2016b, a); Zhang et al. (2018); Hassan et al. (2018). Exposure bias refers to the fact that a sequence generation model is usually trained with teacher forcing while it generates the sequence autoregressively during inference. This discrepancy between training and inference can yield errors that accumulate quickly along the generated sequence, which is known as error propagation Bengio et al. (2015); Shen et al. (2016); Ranzato et al. (2015).

Bengio et al. (2015) propose the scheduled sampling method to eliminate exposure bias and the resulting error propagation, achieving promising performance on sequence generation tasks such as image captioning. Shen et al. (2016); Ranzato et al. (2015) improve basic maximum likelihood estimation (MLE) with minimum risk training or reinforcement learning, aiming to address the limitations of MLE training and the exposure bias problem.

2.2 Tackling Accuracy Drop

Liu et al. (2016b, a); Zhang et al. (2018); Hassan et al. (2018) mainly ascribe accuracy drop (the accuracy of the right part being worse than that of the left part in most cases) to error propagation and propose different methods to solve this problem. Liu et al. (2016b, a); Hassan et al. (2018) use agreement regularization between left-to-right and right-to-left models to achieve better performance. Zhang et al. (2018) and Hassan et al. (2018) propose two-pass decoding to refine the generated sequence and yield better quality.

All these works focus on error propagation and accuracy drop. To our knowledge, there is no deep study of other causes of accuracy drop. In this paper, we aim to conduct such a study. Our study shows that accuracy drop is caused not only by error propagation, but also by the characteristics of the language itself.

3 Error Propagation and Accuracy Drop

3.1 Error Propagation is Not the Only Cause

A left-to-right NMT model feeds target tokens one by one from left to right during training and generates target tokens one by one from left to right during inference, while a right-to-left NMT model trains and generates tokens in the reverse direction. Intuitively, if error propagation were the root cause of accuracy drop, a right-to-left NMT model would generate translations whose right half is more accurate than the left half. In this section, we study the results of both left-to-right and right-to-left NMT models to analyze the relationship between error propagation and accuracy drop.

We conduct experiments on three translation tasks with different language pairs: IWSLT 2014 German-English (De-En), WMT 2014 English-German (En-De), and WMT 2017 English-Chinese (En-Zh). We choose the state-of-the-art NMT model Transformer Vaswani et al. (2017) as the basic model structure and train two separate models with left-to-right and right-to-left decoding on each language pair. More details about the datasets and models can be found in the supplementary materials (Sections A.1 and A.2). We evenly split each generated sentence into a left half and a right half with the same number of words. (1) For most sentences the last word is a period, which is easy to decode; to make a fair comparison, we simply remove the final period before dividing the translated sentence. (2) For a sentence with an odd number of words, we simply remove the word in the middle position so that the left and right halves have the same number of words.

Then, for both the left and the right half, we compute the accuracy with respect to the reference target sentence in terms of BLEU score Papineni et al. (2002). We use the multi-bleu.perl script (https://github.com/moses-smt/mosesdecoder/scripts/generic/multi-bleu.perl). When computing the BLEU score of the left or right half, the reference is the full reference sentence.
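As an illustration of this protocol, the following sketch implements the splitting rule and half-wise scoring in Python. We use the sacrebleu package here as a convenient stand-in for multi-bleu.perl (an assumption; the two scorers can give slightly different numbers).

import sacrebleu

def split_halves(sentence):
    """Split a tokenized sentence into equal left/right halves."""
    words = sentence.split()
    if words and words[-1] == ".":      # remove the easy-to-decode final period
        words = words[:-1]
    if len(words) % 2 == 1:             # remove the middle word if length is odd
        del words[len(words) // 2]
    half = len(words) // 2
    return " ".join(words[:half]), " ".join(words[half:])

def half_bleu(hypotheses, references):
    """Corpus BLEU of each half, scored against the full references."""
    lefts, rights = zip(*(split_halves(h) for h in hypotheses))
    left_bleu = sacrebleu.corpus_bleu(list(lefts), [references]).score
    right_bleu = sacrebleu.corpus_bleu(list(rights), [references]).score
    return left_bleu, right_bleu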

De-En En-De En-Zh
left-to-right 31.42 26.93 20.79
right-to-left 30.00 25.35 20.23
Table 1: BLEU scores on the test set of the three translation tasks with both left-to-right and right-to-left decoding.
left-to-right De-En En-De En-Zh
Left 10.17 7.90 7.41
Right 8.39 6.60 5.91
right-to-left De-En En-De En-Zh
Right 7.83 6.45 5.77
Left 9.41 7.11 7.01
Table 2: BLEU scores of the left and right halves for left-to-right and right-to-left NMT models. Liu et al. (2016a) report a partial BLEU score without length penalty; our results are consistent with their partial BLEU if the length penalty is simply removed when calculating BLEU.

We first report the BLEU scores of the full translation results (without split) in Table 1. As can be seen, the accuracy of the model is comparable to state-of-the-art results Vaswani et al. (2017); Wang et al. (2017, 2018). Afterwards we report the BLEU scores of the left half and the right half in Table 2. We have several observations.

  • When translating from left to right, the BLEU score of the left half is higher than that of the right half on all three tasks, which is consistent with previous observations and can be explained by error propagation.

  • When translating from right to left, the accuracy of the left half (now the later part of the generated sentence) is still higher than that of the right half. This observation contradicts previous analyses relating error propagation to accuracy drop, which hold that the error accumulated due to exposure bias should deteriorate the quality of the later part of the translation (i.e., the left half here).

The inconsistent observations above suggest that error propagation is not the only cause of accuracy drop and that other factors are at play. They even challenge the existence of error propagation: does error propagation really exist? In the next section we try to answer this question through teacher forcing experiments.

3.2 The Influence of Error Propagation

Teacher forcing Williams and Zipser (1989) in sequence generation means that, when training a sequence generation model, we feed the previous ground-truth tokens as inputs to predict the next target word. Here we apply teacher forcing in the inference phase of NMT: to generate the next word $y_t$, we input the preceding ground-truth words $y_{<t}$ rather than the previously generated words $\hat{y}_{<t}$. This largely eliminates the effect of error propagation, since no error can be propagated from the previously generated words.
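The following sketch shows teacher forcing at inference time; model.encode and model.step are the same hypothetical single-step decoder interface as in the Section 1 sketch.

def teacher_forced_decode(model, src_tokens, reference_ids, bos_id):
    """Predict each target word from the ground-truth prefix y_{<t}."""
    memory = model.encode(src_tokens)
    prefix = [bos_id]
    outputs = []
    for gold in reference_ids:
        # Predict the next word from the *reference* prefix, so no error
        # from earlier predictions can propagate to later steps.
        outputs.append(model.step(memory, prefix).argmax())
        prefix.append(gold)             # feed the ground truth, not the prediction
    return outputs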

De-En left-to-right right-to-left
0 1 Δ 0 1 Δ
Left 10.17 10.71 0.54 9.41 10.41 1.00
Right 8.39 9.25 0.86 7.83 8.45 0.62
En-De left-to-right right-to-left
0 1 Δ 0 1 Δ
Left 7.90 9.43 1.53 7.11 10.71 3.60
Right 6.60 8.36 1.76 6.45 8.37 1.92
En-Zh left-to-right right-to-left
0 1 Δ 0 1 Δ
Left 7.41 9.11 1.70 7.01 9.83 2.82
Right 5.91 8.55 2.64 5.77 7.54 1.77
Table 3: BLEU scores. "0" denotes translation results without teacher forcing during inference, "1" denotes translation results with teacher forcing, and Δ denotes the BLEU score improvement of teacher forcing over normal translation.

As in the last section, we evaluate the quality of the left and right halves of the translation results generated by both the left-to-right and right-to-left models. The results are summarized in Table 3. For comparison, we also include the BLEU scores of normal translation (without teacher forcing). We draw the following findings from Table 3:

  • Exposure bias exists. The accuracy of both left and right half tokens in the normal translation is lower than that in teacher forcing, which feeds the ground-truth tokens as inputs. This demonstrates that feeding the previously generated tokens (which might be incorrect) in inference indeed hurts translation accuracy.

  • Error propagation does exist. We find that errors accumulate along the sequential generation of the sentence. Taking En-Zh with the left-to-right NMT model as an example, the BLEU improvement of the right half (the second half of the generation) from teacher forcing over normal translation is 2.64, much larger than the improvement of the left half (the first half of the generation): 1.70. Similarly, for En-Zh with the right-to-left NMT model, the BLEU improvement of the left half (the second half of the generation) is 2.82, much larger than the improvement of the right half (the first half of the generation): 1.77.

  • Other causes exist. Taking En-De translation with the left-to-right model as an example, the accuracy of the left half (9.43) is higher than that of the right half (8.36) when there is no error propagation with teacher forcing. Similar results can be found in other language pairs and models. This suggests that there must be some other causes leading to accuracy drop, which will be studied in the next section.

4 Language Branching Matters

Sections 3.1 and 3.2 together show that error propagation influences accuracy drop but is not its only cause. We hypothesize that the language itself, i.e., its characteristics, may explain the phenomenon of accuracy drop.

Watanabe and Sumita (2002) find that left-to-right decoding performs better for Japanese-English translation while right-to-left decoding performs better for English-Japanese translation. We apply the same analysis settings as in Sections 3.1 and 3.2 to an English-Japanese (En-Jp) translation dataset. More details about this dataset and the model can be found in the supplementary materials (Sections A.1 and A.2).

Table 4 shows the BLEU scores on the En-Jp test set. It can be observed that, regardless of decoding direction (left-to-right or right-to-left) and with or without teacher forcing, the accuracy of the right half is always higher than that of the left half. This observation on Japanese is the opposite of what we found for English, German, and Chinese in Sections 3.1 and 3.2, and motivates us to investigate the differences between these languages.

left-to-right right-to-left
0 1 0 1
Left 7.90 9.91 7.45 8.95
Right 8.70 11.52 9.24 10.59
Table 4: BLEU scores on the En-Jp test set. "0" represents normal translation results, and "1" represents teacher-forcing translation results.

We find that a linguistic concept, branching, differentiates Japanese from languages such as English and German. Branching refers to the shape of the parse trees that represent the structure of sentences Berg et al. (2011); Payne (2006). Usually, right-branching sentences are head-initial: the main subject of the sentence is described first and is followed by a sequence of modifiers that provide additional information about the subject. On the contrary, left-branching sentences are head-final, placing such modifiers in the left part of the sentence Payne (2006).

English is a typical right-branching language, while Japanese is almost fully left-branching Wikipedia (2018); the two languages exhibit opposite accuracy drop phenomena, as shown in the preceding sections. When we say a language is typically left/right-branching, we mean that most sentences in the language follow the left/right-branching structure. While predominantly right-branching, German is less conclusively so than English. Chinese features a mixture of head-final and head-initial structures: its noun phrases are head-final, while clauses with strict head/complement ordering are head-initial (right-branching) Wikipedia (2018). Overall, Chinese is right-branching, but less conclusively so than German.

We believe that language branching is a main cause of accuracy drop. Intuitively, the main subject of a right-branching sentence is described first (in the left part) and is followed by additional modifiers (in the right part) Berg et al. (2011). Therefore, the left half of a right-branching sentence is more likely to possess a clear structural pattern and thus achieves higher generation accuracy than the right part, since the main subject is usually simpler and clearer than the modifiers that provide additional information about it. In the next section, we verify this intuition from a statistical perspective.

5 Correlation between Language Branching and Accuracy Drop

As previous work Arpit et al. (2017) shows, neural networks easily learn and memorize simple patterns but have difficulty making correct predictions on noisy examples. In this section, we study different branching languages from two aspects: the n-gram statistics of the target language, which have been used as a characterization of the hardness of learning Bengio et al. (2009), and the dependency statistics of parse trees. We show that these statistics correlate well with the accuracy drop between the left and right halves of the translation results.

5.1 N-gram Statistics

Intuitively speaking, if a pattern occurs frequently and deterministically, it is easy for a neural network to learn. By comparing general statistics on the n-gram frequency and n-gram conditional probability of the left-half and right-half tokens, we link language branching to accuracy drop.

Denote a bilingual dataset as $D=\{(x^i, y^i)\}_{i=1}^{M}$, where each $y^i$ is a sequence of words $(y^i_1, \dots, y^i_{T_i})$ and $T_i$ is the length of $y^i$. Let $F_l(y^i)$ and $P_l(y^i)$ denote the average n-gram frequency and n-gram conditional probability of the left half of $y^i$ (again, we assume $T_i$ is even; if not, we simply remove the middle word of $y^i$, as done in Section 3.1), i.e.,

$$F_l(y^i)=\frac{1}{K}\sum_{t=1}^{K} f(y^i_t,\dots,y^i_{t+n-1}), \qquad P_l(y^i)=\frac{1}{K}\sum_{t=1}^{K} p(y^i_{t+n-1}\mid y^i_t,\dots,y^i_{t+n-2}), \qquad K=\frac{T_i}{2}-n+1, \tag{1}$$

where $f(\cdot)$ and $p(\cdot)$ are the n-gram frequency and n-gram conditional probability calculated from the training dataset. Similarly, $F_r(y^i)$ and $P_r(y^i)$ denote the n-gram frequency and n-gram conditional probability of the right half.

We calculate the average n-gram frequencies $\bar{F}_l$ and $\bar{F}_r$ of the left and right halves over all target sentences in the training set, and likewise the average n-gram conditional probabilities $\bar{P}_l$ and $\bar{P}_r$, to compare the uncertainty of phrases in the two halves:

$$\bar{F}_{l}=\frac{1}{M}\sum_{i=1}^{M}F_{l}(y^i), \quad \bar{F}_{r}=\frac{1}{M}\sum_{i=1}^{M}F_{r}(y^i), \quad \bar{P}_{l}=\frac{1}{M}\sum_{i=1}^{M}P_{l}(y^i), \quad \bar{P}_{r}=\frac{1}{M}\sum_{i=1}^{M}P_{r}(y^i). \tag{2}$$

We also calculate the ratio of sentences whose left-half frequency/conditional probability is bigger/smaller than that of the right half, denoted as $R_{F_l>F_r}$/$R_{F_l<F_r}$ and $R_{P_l>P_r}$/$R_{P_l<P_r}$:

$$R_{F_l>F_r}=\frac{1}{M}\sum_{i=1}^{M}\mathbb{1}\left[F_l(y^i)>F_r(y^i)\right], \qquad R_{P_l>P_r}=\frac{1}{M}\sum_{i=1}^{M}\mathbb{1}\left[P_l(y^i)>P_r(y^i)\right], \tag{3}$$

and analogously for $R_{F_l<F_r}$ and $R_{P_l<P_r}$.

We choose $n=2$ and $n=3$ to calculate the metrics in Equations 2 and 3 on the different translation datasets. The numbers are listed in Tables 5 and 6.
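The following sketch shows how these statistics can be computed, following the reconstruction of Equations 1-3 given above (the paper's exact normalizations may differ).

from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_counts(corpus, n):
    """n-gram and (n-1)-gram prefix counts over the training corpus."""
    full, prefix = Counter(), Counter()
    for sent in corpus:
        w = sent.split()
        full.update(ngrams(w, n))
        prefix.update(ngrams(w, n - 1))
    return full, prefix

def half_stats(sentence, full, prefix, n):
    """Per-half average n-gram frequency and conditional probability."""
    words = sentence.split()
    if len(words) % 2 == 1:             # remove the middle word, as in Section 3.1
        del words[len(words) // 2]
    half = len(words) // 2
    results = []
    for part in (words[:half], words[half:]):
        grams = ngrams(part, n)
        if not grams:                   # too short to contribute
            results.append((0.0, 0.0))
            continue
        freq = sum(full[g] for g in grams) / len(grams)
        prob = sum(full[g] / max(prefix[g[:-1]], 1) for g in grams) / len(grams)
        results.append((freq, prob))
    return results                      # [(F_l, P_l), (F_r, P_r)]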

We can see that both the 2/3-gram frequency and the conditional probability of the left half are higher than those of the right half for the right-branching languages (English, German, and Chinese) in the De-En, En-De, and En-Zh datasets. For the left-branching language Japanese, the result is the opposite. The n-gram frequency and conditional probability statistics are consistent with our observations on accuracy drop in left/right-branching languages and verify our hypothesis: right-branching languages have clearer patterns in the left part (larger n-gram frequency and conditional probability), which consequently leads to higher translation accuracy in the left part than in the right part; left-branching languages show the opposite.

De-En En-De
2-gram 3-gram 2-gram 3-gram
F_l 5713.8 3122.7 13811.8 687.1
F_r 3026.5 1377.6 11692.2 419.9
R(F_l>F_r) 59.6% 55.8% 53.8% 53.6%
R(F_l<F_r) 38.8% 37.6% 46.0% 45.0%
Δ 20.8% 18.2% 7.8% 8.6%
En-Zh En-Jp
2-gram 3-gram 2-gram 3-gram
F_l 17707.0 1954.1 18910.0 1350.0
F_r 16256.4 1250.5 21076.7 1754.0
R(F_l>F_r) 51.9% 50.2% 41.2% 38.0%
R(F_l<F_r) 46.7% 43.9% 51.7% 52.3%
Δ 5.2% 6.3% -10.5% -14.3%
Table 5: The n-gram frequency statistics on the different translation datasets. F_l and F_r denote the average n-gram frequency of the left and right halves of the target sentences. R(F_l>F_r) and R(F_l<F_r) denote the ratio of sentences whose left-half n-gram frequency is bigger/smaller than that of the right half, and Δ = R(F_l>F_r) - R(F_l<F_r). Note that R(F_l>F_r) and R(F_l<F_r) sum to less than 1 since sentences with fewer than 4 words do not contribute to the n-gram statistics.
De-En En-De
2-gram 3-gram 2-gram 3-gram
P_l 0.137 0.181 0.082 0.155
P_r 0.092 0.116 0.080 0.148
R(P_l>P_r) 59.8% 56.6% 50.6% 51.7%
R(P_l<P_r) 38.7% 36.4% 49.2% 47.0%
Δ 21.2% 20.2% 1.4% 4.7%
En-Zh En-Jp
2-gram 3-gram 2-gram 3-gram
P_l 0.064 0.113 0.082 0.171
P_r 0.055 0.108 0.086 0.191
R(P_l>P_r) 52.1% 47.8% 43.9% 39.4%
R(P_l<P_r) 46.6% 47.0% 49.2% 50.9%
Δ 5.5% 0.8% -5.3% -11.5%
Table 6: The n-gram conditional probability statistics on the different translation datasets. P_l and P_r denote the average n-gram conditional probability of the left and right halves of the target sentences. R(P_l>P_r) and R(P_l<P_r) denote the ratio of sentences whose left-half n-gram conditional probability is bigger/smaller than that of the right half, and Δ = R(P_l>P_r) - R(P_l<P_r). Note that R(P_l>P_r) and R(P_l<P_r) sum to less than 1 for two reasons: (1) sentences with fewer than 4 words do not contribute to the statistics, and (2) we drop n-gram conditional probabilities whose denominator count is less than 100 to make the probability estimates robust.
Figure 1: Accuracy drop (the gap between the left and right BLEU scores) with respect to Δ from Tables 5 and 6 on the four translation tasks. Panel (a): accuracy drop vs. 3-gram frequency gap (%); panel (b): accuracy drop vs. 3-gram conditional probability gap (%). The x-axis (Δ) is the gap between the left and right ratios of the 3-gram frequency/conditional probability defined in Tables 5 and 6; the y-axis is the accuracy drop in terms of BLEU score under teacher forcing decoding.

We further visualize how the accuracy drop (between the left and right halves of the translations) correlates with the gap of the n-gram statistics between the two parts. The accuracy drop (in BLEU score) of the left/right halves is taken from teacher forcing with left-to-right decoding in Table 3, and the n-gram gap is taken from the Δ in the last rows of Tables 5 and 6. Figure 1 shows a strong correlation between accuracy drop and the gap of the n-gram statistics: as the gap increases from negative to positive values, the accuracy drop also increases from negative to positive.

5.2 Dependency Statistics

In this subsection, we study language branching from the perspective of dependency structure. We hypothesize that if one half of a sentence contains more dependencies among its own words, that half should be easier to predict, leading to higher accuracy. Here we analyze the English sentences in De-En translation and the Japanese sentences in En-Jp translation, since English is strongly right-branching and Japanese almost fully left-branching, as introduced before.

For English, we utilize the well-acknowledged Stanford Parser (https://nlp.stanford.edu/software/lex-parser.shtml) to parse the sentences. After obtaining the parsing results, we split each sentence into left and right halves and separately count the number of dependencies in each half (for simplicity, we only count dependencies, without considering dependency types; the detailed parsing formats can be found in the supplementary material, Section A.3). For Japanese, we leverage the open-source toolkit J.DepP (http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/) to parse the sentences and then count the number of dependencies in each half.
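A sketch of the intra-half dependency counting, assuming each parse has already been converted into (head_index, dependent_index) pairs over 0-based token positions (the conversion from the Stanford Parser and J.DepP output formats, shown in Section A.3, is omitted here):

def intra_half_dependencies(num_words, arcs):
    """Count dependency arcs whose head and dependent lie in the same half."""
    if num_words % 2 == 1:
        num_words -= 1                  # ignore the middle word, as in Section 3.1
    half = num_words // 2
    left = sum(1 for h, d in arcs if h < half and d < half)
    right = sum(1 for h, d in arcs if h >= half and d >= half)
    return left, right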

English Japanese
Left 40242 921735
Right 31509 1570630
Table 7: Number of dependencies in the left and right halves of the English (De-En) and Japanese (En-Jp) training corpora. The absolute numbers differ greatly because the two training corpora contain different numbers of sentences.

We provide the results in Table 7. As can be observed, for English sentences the left-half words depend more on each other than the right-half words do, while for Japanese sentences the right-half words have more dependencies. This is consistent with our observations on accuracy drop and helps explain the higher accuracy of the left part in English translations and of the right part in Japanese translations.

6 Extended Analyses and Discussions

We have analyzed the accuracy drop problem from the perspectives of error propagation and the language itself in the previous sections. In this section, we provide extended analyses and several discussions to give a clearer understanding of the accuracy drop problem.

6.1 More Languages on Left-Branching

The previous analyses are based on four languages: three right-branching (En, De, Zh) and one left-branching (Jp). To avoid experimental bias and randomness, we add one more translation task, English-Turkish (En-Tr) translation (the dataset and model descriptions are in the supplementary material, Sections A.1 and A.2), as Turkish is a left-branching language. We calculate the BLEU score of the left/right halves in left-to-right and right-to-left decoding, as in Sections 3.1 and 3.2.

0 1
Left 5.83 7.44
Right 5.27 7.96
Table 8: BLEU scores on En-Tr test set with left-to-right generation. Normal translation is denoted as “0”, and teacher-forcing translation is denoted as “1”.

The results are provided in Table 8. For left-to-right decoding, the accuracy of the left half is higher than that of the right half in normal translation; with teacher forcing, however, the right half becomes more accurate. The teacher-forcing behavior matches English-Japanese translation (the right half is more accurate than the left half), which is consistent with Turkish being left-branching. Unlike Japanese, however, the normal translation shows the opposite pattern, which suggests that for Turkish the influence of language branching is weaker than that of error propagation.

6.2 Other Model Structures

One may wonder whether the results in this paper are biased towards a certain model structure, since all the above analyses use Transformer. To address this concern, we conduct an additional experiment on the De-En translation task with an RNN (GRU)-based model (the detailed settings are in the supplementary material, Section A.2). The results are shown in Table 9, and the observations are consistent with what we observed for Transformer: the accuracy of the left half of the De-En translation is always higher than that of the right half, in both left-to-right and right-to-left decoding.

left-to-right right-to-left
Full 27.63 25.44
Left 9.17 8.37
Right 7.51 7.25
Table 9: BLEU scores on the left-to-right and right-to-left translation sentences on the De-En test set, with RNN-based model. “Full” means the BLEU score of the whole translation sentence.

6.3 Other Sequence Generation Tasks

We conduct an experimental analysis on abstractive summarization, which is also a sequence generation task. The goal of the task is to condense a long news sentence into a short summary. We use the English Gigaword dataset and train an RNN-based model for sentence summarization. The accuracy is measured by the commonly used ROUGE F1 scores, reported in Table 10.
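For illustration, the half-wise ROUGE evaluation can be sketched with the open-source rouge-score package (an assumption; the paper does not name its ROUGE implementation), reusing the split_halves helper from the Section 3.1 sketch:

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)

def half_rouge(hypothesis, reference):
    """ROUGE F1 of each half of a generated summary vs. the full reference."""
    left, right = split_halves(hypothesis)
    left_f1 = {k: v.fmeasure for k, v in scorer.score(reference, left).items()}
    right_f1 = {k: v.fmeasure for k, v in scorer.score(reference, right).items()}
    return left_f1, right_f1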

We observe the same phenomenon as in the translation tasks: the accuracy of the left half is always better than that of the right half, in both left-to-right and right-to-left decoding, since the target language, English, is right-branching.

left-to-right
ROUGE-1 ROUGE-2 ROUGE-L
Full 35.55 16.66 33.01
Left 24.44 9.87 23.34
Right 21.31 8.32 20.38
right-to-left
ROUGE-1 ROUGE-2 ROUGE-L
Full 35.22 16.55 32.59
Right 21.62 8.41 20.48
Left 23.60 9.54 22.52
Table 10: ROUGE F1 scores of left-to-right and right-to-left generated summaries in the abstractive summarization task. ROUGE-N is the N-gram based ROUGE F1 score; ROUGE-L is the longest-common-subsequence based ROUGE F1 score. "Full" means the entire generated summary.

7 Conclusion

In this work, we studied the problem of accuracy drop between the left and right halves of the results generated by neural machine translation models. We found that the influence of error propagation is overstated in the literature and that error propagation alone cannot explain accuracy drop. We showed that language branching correlates well with the accuracy drop problem, and the evidence from n-gram statistics and dependency statistics supports this correlation. Our discoveries suggest that left-to-right NMT models fit right-branching languages (e.g., English) better, while right-to-left NMT models fit left-branching languages (e.g., Japanese) better.

For future work, we will study more left/right-branching languages as well as languages with no obvious branching characteristics. We will also investigate how language branching influences other natural language tasks, especially for neural network based models.

References

  • Arpit et al. (2017) Devansh Arpit, Stanislaw K. Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron C. Courville, Yoshua Bengio, and Simon Lacoste-Julien. 2017. A closer look at memorization in deep networks. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 233–242. PMLR.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1171–1179.
  • Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM.
  • Berg et al. (2011) Thomas Berg et al. 2011. Structure in language: A dynamic perspective. Routledge.
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python, 1st edition. O’Reilly Media, Inc.
  • Cettolo et al. (2014) Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th iwslt evaluation campaign. In Proceedings of the 11th IWSLT.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1724–1734.
  • Gao et al. (2018) Fei Gao, Lijun Wu, Li Zhao, Tao Qin, Xueqi Cheng, and Tie-Yan Liu. 2018. Efficient sequence learning with group recurrent networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 799–808.
  • Hassan et al. (2018) Hany Hassan, Anthony Aue, Chang Chen, Vishal Chowdhary, Jonathan Clark, Christian Federmann, Xuedong Huang, Marcin Junczys-Dowmunt, William Lewis, Mu Li, Shujie Liu, Tie-Yan Liu, Renqian Luo, Arul Menezes, Tao Qin, Frank Seide, Xu Tan, Fei Tian, Lijun Wu, Shuangzhi Wu, Yingce Xia, Dongdong Zhang, Zhirui Zhang, and Ming Zhou. 2018. Achieving human parity on automatic chinese to english news translation. CoRR, abs/1803.05567.
  • He et al. (2017) Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, and Tieyan Liu. 2017. Decoding with value networks for neural machine translation. In Advances in Neural Information Processing Systems, pages 178–187.
  • He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.
  • Liu et al. (2016a) Lemao Liu, Andrew M. Finch, Masao Utiyama, and Eiichiro Sumita. 2016a. Agreement on target-bidirectional lstms for sequence-to-sequence learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2630–2637.
  • Liu et al. (2016b) Lemao Liu, Masao Utiyama, Andrew M. Finch, and Eiichiro Sumita. 2016b. Agreement on target-bidirectional neural machine translation. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 411–416.
  • Nakazawa et al. (2016) Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. Aspec: Asian scientific paper excerpt corpus.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA., pages 311–318.
  • Payne (2006) Thomas Payne. 2006. Exploring language structure: a student’s guide. Cambridge University Press.
  • Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732.
  • Sennrich et al. (2016a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation, WMT 2016, colocated with ACL 2016, August 11-12, Berlin, Germany, pages 371–376.
  • Sennrich et al. (2016b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • Shen et al. (2018) Yanyao Shen, Xu Tan, Di He, Tao Qin, and Tie-Yan Liu. 2018. Dense information flow for neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1294–1303.
  • Song et al. (2018) Kaitao Song, Xu Tan, Di He, Jianfeng Lu, Tao Qin, and Tie-Yan Liu. 2018. Double path networks for sequence to sequence learning. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, pages 3064–3074.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3104–3112.
  • Vaswani et al. (2018) Ashish Vaswani, Samy Bengio, Eugene Brevdo, François Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Lukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.
  • Wang et al. (2018) Yijun Wang, Yingce Xia, Li Zhao, Jiang Bian, Tao Qin, Guiquan Liu, and Tie-Yan Liu. 2018. Dual transfer learning for neural machine translation with marginal distribution regularization. In AAAI.
  • Wang et al. (2017) Yuguang Wang, Shanbo Cheng, Liyang Jiang, Jiajun Yang, Wei Chen, Muze Li, Lin Shi, Yanfeng Wang, and Hongtao Yang. 2017. Sogou neural machine translation systems for WMT17. In Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, pages 410–415.
  • Watanabe and Sumita (2002) Taro Watanabe and Eiichiro Sumita. 2002. Bidirectional decoding for statistical machine translation. In 19th International Conference on Computational Linguistics, COLING 2002, Howard International House and Academia Sinica, Taipei, Taiwan, August 24 - September 1, 2002.
  • Wikipedia (2018) Wikipedia. 2018. Head-directionality parameter — Wikipedia, the free encyclopedia. http://en.wikipedia.org/w/index.php?title=Head-directionality%20parameter&oldid=835385416. [Online; accessed 08-May-2018].
  • Williams and Zipser (1989) Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.
  • Wu et al. (2018) Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A study of reinforcement learning for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • Wu et al. (2017) Lijun Wu, Yingce Xia, Li Zhao, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2017. Adversarial neural machine translation. arXiv preprint arXiv:1704.06933.
  • Xia et al. (2018) Yingce Xia, Xu Tan, Fei Tian, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2018. Model-level dual learning. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 5379–5388.
  • Zhang et al. (2018) Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. CoRR, abs/1801.05122.

Appendix A Beyond Error Propagation in Neural Machine Translation: Characteristics of Language Also Matter (Supplemental Material)

A.1 NMT Datasets

The translation datasets used in our experiments come from five different translation tasks, described below.

1) IWSLT 2014 German-English (De-En) translation Cettolo et al. (2014). 2) WMT 2014 English-German (En-De) translation, using the preprocessed data from https://nlp.stanford.edu/projects/nmt/. 3) WMT 2017 English-Chinese (En-Zh) translation (http://www.statmt.org/wmt17/translation-task.html). 4) ASPEC English-Japanese (En-Jp) translation Nakazawa et al. (2016). 5) IWSLT 2014 English-Turkish (En-Tr) translation Cettolo et al. (2014).

For the first three translation tasks, the sentences are preprocessed into sub-words with byte-pair encoding Sennrich et al. (2016b), while for the En-Jp translation the sentences are kept at the word level. For the En-Tr translation, the data is additionally processed into morphological segments using Zemberek (https://github.com/orhanf/zemberekMorphTR).

A.2 NMT Models

Transformer Model

The generation model we use is Transformer Vaswani et al. (2017), which is based on the self-attention architecture. We use the transformer_small setting for De-En and En-Tr, transformer_base_v1 for En-De and En-Jp, and transformer_big for En-Zh Vaswani et al. (2018). For the right-to-left model, we simply reverse the target-language sentences in the training data. For example, for De-En translation, we first reverse the target English sentence and then pair the original German source sentence with the reversed English sentence as training data. The models are optimized with Adam as in the original paper Vaswani et al. (2017). During decoding, we generate the translation with simple greedy search.
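A minimal sketch of the target-side reversal used to build the right-to-left training data (file paths are illustrative):

def reverse_target(tgt_path, out_path):
    """Write a word-reversed copy of the target-side corpus."""
    with open(tgt_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(reversed(line.split())) + "\n")
    # The source side is reused unchanged; at inference time the generated
    # sentence is reversed back before computing BLEU.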

RNN Model

We also conduct experiments on RNN-based models. The RNN models adopted in Sections 6.2 and 6.3 are GRU-based single-layer models, which contain a bidirectional GRU encoder and a unidirectional GRU decoder. For the De-En translation task, the GRU model is relatively small, with the embedding size and hidden size set to the same value; the summarization model uses separately chosen embedding and hidden sizes. The models are trained with Adadelta.

A.3 Dependency Parsing Results

The dependency parsing results for the English and Japanese corpora are provided here; the examples are as follows.

English parsing

For English parsing we use the Stanford Parser (https://nlp.stanford.edu/software/lex-parser.shtml) together with NLTK Bird et al. (2009):

CASE: “and the great indicator of that , of course , is language loss”

PARSING:

[((u’loss’, u’NN’), u’cc’, (u’and’, u’CC’)),
((u’loss’, u’NN’), u’nsubj’, (u’indicator’, u’NN’)),
((u’indicator’, u’NN’), u’det’, (u’the’, u’DT’)),
((u’indicator’, u’NN’), u’amod’, (u’great’, u’JJ’)),
((u’indicator’, u’NN’), u’prep’, (u’of’, u’IN’)),
((u’of’, u’IN’), u’pobj’, (u’that’, u’DT’)),
((u’that’, u’DT’), u’prep’, (u’of’, u’IN’)),
((u’of’, u’IN’), u’pobj’, (u’course’, u’NN’)),
((u’loss’, u’NN’), u’cop’, (u’is’, u’VBZ’)),
((u’loss’, u’NN’), u’nn’, (u’language’, u’NN’))]

Then we count the dependency tuples whose two words both come from the left half or both from the right half; e.g., 'indicator' depends on 'the', and both belong to the left half.

Japanese Parsing

For Japanese parsing we use J.DepP (http://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/):

CASE: “これ ら の 要素 と 予測 精度 の 特性 に つ い て 説明 し た 。”

PARSING:

* 0 1D@0.908514 これら 名詞,代名詞,一般,*,*,*,これら,コレラ,コレラ B@0.000000 の 助詞,連体化,*,*,*,*,の,ノ,ノ I@0.000000
* 1 4D@0.000000 要素 名詞,一般,*,*,*,*,要素,ヨウソ,ヨーソ B@0.999910 と 助詞,格助詞,一般,*,*,*,と,ト,ト I@0.000000
* 2 3D@0.993463 予測 名詞,サ変接続,*,*,*,*,予測,ヨソク,ヨソク B@0.999645 精度 名詞,一般,*,*,*,*,精度,セイド,セイド I@0.028107 の 助詞,連体化,*,*,*,*,の,ノ,ノ I@0.000000
* 3 4D@0.000000 特性 名詞,一般,*,*,*,*,特性,トクセイ,トクセイ B@0.999907 について 助詞,格助詞,連語,*,*,*,について,ニツイテ,ニツイテ I@0.000000
* 4 -1D@0.000000 説明 名詞,サ変接続,*,*,*,*,説明,セツメイ,セツメイ B@0.999984 し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ I@0.014534 た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ I@0.000878 。 記号,句点,*,*,*,*,。,。,。 I@0.001575

The two ids at the beginning of each line indicate the dependency: the chunk id and the id of the chunk it depends on. In this case, the last token is “。”, in chunk 4; we simply remove this token when counting the number of dependencies.