Detecting and Understanding Generalization Barriers for Neural Machine Translation

by Guanlin Li, et al.
Harbin Institute of Technology

Generalization to unseen instances is our eternal pursuit for all data-driven models. However, for realistic tasks like machine translation, the traditional approach of measuring generalization in an average sense provides a poor understanding of fine-grained generalization ability. As a remedy, this paper attempts to identify and understand generalization barrier words within an unseen input sentence that cause the degradation of fine-grained generalization. We propose a principled definition of generalization barrier words and a modified version that is computationally tractable. Based on the modified one, we propose three simple methods for barrier detection by search-aware risk estimation through counterfactual generation. We then conduct extensive analyses of the detected generalization barrier words on both Zh↔En NIST benchmarks from various perspectives. Potential usage of the detected barrier words is also discussed.





1 Introduction

The performance of neural machine translation (NMT) models has been boosted significantly through novel architectural attempts (Gehring et al., 2017; Vaswani et al., 2017), carefully designed learning strategies (Ott et al., 2018) and semi-supervised techniques that smartly increase the size of the training corpus (Edunov et al., 2018; Ng et al., 2019). Meanwhile, with the architectural choice left unchanged, empirical results show that simply increasing the model capacity via delicate gradient control can lead to faster convergence and better performance (Wang et al., 2019; Zhang et al., 2019). However, all these improvements are measured in an average sense on a held-out dataset, and two potential limitations stand out. On the one hand, we should be careful about this average test performance comparison paradigm due to issues like test set overfitting (Recht et al., 2019; Mania et al., 2019); on the other hand, the average-case analysis only covers the mean data population and does not provide much information on questions such as which properties of the unseen input hinder the model’s generalization, which are receiving great attention in the trustworthy machine learning community (Amodei et al., 2016; Jia et al., 2019).

One possible solution to mitigate the above limitations is to analyze the properties of the unseen input sentence as a whole, which is an instance-level analysis instead of an average analysis. This is similar to the recent renaissance of out-of-distribution detection in the task of image classification (Chandola et al., 2009; Hendrycks and Gimpel, 2017; Liang et al., 2017). However, since an input sentence in machine translation consists of many words, we find that the overall generalization of the model on the sentence is mostly affected by a few words, and modifying them can largely improve translation quality. This phenomenon is shown in Figure 1, where changing quēxiàn to gēnjù leads to a much better translation of the input sentence. Therefore, it would be more appropriate to analyze those generalization barrier words, i.e. the words within an input sentence that hinder the overall generalization of the model on that sentence.

To this end, we first give a principled definition of generalization barriers in a counterfactual (Pearl and Mackenzie, 2018) way. Since the principled definition requires human evaluation, we instead provide a modified definition based on a novel statistic, which employs automatic evaluation to detect generalization barrier words. As it is costly to compute this statistic exactly, we propose three approximate estimators of its value. Based on the estimated value, we conduct experiments on two benchmarks to detect potential barriers in each unseen input sentence. In addition, we carry out systematic analyses of the detected barriers from different perspectives. We find that generalization barrier words are pervasive among different linguistic categories (Part-of-Speech) and very different from previously known troublesome source words (Zhao et al., 2018, 2019). Generalization barrier words tend to be complementary across different architectural choices. Moreover, modification of barrier words leads to more diversified hypothesis candidates, which might be a better choice for re-ranking (Yee et al., 2019) than the top-$k$ outputs obtained from one fixed input via beam search.

2 Related Literature

Troublesome words detection To our knowledge, back in the SMT era, Mohit and Hwa (2007) is the most related work. It introduces the notion of a ’hard-to-translate phrase’ at the source side and uses removal to determine its effect on the model’s generalization on the translation of other phrases, which is very similar to our usage of counterfactual generation by editing the source words. Recently, Zhao et al. (2018, 2019) were the first to detect trouble makers at the source side globally for NMT. In Zhao et al. (2018), the troublesome source words are detected through an exception rate, defined as the number of troublesome alignments of a word divided by the total number of its alignments, where the troublesome alignments are obtained through an extrinsic statistical aligner instead of the trained NMT model. In Zhao et al. (2019), the troublesome source words are constrained to words with high translation entropy, which tend to be under-translated by the model. Both of their trouble detection heuristics are: 1) context-unaware, i.e. globally applied to every source word without considering the context of the word, and 2) model-unaware, i.e. dependent on extrinsic statistical assumptions. In our work, we try to detect generalization barriers that are both context-aware and model-specific for every unseen source input.

Out-of-Distribution (OOD) detection OOD detection, Novelty (Markou and Singh, 2003), Outlier (Hodge and Austin, 2004) or Anomaly Detection (Chandola et al., 2009) care about how likely the unseen input as a whole is to be a sample different from the training distribution. This problem has recently been revived in the task of image classification (Hendrycks and Gimpel, 2017; Liang et al., 2017; Choi et al., 2018). Although Ren et al. (2019) recently started to consider OOD detection on sequential data, i.e. gene fragments, they still regard the input feature as a holistic vehicle that causes the mismatch in the underlying generative distribution. Our work is motivated by this OOD detection literature in the spirit of detecting the inputs that the model cannot generalize well upon. Beyond that, due to the structural property of the translation task, we also carry out a more fine-grained detection of causes that could be a part of the input feature, which can potentially consist of several high-risk words. Notably, researchers in OOD detection have recently started to focus on the structure of the input and to design benchmarks for such detection tasks, e.g. image anomaly segmentation, which focuses on small patches in the image (Hendrycks et al., 2019).

Error analysis and interpretability Recently, Wu et al. (2019) propose to conduct error analysis for natural language processing tasks with three principles in mind: scalable, reproducible and counterfactual. These principles also guide the computational considerations of our detection method. For NMT, Lei et al. (2019) are the first to focus on accurately detecting wrong and missing translations of certain source words. Different from their work, which detects the unsatisfactorily translated source words themselves, our work focuses on detecting their cause, and is complementary to the recent interpretability analysis of word importance (He et al., 2019).

3 Generalization Barriers

Figure 1: A showcase of histograms of the evaluation metric value (smoothed sentence-level BLEU) at all source positions. Every histogram is drawn by collecting every possible metric value after editing ($|\mathcal{V}|$ values in total, where $\mathcal{V}$ is the source vocabulary), and then the BLEU spectrum 0.1-0.4 on the x-axis is divided into 50 bins. In each histogram, the orange band (if it exists) shows the metric values of the edited sources that are above the original metric value, and the blue part shows the metric values that are below the original metric value. We can judge the risk of a word being a generalization barrier word by focusing on the orange band, i.e. we measure its truncated mean as an average risk of the word.

Mainstream NMT is formulated as a sequence-to-sequence structured prediction problem, modeled and factorized as follows:

$$P(\mathbf{y} \mid \mathbf{x}; \theta) = \prod_{t=1}^{|\mathbf{y}|} P(y_t \mid \mathbf{y}_{<t}, \mathbf{x}; \theta). \quad (1)$$

Maximum Likelihood Estimation (MLE) training is conducted on the training set $\mathcal{D} = \{(\mathbf{x}^n, \mathbf{y}^n)\}_{n=1}^{N}$, where $\mathbf{x}^n$ and $\mathbf{y}^n$ are the source and target sentences of the $n$-th pair, with minibatch SGD to obtain an estimate $\hat{\theta}$ of the parameter weights (Luong and Manning, 2015; Gehring et al., 2017; Vaswani et al., 2017). Like all other structured prediction problems with a scoring function and a decoding algorithm (Daumé III, 2006), for NMT, $P(\mathbf{y} \mid \mathbf{x}; \hat{\theta})$ acts as the scoring function and beam search is used as the (approximate) decoding algorithm. Since beam search is a deterministic algorithm with a preset beam size, the prediction is solely determined by the input $\mathbf{x}$, denoted as a map $\hat{\mathbf{y}} = f(\mathbf{x})$. Under this setting, we are interested in the causal question: how does the input $\mathbf{x}$ cause the model’s failure on the prediction?
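As a small illustration of the autoregressive factorization, the sequence probability is the product of per-token conditional probabilities, so its logarithm is a sum. The sketch below assumes the per-step probabilities are already available from some model; the function name and the numbers are hypothetical.

```python
import math


def sequence_log_prob(token_probs):
    """Sum of per-token log-probabilities log P(y_t | y_<t, x)."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical probabilities a model assigns to a 3-token reference.
step_probs = [0.5, 0.25, 0.8]
log_p = sequence_log_prob(step_probs)
sequence_prob = math.exp(log_p)  # product of the three step probabilities
```

Negating `log_p` gives the per-sentence MLE loss that minibatch SGD would minimize.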

As a natural language sentence, the input is a sequence with compositional structure that forms its whole semantics. On the one hand, we want the NMT model to generalize well on the previously unseen input x, which means ideally it should be able to generalize well on any possible (meaningful) subsequence of x. On the other hand, the cause of the model’s generalization degradation should be attributed to some of the subsequences or their ways of composition. Therefore, one perspective to shed light on the above "how" question is to try to detect the set of all subsequences of x that can potentially deteriorate the model’s generalization, which we dub generalization barriers. In the following subsections, we give a principled but abstract definition of generalization barriers and an approximate but tractable version of it, which treats each source word independently without considering possible combinatorial compositions. Then we construct a statistic for each source word to represent its risk of being a generalization barrier word.

3.1 A definition with human effort

The principled definition of generalization barriers is based on the intuition that the model can potentially generalize well on some edited versions of $\mathbf{x}$, i.e. versions obtained by word substitution and deletion that try to preserve the original symbolic compositional structure (e.g. word order) and semantics of $\mathbf{x}$ as much as possible. This intuition also matches the causal question we have asked before, since we are actually generating counterfactuals through intervening on (editing) $\mathbf{x}$ (Chang et al., 2018). Specifically, if we denote a certain edited version of $\mathbf{x}$ as $\mathbf{x}'$, the generalization barrier is defined as $\mathbf{x} \setminus \mathbf{x}'$, where $\mathbf{x}'$ can be any edited version of $\mathbf{x}$ that satisfies the constraints in Definition 3.1. The operator $\setminus$ returns a subsequence of $\mathbf{x}$ by removing the words that $\mathbf{x}$ and $\mathbf{x}'$ share.

Definition 3.1.

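(Generalization Barriers) Given an NMT model trained on $\mathcal{D}$ with learned parameters $\hat{\theta}$, a distance measure $d(\cdot, \cdot)$, e.g. the edit distance, and an input $\mathbf{x}$, we call the set of subsequences $\mathbf{x} \setminus \mathbf{x}'$, whose edited versions $\mathbf{x}'$ satisfy the following constraints, the generalization barriers of $\mathbf{x}$.

  1. The distance measure $d(\mathbf{x}, \mathbf{x}')$ is minimized;

  2. Human evaluation of the translation quality on $\mathbf{x}'$ reaches a satisfactory level.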

Remarks Definition 3.1 is principled because: a) it respects the compositional nature of possible generalization barriers instead of considering only individual words; b) it handles semantic shift properly by largely preserving the original words and the word order through a minimized distance between the input and its edited version. The second benefit can be seen as false discovery control (Gimenez and Zou, 2019), since without this distance constraint we can always find a well-generalizable subsequence in x by deleting most of the words.

3.2 Approximating the definition with counterfactuals

However, the above definition is hard to scale up due to the large search space and the need for human evaluation. So we make the following assumptions to modify it, writing $\mathbf{x}'$ for an edited version of $\mathbf{x}$ and $d$ for the distance measure: a) the minimization of $d(\mathbf{x}, \mathbf{x}')$ is replaced by fixing $d(\mathbf{x}, \mathbf{x}') = 1$, which tremendously restricts the search space by editing only one word to investigate its possibility of being a barrier; b) the human evaluation is replaced by automatic evaluation with a metric such as smoothed sentence-level BLEU (Lin and Och, 2004), since $d(\mathbf{x}, \mathbf{x}') = 1$ roughly leads to an unchanged reference $\mathbf{y}$.
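To make the automatic metric concrete, the following is a minimal sketch of an add-one smoothed sentence-level BLEU in the spirit of Lin and Och (2004). The function names, the single-reference setup and the exact smoothing variant (add one to the match and total counts for n-gram orders above 1) are assumptions for illustration, not necessarily what was used in the experiments.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_sentence_bleu(hyp, ref, max_n=4):
    """Add-one smoothed sentence BLEU against a single reference."""
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n > 1:  # smoothing: higher orders never yield a zero precision
            match, total = match + 1, total + 1
        if match == 0:  # no unigram overlap at all
            return 0.0
        log_precision += math.log(match / total) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision)


score = smoothed_sentence_bleu("a b x".split(), "a b c".split())
```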

Now we investigate each source word independently, again by counterfactual generation. Instead of finding one single counterfactual, which might not be perceived by humans as a natural sentence, we take inspiration from Burns et al. (2019) and Chang et al. (2018), who edit a certain patch in an image with potentially infinitely many infilling patches and compute the importance score of the original patch in expectation. We likewise generate as many edit choices as possible, so that the edits are likely to include natural sentences. Suppose $\mathcal{V}$ is the source vocabulary and $\mathcal{E}_i$ is the set of all sentences edited from $\mathbf{x}$ at position $i$. Accordingly, the size of $\mathcal{E}_i$ is $|\mathcal{V}|$, which corresponds to one deletion and $|\mathcal{V}| - 1$ substitutions. Then we can obtain $|\mathcal{V}|$ counterfactual performance measures:

$$\{\, m(f(\mathbf{x}'), \mathbf{y}) \mid \mathbf{x}' \in \mathcal{E}_i \,\}, \quad (2)$$

where $m$ is the automatic evaluation metric and $f(\mathbf{x}')$ is the beam search output for $\mathbf{x}'$, based on which we can draw a histogram with binned metric values. Figure 1 is a showcase for a given input sentence: we conduct real decoding for each of the 28 words and plot the corresponding histograms, which carry a large amount of information. We regard each histogram as a distribution of the counterfactual generalization performances.
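The single-position edit set described above (one deletion plus a substitution by every other vocabulary word) can be sketched as follows. The function name and the toy vocabulary are hypothetical, and the sketch assumes the original word is in the vocabulary and that vocabulary entries are unique.

```python
def edit_set(sentence, position, vocab):
    """All single-word edits of `sentence` at `position`.

    Returns one deletion plus one substitution for every vocabulary
    word other than the original word at that position.
    """
    original = sentence[position]
    left, right = sentence[:position], sentence[position + 1:]
    edits = [left + right]  # the deletion edit
    for word in vocab:
        if word != original:
            edits.append(left + [word] + right)  # a substitution edit
    return edits


vocab = ["the", "cat", "dog", "sat"]
x = ["the", "cat", "sat"]
candidates = edit_set(x, 1, vocab)
```

Decoding every element of `candidates` and scoring it against the reference would give the metric values that populate one histogram.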

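As we can identify in Figure 1, the right orange band of a histogram (if it exists) shows the counterfactuals with better generalization, and if that part dominates the distribution, we can conclude that the word being edited has a high risk of causing the degradation of generalization on $\mathbf{x}$. In practice, we use the empirical truncated mean at position $i$ to represent $x_i$’s risk of being a generalization barrier word as follows:

$$\mu_i = \frac{1}{|\mathcal{E}_i^{+}|} \sum_{\mathbf{x}' \in \mathcal{E}_i^{+}} m(f(\mathbf{x}'), \mathbf{y}), \quad (3)$$

where $\mathcal{E}_i^{+}$ is the subset of edits $\mathbf{x}'$ at position $i$ with $m(f(\mathbf{x}'), \mathbf{y}) > m_0$, and $m_0 = m(f(\mathbf{x}), \mathbf{y})$ is the metric value of the beam search output $f(\mathbf{x})$ for the original input. The set $\mathcal{E}_i^{+}$ corresponds to the orange band in the histogram. The higher the risk, the more likely the word is a generalization barrier word. In Figure 1, the truncated mean (tm) is shown above each histogram, with ’null’ denoting that the position has no orange band.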
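A minimal sketch of the truncated mean follows, assuming the metric values of the edited inputs have already been obtained by decoding. The function name is hypothetical, and returning `None` mirrors the ’null’ label used when no edit beats the original.

```python
def truncated_mean(edited_scores, original_score):
    """Mean of the metric values that exceed the original metric value.

    Returns None when no edited input improves on the original,
    i.e. when the histogram has no orange band.
    """
    better = [s for s in edited_scores if s > original_score]
    if not better:
        return None
    return sum(better) / len(better)


risk = truncated_mean([0.10, 0.30, 0.40, 0.20], original_score=0.25)
```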

Definition 3.2.

(Generalization Barrier Words) The generalization barrier words in x are those whose risk, i.e. the truncated mean at their position, reaches a satisfactory level.

In practice the hard threshold could vary for different x, so we use a soft one, the top-$k$ risky words, for deciding the potential generalization barriers.
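The soft threshold amounts to ranking positions by risk and keeping the first $k$. The sketch below is illustrative only; the function name and the handling of positions without any improving edit (risk `None`, which are skipped) are assumptions.

```python
def top_k_risky(risks, k):
    """Positions of the k highest-risk words, highest risk first.

    `risks[i]` is the truncated mean at position i, or None when the
    position has no improving edit; such positions are skipped.
    Ties are broken by the smaller position.
    """
    scored = [(r, i) for i, r in enumerate(risks) if r is not None]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


barriers = top_k_risky([0.2, None, 0.5, 0.3], k=2)
```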

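Input: a risk estimator S; an unseen pair $(\mathbf{x}, \mathbf{y})$, position $i$, budget $B$, first-stage sample size $M$; the learned NMT model $P(\mathbf{y} \mid \mathbf{x}; \hat{\theta})$; the source embedding matrix
Output: the estimated truncated mean $\hat{\mu}_i$
1:  Initialize the sample set $\mathcal{S}_i = \emptyset$;
2:  if S = Uniform then
3:     Uniformly sample $B$ elements from the edit set $\mathcal{E}_i$, and add them to $\mathcal{S}_i$;
4:  else if S = Stratified then
5:     Uniformly sample $M$ elements from $\mathcal{E}_i$ as a candidate pool;
6:     Compute the loss in Eq. (4) for each candidate in the pool;
7:     Use the loss to choose the top-$B$ elements in the pool, and add them to $\mathcal{S}_i$;
8:  else if S = Gradient-aware then
9:     Compute Eq. (5) to get the updated embedding of $x_i$;
10:    Use its normalized similarity to the vocabulary embeddings to sample $B$ elements from $\mathcal{E}_i$, and add them to $\mathcal{S}_i$;
11:  end if
12:  Conduct real decoding on $\mathcal{S}_i$ and compute $\hat{\mu}_i$ supported on $\mathcal{S}_i$ rather than $\mathcal{E}_i$;
13:  return $\hat{\mu}_i$;
Algorithm 1: Evaluate the risk of $x_i$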

3.3 Estimating the truncated mean

According to the definitions in Eq. (2) and Eq. (3), one has to decode every sentence in the edit set $\mathcal{E}_i$ at position $i$, and there are $|\mathcal{V}|$ such sentences in total, where $\mathcal{V}$ is the source vocabulary. Unfortunately, as each decoding takes a few seconds, it is impractical to exactly calculate the truncated mean $\mu_i$ as well as the set of improving edits it is supported on. As a result, we instead propose a simple yet effective algorithm as an inexact solution. The key idea is to call the decoder $B$ times, with $B$ as a budget. Specifically, we randomly sample $B$ elements from $\mathcal{E}_i$ to obtain a sample set $\mathcal{S}_i$. Then we calculate both the improving subset and its mean supported on $\mathcal{S}_i$. In this way we can approximately calculate $\mu_i$ by enumerating at most $B$ elements in $\mathcal{S}_i$. To randomly sample elements from $\mathcal{E}_i$, we predefine three distributions heuristically, which lead to three different estimators as follows.


Uniform A very simple unbiased estimator of the truncated mean is to uniformly sample $B$ elements from the edit set, and compute the mean of those metric values that are larger than the original metric value. However, since we do not restrict the substitutions, two potential issues might lead to large variance of uniform sampling: a) waste of budget: substitutions that lead to metric values lower than the original one could be the majority; b) hardness of coverage (less concentrated): the wider the range of the orange band (in the histogram of Figure 1), the larger the variance.
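The uniform estimator can be sketched as below. Here `score_fn` is a hypothetical stand-in for "decode the edited input and evaluate it against the reference", which is the expensive step in practice; the other names are likewise illustrative.

```python
import random


def uniform_estimate(edits, score_fn, original_score, budget, rng=random):
    """Estimate the truncated mean from `budget` uniformly sampled edits.

    `score_fn(edit)` stands in for real decoding plus metric evaluation.
    Returns None when no sampled edit beats the original score.
    """
    sample = rng.sample(edits, min(budget, len(edits)))
    better = [s for s in map(score_fn, sample) if s > original_score]
    return sum(better) / len(better) if better else None


# Toy setup: 10 edits whose "metric value" is simply edit_id / 10.
toy_edits = list(range(10))
estimate = uniform_estimate(
    toy_edits, lambda e: e / 10, original_score=0.65, budget=10,
    rng=random.Random(0),
)
```

With the budget equal to the number of edits, as in this toy call, the estimate coincides with the exact truncated mean; smaller budgets trade accuracy for fewer decoder calls.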

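Stratified To be less stochastic and combat variance, we can first use uniform sampling to randomly pick $M$ elements from the edit set $\mathcal{E}_i$, with $M$ larger than the budget $B$, and then use the loss function

$$\ell(\mathbf{x}', \mathbf{y}) = -\log P(\mathbf{y} \mid \mathbf{x}'; \hat{\theta}) \quad (4)$$

of each sampled edit $\mathbf{x}'$ as a surrogate to choose the top-$B$ from the $M$ choices. The first stage respects the uniform distribution in $\mathcal{E}_i$, while the second stage is deterministic (i.e., top-$B$ likelihood values), which can potentially lower the variance.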
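The two-stage selection can be sketched as follows, assuming a hypothetical `loss_fn` that returns the model's loss on the reference given an edited input (cheap compared with beam search, since it needs only one forced-decoding pass). All names are illustrative.

```python
import random


def stratified_sample(edits, loss_fn, first_stage_size, budget, rng=random):
    """Two-stage selection of edits to decode.

    Stage 1 draws a uniform pool of `first_stage_size` edits.
    Stage 2 deterministically keeps the `budget` edits with the lowest
    loss, i.e. the highest likelihood of the reference.
    """
    pool = rng.sample(edits, min(first_stage_size, len(edits)))
    pool.sort(key=loss_fn)  # lowest loss first
    return pool[:budget]


# Toy setup: the "loss" of edit e is just e, so low ids are preferred.
toy_edits = list(range(20))
chosen = stratified_sample(
    toy_edits, lambda e: e, first_stage_size=20, budget=3,
    rng=random.Random(0),
)
```

The selected edits would then be decoded for real and fed into the truncated-mean computation.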

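Gradient-aware To avoid the first-stage sample size hyper-parameter of the stratified method, we can utilize the gradient of the original loss, which guides the change of the embedding of $x_i$ that can minimize the loss:

$$\tilde{\mathbf{e}}(x_i) = \mathbf{e}(x_i) - \eta \, \nabla_{\mathbf{e}(x_i)} \ell(\mathbf{x}, \mathbf{y}), \quad (5)$$

where $\mathbf{e}(x_i)$ is the embedding of $x_i$ and $\ell(\mathbf{x}, \mathbf{y})$ is the loss on the original input. Contrary to the method of adversarially modifying the input in Cheng et al. (2019), we conduct a 1-step gradient update with learning rate $\eta = 1.0$ to minimize the original loss, and then use the normalized dot product similarity between the updated embedding $\tilde{\mathbf{e}}(x_i)$ and all other embeddings of the source vocabulary to bias the sampling of $B$ elements from the edit set at position $i$.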
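A rough sketch of the gradient-aware sampling distribution is given below. The gradient is passed in as a plain list (in practice it would come from backpropagation through the NMT model), the normalization is assumed to be a softmax over dot products, and sampling is done with replacement for simplicity. All of these, along with the function names, are assumptions; the deletion edit is not modeled here.

```python
import math
import random


def gradient_aware_distribution(embeddings, word, grad, lr=1.0):
    """Sampling distribution over substitute words for `word`.

    Takes one gradient step on the word's embedding to reduce the loss,
    then scores every other vocabulary word by its dot product with the
    updated embedding and normalizes the scores with a softmax.
    """
    updated = [e - lr * g for e, g in zip(embeddings[word], grad)]
    sims = {
        w: sum(a * b for a, b in zip(updated, vec))
        for w, vec in embeddings.items() if w != word
    }
    top = max(sims.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in sims.items()}
    z = sum(exps.values())
    return {w: v / z for w, v in exps.items()}


def sample_substitutes(dist, budget, rng=random):
    """Draw `budget` substitute words (with replacement) from `dist`."""
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words], k=budget)


toy_emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
dist = gradient_aware_distribution(toy_emb, "a", grad=[1.0, -1.0])
subs = sample_substitutes(dist, budget=4, rng=random.Random(0))
```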

The full procedures of the three estimators are summarized in Algorithm 1.

4 Experimental Conditions

In this section, we set up the overall experimental scenarios regarding the data configuration and the model architectural choice.

Data settings We conduct experiments on the Zh→En and En→Zh translation tasks using the well-known NIST benchmark. The dev and test datasets of the NIST benchmark are marked by year, e.g. NIST02 (dev), NIST03, etc. For Zh→En, each dev/test source sentence has four references; for En→Zh, we pick the first of the four English references as the source-side instance. During the truncated mean estimation stage, for the Zh→En translation task, we use the first reference as the ground truth in the smoothed sentence-level BLEU calculation.

Model settings We consider three types of basic model architectures, proposed in Luong and Manning (2015), Gehring et al. (2017) and Vaswani et al. (2017) respectively, representing the advancement of architectural inductive bias in recent years. Their average performance over NIST03, 04, 05, 06 and 08 is summarized in Table 7 in Appendix A.1.

5 Analyses

5.1 Comparing the estimators

More detailed information about the evaluation metrics used in this subsection is given in Appendix A.2.

We conduct simulation experiments on 50 unseen sentence pairs from NIST03 with whole-vocabulary decoding to compute the ground-truth truncated mean for each source word with Eq. (3), and then compare the proposed sampling methods in terms of overlap@$k$ and the variance (rank stability) of the estimator under different budgets. For the stratified strategy, the first-stage sample size is set to 500 for the smaller budgets and to larger values for the larger budgets. To obtain statistically reliable results, for each source word we repeat the estimation procedure 25 times.

Figure 2: The overlap@$k$ metric values of the three proposed estimation methods on the 50 samples under different budgets (5 to 5000); $k$ is set to 5, 10 and 15.
Figure 3: The rank stability of the three proposed estimation methods under different budgets. The values are averaged over the 50 chosen samples and measure the variance of the methods over repeated experiments.
| Budget | 5 | 10 | 25 | 50 | 100 | 250 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|
| Time cost (seconds per sentence) | 4 | 7 | 17 | 33 | 65 | 180 | 360 | 600 |

Table 1: The time cost of the uniform estimator under different budgets; note that the time cost is an average over sentences.

Accuracy We use the overlap@$k$ metric to measure the similarity between the top-$k$ risky words obtained with the exact and the approximate risk calculation methods. As demonstrated in Figure 2, the different methods lead to very similar performance. With a budget larger than 100, they reach an average overlap@$k$ of around 85%, which we consider sufficient for the subsequent analyses.
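One plausible reading of overlap@$k$ is the fraction of the exact top-$k$ positions that also appear in the approximate top-$k$. The exact definition is deferred to the appendix, so the sketch below, including its name and tie-breaking by position, is an assumption.

```python
def overlap_at_k(exact_risks, approx_risks, k):
    """Share of top-k positions common to two risk rankings."""
    def top_k(risks):
        # Sort positions by descending risk; ties go to the smaller index.
        order = sorted(range(len(risks)), key=lambda i: (-risks[i], i))
        return set(order[:k])

    return len(top_k(exact_risks) & top_k(approx_risks)) / k


exact = [0.9, 0.1, 0.5, 0.3]
approx = [0.8, 0.4, 0.2, 0.6]
score = overlap_at_k(exact, approx, k=2)
```

In this toy call the exact top-2 positions are {0, 2} and the approximate ones are {0, 3}, so one of the two is shared.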

Variance The rank stability is measured through Kendall’s coefficient of concordance (Mazurek, 2011), which essentially calculates the similarity among the different (repeat=25) rankings of each sample. The larger the value, the more consistent the rankings stay across different runs, and thus the smaller the variance of the estimator. As shown in Figure 3, the uniform and gradient-aware estimators have similar variance, while the stratified estimator has lower variance, which might benefit from its deterministic second stage.
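For reference, Kendall's coefficient of concordance for $m$ rankings of $n$ items without ties is $W = 12S / (m^2(n^3 - n))$, where $S$ is the sum of squared deviations of the per-item rank totals from their mean. The sketch below implements this textbook formula; the function name is hypothetical and tie correction is omitted.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance (no tie correction).

    `rankings` is a list of m rank vectors; each assigns the ranks
    1..n to the same n items. Returns a value between 0 and 1.
    """
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))


agree = kendalls_w([[1, 2, 3], [1, 2, 3]])
disagree = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

Here each rank vector would be the ranking of source positions by estimated risk in one repeated run.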

Complexity We also summarize the time cost of each budget in Table 1. Since most of the time complexity comes from real decoding, here we only measure the time cost of the uniform estimator. We test the process on a single M40 GPU.

As a trade-off between accuracy, variance and time cost, we adopt the stratified strategy under a fixed budget as our approximate detection method in all subsequent analyses. This takes around 16 hours for 1k sentences, with a decent detection accuracy of around 85% with respect to overlap@$k$ and a rank stability of up to 84%.

5.2 Characterizing the Generalization Barrier Words

In this section, we try to characterize the detected barrier words from different perspectives, i.e. to understand them through their linguistic properties and through their comparison with other source word categorizations in a statistical sense. We use the LTP (Che et al., 2010) and AllenNLP (Gardner et al., 2018) toolkits for the basic linguistic analyses of Chinese and English respectively.

5.2.1 Linguistic properties

| POS cat. | k=5 | k=10 | k=15 | base |
|---|---|---|---|---|
| BPE | 15.04% | 14.88% | 14.72% | 15.33% |
| Noun | 16.22% | 16.13% | 16.48% | 17.63% |
| Prop. N. | 6.19% | 6.57% | 6.51% | 7.44% |
| Pron. | 1.79% | 2.15% | 2.36% | 2.35% |
| Verb | 18.81% | 18.73% | 18.93% | 18.36% |
| Adj. | 2.54% | 2.84% | 2.93% | 3.19% |
| Adv. | 4.49% | 4.35% | 4.32% | 4.07% |
| Prep. | 4.42% | 4.33% | 4.43% | 3.83% |
| Punc. | 15.24% | 14.11% | 13.37% | 11.44% |
| Q&M | 4.24% | 4.74% | 4.65% | 4.87% |
| C&C | 1.58% | 2.02% | 2.06% | 2.23% |

(a) on NIST03, Zh→En direction

| POS cat. | k=5 | k=10 | k=15 | base |
|---|---|---|---|---|
| BPE | 10.18% | 10.86% | 11.33% | 12.00% |
| Noun | 21.90% | 22.71% | 22.44% | 24.07% |
| Pron. | 2.17% | 2.26% | 2.30% | 2.15% |
| Verb | 11.98% | 11.41% | 11.66% | 11.26% |
| Adj. | 7.15% | 7.43% | 7.67% | 8.19% |
| Adv. | 3.14% | 3.07% | 3.06% | 2.93% |
| Prep. | 13.13% | 13.10% | 12.74% | 11.88% |
| Punc. | 14.64% | 13.03% | 12.33% | 10.41% |
| Det. | 8.39% | 8.91% | 9.20% | 9.05% |
| C&C | 1.74% | 1.81% | 1.86% | 2.20% |

(b) on NIST03, En→Zh direction

Table 2: Distribution of the detected generalization barrier words according to Part-of-Speech category.

Distribution over Part-of-Speech In this part, we summarize the distribution of the detected generalization barrier words with respect to their Part-of-Speech (POS) tags. In order to account for subword segments, we first use a POS tagger to label the BPE-restored corpus, and then map the non-subword segments to the corresponding POS tags and the subword segments to a special tag named BPE, so that we can readily measure the ratio of subwords. The summary statistics are shown in Table 2. For comparison, the natural distribution of all words over POS tags is shown in the base column alongside the detected generalization barrier words.

For both Chinese and English source inputs, barrier words are pervasive across all POS categories, and their distribution does not differ much from the base distribution. Note that function words such as prepositions and punctuation increase the most over the base (by roughly 3 percentage points for punctuation). For the English source, BPE segments are less likely to be barriers, which indicates the benefit of subword-based segmentation. Content words such as nouns and proper nouns tend to be relatively less ambiguous and less context-dependent, and thus tend to cause fewer problems.

| Distance | k=5 | k=10 | k=15 | base |
|---|---|---|---|---|
| 1 | 54.49% | 54.08% | 53.40% | 51.17% |
| 2 | 85.31% | 85.44% | 85.20% | 83.75% |

Table 3: The recall@$k$ statistics with respect to the distance to the leaves of the dependency tree (on NIST03, Zh→En direction).

Branch or leaves? In this part, we conduct an analysis to shed light on the following question: do barrier words mostly come from the main branch of the source dependency tree or from modifiers? Since AllenNLP’s dependency parser re-tokenizes the original sources for English, we only provide statistics on Chinese in Table 3. Distance means the distance of each word to the leaves of the dependency tree; the base column shows how many words are covered within a certain distance, and the other entries, for a specific $k$, show how many of the top-$k$ risky words (the detected barrier words) are recalled. There is a slight tendency for barrier words to be closer to the leaves than to the main branch.

5.2.2 Comparing to other source word categorizations

| Task | Word cat. | k=5 | k=10 | k=15 |
|---|---|---|---|---|
| Zh→En | Random | 21.16% | 40.79% | 58.09% |
| Zh→En | Frequency | 20.83% | 40.23% | 55.62% |
| Zh→En | Entropy | 22.17% | 42.81% | 58.51% |
| Zh→En | Exception | 21.75% | 42.98% | 58.16% |
| En→Zh | Random | 18.90% | 36.90% | 52.27% |
| En→Zh | Frequency | 18.24% | 34.86% | 50.11% |
| En→Zh | Entropy | 20.76% | 38.17% | 53.28% |
| En→Zh | Exception | 18.47% | 36.83% | 51.89% |

Table 4: The overlap@$k$ statistics with respect to different types of troublesome word statistics, which do not utilize real decoding.

In this part, we compare the detected generalization barrier words with other source word categorizations: a) low-frequency words; b) high translation entropy words (Zhao et al., 2019); and c) exception words (Zhao et al., 2018). They are all based on certain global statistical clues of the training corpus, e.g. alignments obtained extrinsically. Words in a) are commonly said to cause generalization error, while words in b) and c) are dubbed under-translated and troublesome words respectively in the corresponding papers. Here, we want to know whether those probable trouble makers are generalization barrier words.

Since a) - c) all use global statistics for each word, to compare with the generalization barriers annotated with local risk, for each unseen input x we use each word’s global statistical clue to annotate it in this local context, so that overlap@$k$ can be used for comparison. The three statistics are inverse frequency, translation entropy and exception rate. The translation entropy of a source word is obtained by estimating its lexical translation probability and computing the entropy of this distribution over all words in the target vocabulary. The exception rate of a word is the ratio between the number of its exception alignments, according to a certain exception condition, and the total number of its alignments across the training corpus. A detailed introduction of the trouble makers is given in Appendix A.3.
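The two alignment-based statistics can be sketched as follows, assuming the lexical translation distribution and the alignment counts of a source word have already been estimated from the training corpus. The function names and the toy numbers are hypothetical.

```python
import math


def translation_entropy(lex_probs):
    """Entropy of a source word's lexical translation distribution.

    `lex_probs` maps target words to P(target | source word).
    """
    return -sum(p * math.log(p) for p in lex_probs.values() if p > 0)


def exception_rate(num_exception_alignments, num_alignments):
    """Share of a word's alignments that meet the exception condition."""
    return num_exception_alignments / num_alignments


ambiguous = translation_entropy({"bank": 0.5, "shore": 0.5})
unambiguous = translation_entropy({"cat": 1.0})
rate = exception_rate(3, 12)
```

A word with many equally likely translations gets a high entropy, while a word that is always translated the same way gets zero.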

Table 4 shows the overlap@$k$ values for Zh→En and En→Zh. The Random row shows the metric values obtained if we randomly choose an order of the source words. All categorizations are very close to random, with Entropy only slightly better, which indicates that our generalization barrier words, which rely on statistics from inference-aware counterfactuals, are very different. This highlights the novelty of the phenomenon, and implies the importance of studying generalization with explicit consideration of inference.

5.3 Context-sensitive/agnostic barriers

In this section, we try to aggregate local statistics to obtain a global understanding: are some words prone to be generalization barriers in a context-agnostic way, or the reverse? We aggregate the top-$k$ words in each test input and calculate their counts. Specifically, if one appearance of a word roughly represents a context, we can calculate the probability of a detected barrier word $w$ being a universal barrier according to the following barrier rate:

$$r(w) = \frac{\#\{\text{appearances of } w \text{ detected as a top-}k \text{ barrier}\}}{\#\{\text{appearances of } w\}}. \quad (6)$$

We then summarize the distribution of each word’s barrier rate in Figure 4. The two horizontal dashed red lines are at 0.4 and 0.05, indicating highly context-agnostic and context-sensitive words respectively. As can be seen, there are few context-agnostic barriers and most of the barrier words are very sensitive to context, indicating the necessity of pursuing large-scale training data with abundant contexts (Schwenk et al., 2019a, b).
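The aggregation behind the barrier rate can be sketched as below: count how often each word appears in the test inputs and how often it falls among the detected top-$k$ positions. The function name, the input layout and the `min_count` filter (analogous to keeping only words with enough contexts) are assumptions for illustration.

```python
from collections import Counter


def barrier_rates(sentences, barrier_positions, min_count=1):
    """Per-word ratio of barrier appearances to all appearances.

    `sentences` is a list of token lists; `barrier_positions[j]` is the
    set of positions detected as top-k barriers in sentence j.
    Words seen fewer than `min_count` times are left out.
    """
    appearances, as_barrier = Counter(), Counter()
    for tokens, positions in zip(sentences, barrier_positions):
        for i, word in enumerate(tokens):
            appearances[word] += 1
            if i in positions:
                as_barrier[word] += 1
    return {
        w: as_barrier[w] / c
        for w, c in appearances.items() if c >= min_count
    }


rates = barrier_rates([["a", "b"], ["a", "c"]], [{0}, {1}])
```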

Figure 4: The distribution of barrier rate across words with context count larger than 10 on NIST 03-06.

5.4 Complementary across architectures

| Task | Arch. pair | k=5 | k=10 | k=15 |
|---|---|---|---|---|
| Zh→En | san-fconv | 28.65% | 45.03% | 58.42% |
| Zh→En | san-rnn | 25.46% | 43.73% | 57.68% |
| Zh→En | fconv-rnn | 27.64% | 45.52% | 58.85% |
| En→Zh | san-fconv | 24.13% | 39.62% | 53.18% |
| En→Zh | san-rnn | 24.80% | 40.40% | 53.41% |
| En→Zh | fconv-rnn | 23.70% | 40.11% | 53.24% |

Table 5: The overlap@$k$ statistics with respect to different architectural choices (pair-wise comparison).

On the same training corpus, we train the three different model architectures mentioned in Section 4, based on rnn, fconv and san respectively. To measure the similarity of the detected generalization barrier words between every two of them, we again use the overlap@$k$ metric. Their pair-wise overlap statistics are shown in Table 5. The detected barrier words appear to be very sensitive to the architectural choice, which might indicate that combining the best-practice architectures through ensemble methods could alleviate barriers that are specific to a certain architecture.

5.5 Potential usage of the barrier words

After obtaining an understanding of the generalization barrier words, in this part we present one potential usage of them, which we think could be relevant to automatically improving the translation performance of NMT systems through re-ranking (Yee et al., 2019). Table 6 shows that by randomly editing the barrier words to form a group of inputs, we can generate a collection of re-ranking hypotheses with higher oracle performance, better translation recall and diversity than the top-$k$ candidates from one single input, which is currently the common practice of re-ranking for NMT, i.e. using the top-$k$ scored beam search candidates as outputs. In fact, we find that the usual top-$k$ candidates are very similar to each other, and the oracle translation seems to be a paraphrased version of the highest model-scored one, which might be very hard for the re-ranking model to pick up. In contrast, the candidates generated by editing barriers can recall the parts of the source meaning that were actually translated incorrectly or left untranslated. Details of the measures used here are introduced in Appendix A.4.

| Task | Candidate | Oracle | Coverage | Diversity |
|---|---|---|---|---|
| Zh→En | top-$k$ | 39.40 | 78.22 | 61.78 |
| Zh→En | barrier | 42.78 (+3.38) | 83.88 (+5.66) | 57.21 (-4.75) |
| En→Zh | top-$k$ | 32.31 | 72.10 | 59.98 |
| En→Zh | barrier | 37.48 (+5.17) | 79.62 (+7.52) | 52.72 (-7.26) |

Table 6: The comparison of various properties of the re-ranking candidates generated through the traditional top-$k$ and our barrier-editing methods.

6 Conclusion and Future Work

In this paper, we identify and define a new phenomenon in NMT named generalization barriers through inference-aware counterfactual analyses. Simple approximate methods are investigated to better detect such generalization barrier words. After large-scale detection on held-out test sets, we find that barrier words are pervasive among different POS categories and are most prone to be function words. However, they are very different from previously identified trouble makers on the source side. Moreover, barrier words tend to be more context-sensitive and less universal. We can potentially alleviate them through ensembles of different architectures (Athiwaratkun et al., 2019) or by editing them to construct better re-ranking candidates (Yee et al., 2019). Future work involves a fundamental causal analysis of the emergence of this phenomenon, either intrinsically through the lens of the learned representation and the representation confounding effect (Li et al., 2019), or extrinsically through a compositionality study of the input.


  • D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. ArXiv abs/1606.06565. Cited by: §1.
  • B. Athiwaratkun, M. Finzi, P. Izmailov, and A. G. Wilson (2019) There are many consistent explanations of unlabeled data: why you should average. Cited by: §6.
  • C. Burns, J. Thomason, and W. Tansey (2019) Interpreting black box models with statistical guarantees. arXiv preprint arXiv:1904.00045. Cited by: §3.2.
  • V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM computing surveys (CSUR) 41 (3), pp. 15. Cited by: §1, §2.
  • C. Chang, E. Creager, A. Goldenberg, and D. Duvenaud (2018) Explaining image classifiers by counterfactual generation. Cited by: §3.1, §3.2.
  • Y. Cheng, L. Jiang, and W. Macherey (2019) Robust neural machine translation with doubly adversarial inputs. Cited by: §3.3.
  • H. Choi, E. Jang, and A. A. Alemi (2018) WAIC, but why? generative ensembles for robust anomaly detection. Cited by: §2.
  • H. Daumé III (2006) Practical structured learning techniques for natural language processing. Ph.D. Thesis, University of Southern California, Los Angeles, CA. Cited by: §3.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. In EMNLP. Cited by: §1.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1243–1252. Cited by: §1, §3, §4.
  • J. R. Gimenez and J. Zou (2019) Discovering conditionally salient features with statistical guarantees. Long Beach, California, USA, pp. 2290–2298. Cited by: §3.1.
  • S. He, Z. Tu, X. Wang, L. Wang, M. Lyu, and S. Shi (2019) Towards understanding neural machine translation with word importance. Hong Kong, China, pp. 953–962. Cited by: §2.
  • D. Hendrycks, S. Basart, M. Mazeika, M. Mostajabi, J. Steinhardt, and D. Song (2019) A benchmark for anomaly segmentation. arXiv preprint arXiv:1911.11132. Cited by: §2.
  • D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. Proceedings of International Conference on Learning Representations. Cited by: §1, §2.
  • V. Hodge and J. Austin (2004) A survey of outlier detection methodologies. Artificial intelligence review 22 (2), pp. 85–126. Cited by: §2.
  • R. Jia, A. Raghunathan, K. Göksel, and P. Liang (2019) Certified robustness to adversarial word substitutions. Hong Kong, China, pp. 4120–4133. Cited by: §1.
  • W. Lei, W. Xu, A. T. Aw, Y. Xiang, and T. S. Chua (2019) Revisit automatic error detection for wrong and missing translation – a supervised approach. Hong Kong, China, pp. 942–952. Cited by: §2.
  • K. Li, T. Zhang, and J. Malik (2019) Approximate feature collisions in neural nets. pp. 15816–15824. Cited by: §6.
  • S. Liang, Y. Li, and R. Srikant (2017) Enhancing the reliability of out-of-distribution image detection in neural networks. Cited by: §1, §2.
  • C. Lin and F. J. Och (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. Cited by: §A.4, §3.2.
  • M. Luong and C. D. Manning (2015) Stanford neural machine translation systems for spoken language domains. Cited by: §3, §4.
  • H. Mania, J. Miller, L. Schmidt, M. Hardt, and B. Recht (2019) Model similarity mitigates test set overuse. pp. 9993–10002. Cited by: §1.
  • M. Markou and S. Singh (2003) Novelty detection: a review – part 1: statistical approaches. Signal processing 83 (12), pp. 2481–2497. Cited by: §2.
  • J. Mazurek (2011) Evaluation of ranking similarity in ordinal ranking problems. Cited by: §A.2, §5.1.
  • B. Mohit and R. Hwa (2007) Localization of difficult-to-translate phrases. Prague, Czech Republic, pp. 248–255. Cited by: §2.
  • N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli, and S. Edunov (2019) Facebook FAIR’s WMT19 news translation task submission. arXiv preprint arXiv:1907.06616. Cited by: §1.
  • M. Ott, S. Edunov, D. Grangier, and M. Auli (2018) Scaling neural machine translation. In WMT. Cited by: §1.
  • J. Pearl and D. Mackenzie (2018) The book of why: the new science of cause and effect. 1st edition, Basic Books, Inc., New York, NY, USA. ISBN 046509760X, 9780465097609. Cited by: §1.
  • B. Recht, R. Roelofs, L. Schmidt, and V. Shankar (2019) Do ImageNet classifiers generalize to ImageNet?. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5389–5400. Cited by: §1.
  • J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan (2019) Likelihood ratios for out-of-distribution detection. pp. 14680–14691. Cited by: §2.
  • H. Schwenk, V. Chaudhary, S. Sun, H. Gong, and F. Guzmán (2019a) WikiMatrix: mining 135M parallel sentences in 1620 language pairs from Wikipedia. arXiv preprint arXiv:1907.05791. Cited by: §5.3.
  • H. Schwenk, G. Wenzek, S. Edunov, E. Grave, and A. Joulin (2019b) CCMatrix: mining billions of high-quality parallel sentences on the web. arXiv preprint arXiv:1911.04944. Cited by: §5.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §3, §4.
  • Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and L. S. Chao (2019) Learning deep transformer models for machine translation. arXiv preprint arXiv:1906.01787. Cited by: §1.
  • T. Wu, M. T. Ribeiro, J. Heer, and D. Weld (2019) Errudite: scalable, reproducible, and testable error analysis. Florence, Italy, pp. 747–763. Cited by: §2.
  • K. Yee, Y. Dauphin, and M. Auli (2019) Simple and effective noisy channel modeling for neural machine translation. Hong Kong, China, pp. 5700–5705. Cited by: §1, §5.5, §6.
  • B. Zhang, I. Titov, and R. Sennrich (2019) Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 898–909. Cited by: §1.
  • Y. Zhao, J. Zhang, Z. He, C. Zong, and H. Wu (2018) Addressing troublesome words in neural machine translation. pp. 391–400. Cited by: §A.3, §A.3, §1, §2, §5.2.2.
  • Y. Zhao, J. Zhang, C. Zong, Z. He, H. Wu, et al. (2019) Addressing the under-translation problem from the entropy perspective. Cited by: §A.3, §A.3, §1, §2, §5.2.2.

Appendix A Appendices

A.1 Mean performance

| Task | Model | Train | Dev. | Test Avg. |
|---|---|---|---|---|
| ZhEn | rnn | 35.02 | 41.02 | 37.73 |
| ZhEn | fconv | 40.02 | 45.58 | 43.04 |
| ZhEn | san | 38.46 | 47.85 | 45.17 |
| EnZh | rnn | 39.12 | 22.57 | 16.61 |
| EnZh | fconv | 40.91 | 24.96 | 18.43 |
| EnZh | san | 41.67 | 26.31 | 19.50 |

Table 7: Average-sense generalization performance on the NIST benchmarks, measured by BLEU. Note that Train is measured with a single reference, while Dev. is measured with four references for the ZhEn task, so for rnn, Dev. can surpass Train.

A.2 Evaluation metrics

overlap@k The first metric we use for evaluating the accuracy of the estimated risk is the overlap@k metric (Dong et al., 2018). Each source word is annotated with a risk obtained by exactly or approximately generating counterfactuals, and these risks induce a ranking among the source words. According to our Definition 3.2, the top-k risky words are treated as generalization barrier words. So, given two rankings of the same input, we can take the top-k risky words of each and measure how much they overlap. Formally, given two ranked lists of the words of the input x, induced by two lists of risks, the overlap@k metric compares their respective sets of top-k risky words.
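As a minimal sketch of how such a metric could be computed, the Python snippet below takes two lists of per-word risks for the same sentence and returns the share of top-k positions they have in common. The function names and the normalization by k are assumptions made for illustration; the text above only states that the two top-k sets are compared.

```python
from typing import Sequence, Set


def top_k_indices(risks: Sequence[float], k: int) -> Set[int]:
    """Positions of the k largest risks (ties broken by earlier position)."""
    order = sorted(range(len(risks)), key=lambda i: (-risks[i], i))
    return set(order[:k])


def overlap_at_k(risks_a: Sequence[float], risks_b: Sequence[float], k: int) -> float:
    """Share of top-k risky positions common to two risk annotations."""
    if len(risks_a) != len(risks_b):
        raise ValueError("both risk lists must annotate the same sentence")
    if not 1 <= k <= len(risks_a):
        raise ValueError("k must lie between 1 and the sentence length")
    shared = top_k_indices(risks_a, k) & top_k_indices(risks_b, k)
    return len(shared) / k  # assumption: normalize the intersection size by k


# Hypothetical risks for a four-word source sentence.
exact_risks = [0.9, 0.1, 0.5, 0.3]
approx_risks = [0.8, 0.4, 0.2, 0.6]
print(overlap_at_k(exact_risks, approx_risks, k=2))  # 0.5: only word 0 is shared
```

Working on word positions rather than word strings keeps repeated words in a sentence distinct.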


Kendall’s coefficient of concordance The second metric, used for evaluating rank stability (variance), is Kendall’s coefficient of concordance (Mazurek, 2011). It is computed from a set of rankings over a common set of objects. In our setting, the number of rankings is 25, corresponding to the 25 repeats of the simulation, and the number of objects is the source sentence length, since each ranking covers all the source words.
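For illustration, a common textbook form of Kendall's coefficient of concordance for m rankings of n objects without ties is W = 12 S / (m^2 (n^3 - n)), where S is the sum of squared deviations of the per-object rank sums from their mean. The sketch below implements this form; the symbols m, n, S and the function name are assumptions introduced here, and the variant used in the experiments (for example, with a tie correction) may differ.

```python
from typing import Sequence


def kendalls_w(rankings: Sequence[Sequence[int]]) -> float:
    """Kendall's W for m rankings of n objects, each a permutation of 1..n."""
    m = len(rankings)
    if m < 2:
        raise ValueError("need at least two rankings")
    n = len(rankings[0])
    if n < 2:
        raise ValueError("need at least two objects")
    expected = list(range(1, n + 1))
    if any(sorted(r) != expected for r in rankings):
        raise ValueError("each ranking must be a tie-free permutation of 1..n")
    # Rank sum of every object across the m rankings.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((total - mean_rank_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


print(kendalls_w([[1, 2, 3], [1, 2, 3]]))  # 1.0: perfect agreement
print(kendalls_w([[1, 2, 3], [3, 2, 1]]))  # 0.0: rank sums are all equal
```

In the setting described above, `rankings` would hold 25 rankings of the source words, one per repeat of the simulation.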

A.3 Definition of troublesome words

In Section 5.2.2, we measure the similarity between our identified generalization barrier words and the previously proposed under-translated words (Zhao et al., 2019) and troublesome words (Zhao et al., 2018). Here, we describe their definitions in detail.

Under-translated words An under-translated word (Zhao et al., 2019) is defined as a word whose translation entropy is larger than a certain threshold. Each word's translation entropy is calculated from its translation probabilities, which are count-based estimates from word alignments of the training set obtained with a statistical word aligner, e.g., fast_align (Dyer et al., 2013). That is, for each source word, the entropy is taken over the distribution of its aligned target words, whose probabilities are relative alignment counts. We can therefore use the translation entropy of each word to annotate every word of each source sentence with a global risk.
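The count-based estimate described here can be sketched as follows, assuming the aligner output has already been flattened into (source word, target word) pairs. The function name, the input format, and the use of the natural logarithm are assumptions for illustration; reading fast_align output files is not shown.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def translation_entropy(aligned_pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Entropy of the count-based translation distribution of each source word."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for src, tgt in aligned_pairs:
        counts[src][tgt] += 1

    entropy: Dict[str, float] = {}
    for src, tgt_counts in counts.items():
        total = sum(tgt_counts.values())
        # p(tgt | src) is the relative alignment count; natural log is an assumption.
        entropy[src] = sum(
            -(c / total) * math.log(c / total) for c in tgt_counts.values()
        )
    return entropy


# Hypothetical alignment pairs: "bank" is aligned to two different target words.
pairs = [("bank", "money"), ("bank", "river"), ("cat", "chat")]
risk = translation_entropy(pairs)
print(risk["bank"] > risk["cat"])  # True: the ambiguous word has higher entropy
```

Each source sentence can then be annotated by looking up every word in the returned dictionary and comparing the value with a chosen threshold.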

Troublesome words A troublesome word (Zhao et al., 2018) is defined as a word that satisfies a certain exception condition, which is measured through an exception rate computed from two counts. The first is the number of alignment pairs involving the word across the whole corpus, obtained with fast_align as well; the second is the number of exception alignment pairs, in which the word has violated certain conditions. Zhao et al. (2018) propose three exception conditions that result in similar performance, so we use only one of them in our experiments: the word probability falls below a certain threshold. As with under-translated words, we use the exception rate to label each source word.
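A minimal sketch of this bookkeeping is given below. It assumes that the exception rate is the ratio of exception pairs to all alignment pairs of a word, and that each alignment pair comes with the probability assigned to the aligned target word; both the ratio form and the input format are assumptions for illustration, as are the names used.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def exception_rate(
    aligned_probs: Iterable[Tuple[str, float]], threshold: float
) -> Dict[str, float]:
    """Share of a source word's alignment pairs whose probability is below threshold."""
    total: Counter = Counter()
    exceptions: Counter = Counter()
    for src, prob in aligned_probs:
        total[src] += 1
        if prob < threshold:  # the single exception condition used in the text
            exceptions[src] += 1
    return {src: exceptions[src] / total[src] for src in total}


# Hypothetical (source word, probability of its aligned target word) observations.
observations = [("a", 0.9), ("a", 0.1), ("b", 0.8)]
print(exception_rate(observations, threshold=0.5))  # {'a': 0.5, 'b': 0.0}
```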

A.4 Measures for evaluating the re-ranking candidates

In Section 5.5, we use three measures to characterize two kinds of candidates: those generated by top-1 beam search from several sources randomly edited at their barrier words, and the commonly used top-k beam search results from the original source input. Here, we describe those measures in detail. We treat the hypothesis candidates generated from source-editing top-1 beam search and from top-k beam search as two separate collections. For a fair comparison, the two collections have the same size.

Oracle Given the reference and a set of candidates, the oracle value of the set is the highest BLEU score that any of its candidates achieves against the reference, where BLEU denotes the sentence-level smoothed BLEU (Lin and Och, 2004) used in all our experiments. The larger the oracle value, the better the candidates.
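The oracle measure can be sketched as taking the best sentence-level score over the candidate set. The smoothed BLEU below uses a simple add-one smoothing on the higher-order n-gram precisions, in the spirit of Lin and Och (2004); the exact smoothing variant, the 0 to 1 score scale, and all function names are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def smoothed_bleu(hyp: Sequence[str], ref: Sequence[str], max_n: int = 4) -> float:
    """Sentence-level BLEU in [0, 1] with add-one smoothing for n > 1."""
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        matched = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        if n == 1:
            if matched == 0:
                return 0.0  # no word overlap at all
            precision = matched / total
        else:
            precision = (matched + 1) / (total + 1)  # assumed smoothing variant
        log_precision += math.log(precision) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_precision)


def oracle(candidates: Sequence[Sequence[str]], ref: Sequence[str]) -> float:
    """Best sentence-level score that any candidate reaches against the reference."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(smoothed_bleu(c, ref) for c in candidates)


reference = "the cat sat down".split()
pool = ["a dog ran".split(), "the cat".split(), "the cat sat down".split()]
print(oracle(pool, reference))  # 1.0, since one candidate equals the reference
```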

Coverage Given the reference and a set of candidates, the coverage value of the set is computed from the distinct 1-grams of each sentence, and measures how many of the distinct 1-grams of the reference are recalled by the candidates. The larger the coverage value, the better the candidates.
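One way to realize such a recall-style measure is sketched below: the distinct 1-grams of all candidates are pooled, and the share of distinct reference 1-grams they contain is reported. Pooling by set union and normalizing by the number of distinct reference 1-grams are assumptions for illustration.

```python
from typing import Sequence


def coverage(candidates: Sequence[Sequence[str]], ref: Sequence[str]) -> float:
    """Share of distinct reference 1-grams found in at least one candidate."""
    ref_unigrams = set(ref)
    if not ref_unigrams:
        raise ValueError("reference must not be empty")
    # Distinct 1-grams pooled over the whole candidate set.
    pooled = set().union(*(set(c) for c in candidates))
    return len(pooled & ref_unigrams) / len(ref_unigrams)


reference = "the cat sat down".split()
pool = ["the cat".split(), "a cat sat".split()]
print(coverage(pool, reference))  # 0.75: "down" is never produced
```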

Diversity Given a set of candidates, the diversity value of the set is the average of the sentence-level smoothed BLEU scores computed between any two of its candidates. That is, we use the sentence-level smoothed BLEU to compare any two candidates and average over all pairs. The smaller the diversity value, the better the candidates.
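The pairwise averaging can be sketched as follows. To keep the example short, the similarity function is a parameter, and a unigram Jaccard score is used as a plainly labeled stand-in for the sentence-level smoothed BLEU; averaging over ordered pairs of distinct candidates is likewise an assumption, made because BLEU is not symmetric.

```python
from itertools import permutations
from typing import Callable, Sequence

Sentence = Sequence[str]


def diversity(
    candidates: Sequence[Sentence],
    similarity: Callable[[Sentence, Sentence], float],
) -> float:
    """Average pairwise similarity; smaller values mean more diverse candidates."""
    pairs = list(permutations(candidates, 2))  # ordered pairs of distinct positions
    if not pairs:
        raise ValueError("need at least two candidates")
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def unigram_jaccard(a: Sentence, b: Sentence) -> float:
    """Stand-in similarity (NOT BLEU): Jaccard overlap of distinct words."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


pool = ["the cat sat".split(), "the cat sat".split()]
print(diversity(pool, unigram_jaccard))  # 1.0: identical candidates, no diversity
```

Passing a sentence-level smoothed BLEU function as `similarity` would give the measure described above.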