Evaluating Explanation Methods for Neural Machine Translation

by Jierui Li, et al.

Recently, many efforts have been devoted to interpreting black-box NMT models, but little progress has been made on metrics to evaluate explanation methods. Word Alignment Error Rate can be used as a metric that matches human understanding; however, it cannot measure explanation methods on target words that are not aligned to any source word. This paper therefore makes an initial attempt to evaluate explanation methods from an alternative viewpoint. To this end, it proposes a principled metric based on fidelity with regard to the predictive behavior of the NMT model. As the exact computation of this metric is intractable, we employ an efficient approach as its approximation. On six standard translation tasks, we quantitatively evaluate several explanation methods in terms of the proposed metric, and our experiments reveal some valuable findings for these explanation methods.



1 Introduction

Neural machine translation (NMT) has witnessed great success during recent years Sutskever et al. (2014); Bahdanau et al. (2014); Gehring et al. (2017); Vaswani et al. (2017). One of the main reasons is that neural networks possess the powerful ability to model sufficient context by entangling all source words and target words from translation history. The downside, however, is poor interpretability: it is unclear which specific words from the entangled context are crucial for NMT to make a translation decision. As interpretability is important for understanding and debugging the translation process, and particularly for further improving NMT models, many efforts have been devoted to explanation methods for NMT Ding et al. (2017); Alvarez-Melis and Jaakkola (2017); Li et al. (2019); Ding et al. (2019); He et al. (2019). However, little progress has been made on evaluation metrics to study how good these explanation methods are and which method is better than others for NMT.

Generally speaking, we recognize two orthogonal dimensions for evaluating explanation methods: i) how well the pattern (such as source words) extracted by an explanation method matches human understanding of predicting a target word; or ii) how well the pattern matches the predictive behavior of the NMT model on predicting a target word. In terms of i), Word Alignment Error Rate (AER) can be used as a metric to evaluate an explanation method by measuring agreement between human-annotated word alignment and that derived from the explanation method. However, AER cannot measure explanation methods on target words that are not aligned to any source word according to human annotation.

In this paper, we therefore make an initial attempt to measure explanation methods for NMT according to the second dimension of interpretability, which covers all target words. The key to our approach can be highlighted as fidelity: when extracting the most relevant words with an explanation method, if those relevant words have the potential to construct an optimal proxy model that agrees well with the NMT model on making a translation decision, then this explanation method is good (§3). To this end, we formalize a principled evaluation metric as an optimization problem over the expected disagreement between the optimal proxy model and the NMT model (§3.1). Since it is intractable to exactly calculate the principled metric for a given explanation method, we propose an approximate metric to address the optimization problem. Specifically, inspired by statistical learning theory Vapnik (1999), we cast the optimization problem into a standard machine learning problem which is addressed with a two-step strategy: first, we follow empirical risk minimization to optimize the empirical risk; then we validate the optimized parameters on a held-out test dataset. Moreover, we construct different proxy model architectures that use the most relevant words to make a translation decision, leading to several variants of the approximate metric in implementation (§3.2).

We apply the approximate metric to evaluate four explanation methods including attention Bahdanau et al. (2014); Vaswani et al. (2017), gradient norm Li et al. (2016), weighted gradient Ding et al. (2019) and prediction difference Li et al. (2019). We conduct extensive experiments on six standard translation tasks for two popular translation models in terms of the proposed evaluation metric. Our experiments reveal valuable findings for these explanation methods: 1) the explanation methods gradient norm and prediction difference are good at interpreting the behavior of NMT; 2) prediction difference performs better than the other methods.

This paper makes the following contributions:

  • It presents an attempt at evaluating the explanation methods for neural machine translation from a new viewpoint of fidelity.

  • It proposes a principled metric for evaluation, and to put it into practice it derives a simple yet efficient approach to approximately calculate the metric.

  • It quantitatively compares several different explanation methods and evaluates their effects in terms of the proposed metric.

2 NMT and Explanation Methods

2.1 NMT Models

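Suppose $x = \{x_1, \dots, x_{|x|}\}$ denotes a source sentence with length $|x|$ and $y = \{y_1, \dots, y_{|y|}\}$ is a target sentence. Most NMT literature models the following conditional probability $P(y \mid x)$ in an encoder-decoder fashion:

$$P(y \mid x) = \prod_{t} P(y_t \mid y_{<t}, x) = \prod_{t} P(y_t \mid s_t) \qquad (1)$$

where $y_{<t} = \{y_1, \dots, y_{t-1}\}$ denotes a prefix of $y$ with length $t-1$, and $s_t$ is the decoding state vector of timestep $t$. In the encoding stage, the encoder of an NMT model transforms the source sentence $x$ into a sequence of hidden vectors $h = \{h_1, \dots, h_{|x|}\}$. In the decoding stage, the decoder module summarizes the hidden vectors $h$ and the history decoding states $s_{<t} = \{s_1, \dots, s_{t-1}\}$ into the decoding state vector $s_t$. In this paper, we consider two popular NMT translation architectures, Rnn-Search Bahdanau et al. (2014) and Transformer Vaswani et al. (2017). Rnn-Search utilizes a bidirectional RNN to define $h$ and it computes $s_t$ by the attention function over $h$, i.e.,

$$s_t = \mathrm{Attn}(s_{t-1}, h) \qquad (2)$$

where $\mathrm{Attn}$ is the attention function, which is defined as follows:

$$\mathrm{Attn}(q, v) = \sum_{i} \alpha(q, v_i)\, v_i, \qquad \alpha(q, v_i) = \frac{\exp\big(e(q, v_i)\big)}{\sum_{j} \exp\big(e(q, v_j)\big)} \qquad (3)$$

where $q$ and $v_i$ are vectors, $e$ is a similarity function over a pair of vectors and $\alpha$ is its normalized function.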
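As a rough illustration of the attention function described above, the following NumPy sketch computes softmax-normalized similarity weights and the resulting weighted sum of value vectors. The dot-product similarity and the names `attn`, `q` and `v` are assumptions made for this example; the NMT systems in the paper use their own learned similarity functions.

```python
import numpy as np

def attn(q, v):
    """Return the attention output and weights for query q over rows of v."""
    scores = v @ q                     # similarity e(q, v_i); dot product is an assumption
    scores = scores - scores.max()     # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # normalized weights
    return alpha @ v, alpha            # weighted sum of the value vectors

# Toy query and three value vectors (illustrative numbers only).
q = np.array([1.0, 0.0])
v = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
context, alpha = attn(q, v)
```

The weights `alpha` are what an attention-based explanation method would read off as relevance scores.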

Different from Rnn-Search, which relies on an RNN, Transformer employs an attention network to define $h$, and two additional attention networks to define $s_t$ as follows:

$$s_t = \mathrm{Attn}(s_{t+\frac{1}{2}}, h), \qquad s_{t+\frac{1}{2}} = \mathrm{Attn}(s_{t-1}, s_{<t}) \qquad (4)$$

Footnote 1: Due to space limitations, we present the notation for single-layer NMT models, and for Transformer we only keep the attention block (with a single head) while skipping other blocks such as residual connections and layer normalization. More details can be found in Vaswani et al. (2017).

2.2 Explanation Methods

In this section, we describe several popular explanation methods that will be evaluated with our proposed metric. Suppose $c_t = \{y_{<t}, x\}$ denotes the context at timestep $t$, and $w$ (or $w'$) denotes either a source or a target word in the context $c_t$. According to Poerner et al. (2018), each explanation method for NMT can be regarded as a word relevance score function $\phi(w; y_t, c_t)$, where $\phi(w; y_t, c_t) > \phi(w'; y_t, c_t)$ indicates that $w$ is more useful for the translation decision $y_t$ than word $w'$.


Attention

Since Bahdanau et al. (2014) proposed the attention mechanism for NMT, it has been the most popular explanation method for NMT Tu et al. (2016); Mi et al. (2016); Liu et al. (2016); Zenkel et al. (2019).

To interpret Rnn-Search and Transformer, we define different $\phi$ for them based on attention. For Rnn-Search, since attention is only defined on the source side, $\phi$ can be defined only for the source words:

$$\phi(x_i; y_t, c_t) = \alpha(s_{t-1}, h_i)$$

where $\alpha$ is the attention weight defined in Eq.(3), and $s_{t-1}$ is the decoding state of Rnn-Search defined in Eq.(2). In contrast, Transformer defines the attention on both sides and thus $\phi$ is not constrained to source words:

$$\phi(w; y_t, c_t) = \begin{cases} \alpha(s_{t+\frac{1}{2}}, h_i) & \text{if } w = x_i \\ \alpha(s_{t-1}, s_j) & \text{if } w = y_j \end{cases}$$

where $s_{t-1}$ and $s_{t+\frac{1}{2}}$ are defined in Eq.(4).


Gradient

Different from attention, which is restricted to a specific family of networks, explanation methods based on gradients are more general. Suppose $g(w, y_t)$ denotes the gradient of $P(y_t \mid y_{<t}, x)$ w.r.t. the variable $w$ in $c_t$:

$$g(w, y_t) = \frac{\partial P(y_t \mid y_{<t}, x)}{\partial w} \qquad (5)$$

where $\partial w$ denotes the gradient w.r.t. the embedding of the word $w$, since a word itself is discrete and its gradient cannot be taken directly. Therefore, $g(w, y_t)$ returns a vector with the same shape as the embedding of $w$. In this paper, we implement two different gradient-based explanation methods and derive different definitions of $\phi$ as follows.

  • Gradient Norm Li et al. (2016): The first definition of $\phi$ is the norm of $g$: $\phi(w; y_t, c_t) = \lVert g(w, y_t) \rVert$.

  • Weighted Gradient Ding et al. (2019): The second one is defined as the weighted sum of the embedding of $w$, with the return of $g$ as the weight: $\phi(w; y_t, c_t) = g(w, y_t)^{\top} w$, where $w$ here stands for the embedding of the word.

It is worth noting that for each sentence $x$, one has to independently calculate $g(w, y_t)$ for each timestep $t$. Therefore, one has to calculate the gradient $|y|$ times for each sentence. In contrast, training NMT only requires a sentence-level gradient, which is computed once thanks to gradient accumulation in the back-propagation algorithm.
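The two gradient-based scores can be sketched as follows. This is a minimal illustration, assuming a toy differentiable probability function in place of a real NMT model, finite differences in place of back-propagation, and the L1 norm for the gradient norm; `toy_prob`, `u` and `a` are hypothetical names introduced only for this example.

```python
import numpy as np

def numerical_grad(prob_fn, emb, i, eps=1e-5):
    """Central-difference gradient of prob_fn w.r.t. the i-th word embedding."""
    grad = np.zeros(emb.shape[1])
    for d in range(emb.shape[1]):
        plus, minus = emb.copy(), emb.copy()
        plus[i, d] += eps
        minus[i, d] -= eps
        grad[d] = (prob_fn(plus) - prob_fn(minus)) / (2 * eps)
    return grad

def gradient_norm(grad):
    """Relevance as the norm of the gradient (L1 norm assumed here)."""
    return float(np.abs(grad).sum())

def weighted_gradient(grad, word_emb):
    """Relevance as the inner product of the gradient and the word embedding."""
    return float(grad @ word_emb)

# Hypothetical stand-in for P(y_t | context): a sigmoid over weighted embeddings.
u = np.array([1.0, -2.0])
a = np.array([0.5, 2.0, 0.1])

def toy_prob(emb):
    return 1.0 / (1.0 + np.exp(-(a @ (emb @ u))))

emb = np.array([[0.1, 0.2], [0.3, -0.1], [0.5, 0.5]])  # three context words
grads = [numerical_grad(toy_prob, emb, i) for i in range(len(emb))]
ngrad = [gradient_norm(g) for g in grads]
wgrad = [weighted_gradient(g, emb[i]) for i, g in enumerate(grads)]
```

Note that one gradient is computed per context word for a single target position, which mirrors the per-timestep cost discussed above.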

Prediction Difference

Li et al. (2019) propose a prediction difference (Pd) method, which defines the contribution of the word $w$ by evaluating the change in the probability after removing $w$ from $c_t$. Formally, $\phi(w; y_t, c_t)$ based on prediction difference is defined as follows:

$$\phi(w; y_t, c_t) = P(y_t \mid c_t) - P(y_t \mid c_t \setminus w)$$

where $P(y_t \mid c_t) = P(y_t \mid y_{<t}, x)$ is the NMT probability of $y_t$ defined in Eq.(1), and $P(y_t \mid c_t \setminus w)$ denotes the NMT probability of $y_t$ after excluding $w$ from its context $c_t$. To achieve the effect of excluding $w$ from $c_t$, the method simply replaces the word embedding of $w$ with a zero vector before feeding it into the NMT model.
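A minimal sketch of prediction difference under the zero-embedding ablation described above. The probability function `toy_prob` is a hypothetical stand-in for an NMT model, introduced only so that the example runs on its own.

```python
import numpy as np

def prediction_difference(prob_fn, emb, i):
    """Drop in probability when the i-th word embedding is replaced by zeros."""
    ablated = emb.copy()
    ablated[i] = 0.0                      # "remove" word i from the context
    return float(prob_fn(emb) - prob_fn(ablated))

# Hypothetical stand-in for P(y_t | context), not a real NMT model.
u = np.array([1.0, -2.0])
a = np.array([0.5, 2.0, 0.1])

def toy_prob(emb):
    return 1.0 / (1.0 + np.exp(-(a @ (emb @ u))))

emb = np.array([[0.1, 0.2], [0.3, -0.1], [0.5, 0.5]])
scores = [prediction_difference(toy_prob, emb, i) for i in range(len(emb))]
```

A positive score means the word supports the prediction; a negative score means that removing the word would raise the predicted probability.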

3 Evaluation Methodology

3.1 Principled Metric

The key to our metric is as follows: for an explanation method $\phi$ to be good in terms of our metric, the relevant words selected by $\phi$ from the context $c_t$ should have the potential to construct an optimal model that exhibits similar behavior to the target model $P(y_t \mid c_t)$. To formalize this metric, we first specify some necessary notation.

Assume that $f(c_t)$ is the target word predicted by $P$, i.e., $f(c_t) = \arg\max_{y} P(y \mid c_t)$. In addition, let $W^k_\phi(c_t)$ be the top-$k$ relevant words on the source side and target side of the context $c_t$:

$$W^k_\phi(c_t) = \mathrm{Top}^k_\phi(x) \cup \mathrm{Top}^k_\phi(y_{<t})$$

where $\cup$ denotes the union of two sets, and $\mathrm{Top}^k_\phi(\cdot)$ returns the $k$ words corresponding to the largest $\phi$ values.

Footnote 2: In fact, $W^k_\phi(c_t)$ together with the predicted word $f(c_t)$ can be considered as generalized translation rules obtained by $\phi$. In other words, the rules are extracted under teacher forcing decoding. In particular, if $k = 1$, this is similar to statistical machine translation (SMT) with word-level rules Koehn (2009), except that a generalized translation rule also involves a word from $y_{<t}$, which simulates the role of language modeling in SMT.
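The selection of the most relevant words can be sketched as below. The helper names `top_k` and `relevant_words`, the example sentence and the relevance scores are all made up for illustration; in practice the scores would come from one of the explanation methods in §2.2.

```python
def top_k(words, scores, k):
    """Return the k words with the largest relevance scores."""
    ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def relevant_words(src, src_scores, tgt_prefix, tgt_scores, k):
    """Top-k source words together with top-k target-history words."""
    return {
        "src": top_k(src, src_scores, k),
        "tgt": top_k(tgt_prefix, tgt_scores, k),
    }

# Illustrative context for predicting the next target word.
selected = relevant_words(
    src=["das", "ist", "gut"], src_scores=[0.1, 0.2, 0.7],
    tgt_prefix=["this", "is"], tgt_scores=[0.3, 0.6],
    k=1,
)
```

With `k=1`, the selected pair plus the predicted target word forms one word-level rule in the sense of the footnote above.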

In addition, suppose $Q(y \mid W^k_\phi(c_t); \theta)$ ($Q$ or $Q(\theta)$ for brevity) is a proxy model that makes a translation decision on top of $W^k_\phi(c_t)$ rather than the entire context $c_t$ like a standard NMT model. Formally, we define a principled metric as follows:

Definition 1

The metric of $\phi$ is defined by

$$\min_{Q} \; -\mathbb{E}_{c_t}\Big[\log Q\big(f(c_t) \mid W^k_\phi(c_t)\big)\Big] \qquad (6)$$

where $\mathbb{E}_{c_t}$ denotes the expectation with respect to the data distribution of $c_t$, and $Q$ is minimized over all possible proxy models.

The underlying idea of the above metric is to measure the expectation of the disagreement between an optimal proxy model $Q$ constructed from $\phi$ and the NMT model $P$. Here the disagreement is measured by the negative log-likelihood of $Q$ over the data whose label is generated from $P$.

Footnote 3: It is natural to extend our definition by using other similar disagreement measures such as the KL distance. Since the KL distance requires additional GPU memory to store the distribution in the implementation, we employ the negative log-likelihood for efficiency in our experiments.

Definition of Fidelity

The metric of $\phi$ actually defines fidelity by measuring how much the optimal proxy model defined on $W^k_\phi(c_t)$ disagrees with $P$. The notion of fidelity is widely used in model compression Buciluǎ et al. (2006); Polino et al. (2018), model distillation Hinton et al. (2015); Liu et al. (2018), and particularly in evaluating explanation models for black-box neural networks Lakkaraju et al. (2016); Bastani et al. (2017). These works focus on learning a specific model $Q$ on which fidelity can be directly defined. However, we are interested in evaluating explanation methods $\phi$, where $Q$ is a latent variable that we have to minimize over. By doing this, fidelity in our metric is defined on $\phi$ as shown in Eq.(6).

3.2 Approximation

Generally, it is intractable to exactly calculate the principled metric due to two main challenges. On the one hand, the real data distribution of $c_t$ is unknowable, making it impossible to exactly define the expectation with respect to an unknown distribution. On the other hand, the domain of a proxy model $Q$ is not bounded, and it is difficult to minimize over an unbounded domain.

Empirical Risk Minimization

Inspired by statistical learning theory Vapnik (1999), we calculate the expected disagreement over $c_t$ by a two-step strategy: we minimize the empirical risk to obtain an optimized $Q$ for a given $\phi$; and then we estimate the risk defined on a held-out test set by using the optimized $Q$. In this way, we cast the principled metric into a standard machine learning task.

For a given model architecture $Q$, to optimize $\theta$, we first collect the training set as $\{\langle W^k_\phi(c_t), f(c_t) \rangle\}$ for each sentence pair $\langle x, y \rangle$ at every time step $t$, where $\langle x, y \rangle$ is a sentence pair from a given bilingual corpus $D_{train}$. Then we optimize $\theta$ by empirical risk minimization:

$$\min_{\theta} \sum_{\langle W^k_\phi(c_t),\, f(c_t) \rangle} -\log Q\big(f(c_t) \mid W^k_\phi(c_t); \theta\big) \qquad (7)$$


Proxy Model Selection

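In response to the second challenge of the unbounded domain, we define a surrogate distribution family $\mathcal{Q}$, and then approximately calculate Eq.(6) within $\mathcal{Q}$ instead:

$$\min_{Q \in \mathcal{Q}} \; -\mathbb{E}_{c_t}\Big[\log Q\big(f(c_t) \mid W^k_\phi(c_t)\big)\Big] \qquad (8)$$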


We consider three different proxy models: a multi-layer feedforward network (FN), a recurrent network (RN) and a self-attention network (SA). In detail, for each network $N \in \{\mathrm{FN}, \mathrm{RN}, \mathrm{SA}\}$, the proxy model predicts the target word from a decoding state that $N$ computes over the selected words $W^k_\phi(c_t)$. For the feedforward network, the decoding state is computed directly from the selected source- and target-side words. For $N \in \{\mathrm{RN}, \mathrm{SA}\}$, the decoding state is computed by attention: the query of the initial state attends over position-aware representations of the selected source- and target-side words, which are generated by the encoder of RN or SA as defined in Eq.(3) and Eq.(4). For RN, these representations are the weighted-sum vectors of a bidirectional LSTM over all selected top-$k$ source and target words; for SA, they are the weighted sums of vectors over the SA networks.

3.3 Evaluation Paradigm

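Input: $\phi$, $P$, $D_{train}$, $D_{test}$
Output: the metric score of $\phi$ over $D_{test}$
  Collect $\langle W^k_\phi(c_t), f(c_t) \rangle$ from $D_{train}$ and $D_{test}$ to obtain two sets $S_{train}$ and $S_{test}$
  for each architecture $N$ in the proxy family $\mathcal{Q}$ do
     Optimize $\theta_N$ over $S_{train}$ w.r.t. Eq.(7)
     Add $Q_N(\theta_N)$ into the list of trained proxy models
  end for
  for each trained proxy model $Q_N(\theta_N)$ do
     for each $\langle W^k_\phi(c_t), f(c_t) \rangle$ in $S_{test}$ do
        Accumulate the negative log-likelihood of $f(c_t)$ under $Q_N(\theta_N)$
     end for
  end for
  Return the lowest PPL among the trained proxy models
Algorithm 1 Calculating the evaluation metric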

Given a bilingual training set $D_{train}$ and a bilingual test set $D_{test}$, we evaluate an explanation method $\phi$ w.r.t. the NMT model $P$ by setting the proxy model family $\mathcal{Q}$ to include the three neural networks defined before. Following the standard process of addressing a machine learning problem, Algorithm 1 summarizes the procedure to approximately calculate the metric of $\phi$ on the test dataset $D_{test}$, which returns the perplexity (PPL) on $D_{test}$.

Footnote 4: Note that PPL is the exponential of the average negative log-likelihood in Eq.(6), so the two are monotonically related, and thus we use PPL as the metric value in this paper.
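The relation between negative log-likelihood and perplexity can be made concrete with a short sketch. The function name `perplexity` and the toy probabilities are assumptions for illustration only.

```python
import math

def perplexity(log_probs):
    """Exponential of the average negative log-likelihood."""
    nll = -sum(log_probs) / len(log_probs)
    return math.exp(nll)

# A proxy that assigns probability 0.5 to every NMT-predicted word has PPL 2.
ppl_half = perplexity([math.log(0.5)] * 4)

# A more faithful proxy (higher probability on the NMT prediction) has lower PPL.
ppl_good = perplexity([math.log(0.9)])
ppl_bad = perplexity([math.log(0.1)])
```

Because the exponential is monotonic, ranking explanation methods by PPL gives the same order as ranking them by average negative log-likelihood.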

In this paper, we try four different choices to specify the surrogate family, i.e., $\{\mathrm{FN}\}$, $\{\mathrm{RN}\}$, $\{\mathrm{SA}\}$, and $\{\mathrm{FN}, \mathrm{RN}, \mathrm{SA}\}$, leading to four instances of our metric respectively denoted as FN, RN, SA and Comb. In addition, as the baseline metric, we employ the well-trained NMT model $P$ as the proxy model by masking out the input words that do not appear in the rule set $W^k_\phi(c_t)$. The baseline metric does not require training any parameters $\theta$ and is only evaluated on $D_{test}$. Since $P$ is trained with the entire context $c_t$ whereas it is tested on $W^k_\phi(c_t)$, this mismatch may lead to poor performance, and the baseline is thus less trusted. This baseline metric extends the idea of Arras et al. (2016); Denil et al. (2014) from classification tasks to structured prediction tasks like machine translation, which are highly dependent on context rather than just keywords.
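The evaluation paradigm can be sketched end to end with toy count-based proxies standing in for the FN, RN and SA networks. Everything named here (`UnigramProxy`, `LookupProxy`, `metric_score`, the add-one smoothing and the tiny dataset) is a hypothetical simplification for illustration, not the neural proxies used in the paper.

```python
import math
from collections import Counter, defaultdict

class UnigramProxy:
    """Toy proxy that ignores the selected words (a weak stand-in model)."""
    def fit(self, data):
        self.counts = Counter(label for _, label in data)
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1   # reserve mass for unseen labels
        return self

    def log_prob(self, words, label):
        return math.log((self.counts[label] + 1) / (self.total + self.vocab))

class LookupProxy:
    """Toy proxy conditioned on the tuple of selected words, add-one smoothed."""
    def fit(self, data):
        self.table = defaultdict(Counter)
        labels = set()
        for words, label in data:
            self.table[tuple(words)][label] += 1
            labels.add(label)
        self.vocab = len(labels) + 1
        return self

    def log_prob(self, words, label):
        counts = self.table.get(tuple(words), Counter())
        return math.log((counts[label] + 1) / (sum(counts.values()) + self.vocab))

def perplexity(proxy, data):
    nll = -sum(proxy.log_prob(w, y) for w, y in data) / len(data)
    return math.exp(nll)

def metric_score(train, test, proxies):
    """Fit every proxy on the training rules and return the lowest test PPL."""
    fitted = [proxy.fit(train) for proxy in proxies]
    return min(perplexity(proxy, test) for proxy in fitted)

# Each example pairs the selected words with the word predicted by the NMT model.
train = [(("gut", "is"), "good")] * 3 + [(("schlecht", "is"), "bad")] * 3
test = [(("gut", "is"), "good"), (("schlecht", "is"), "bad")]
score = metric_score(train, test, [UnigramProxy(), LookupProxy()])
```

Taking the minimum over proxies corresponds to the Comb setting; restricting the list to a single proxy corresponds to the single-architecture settings.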

4 Experiments

In this section, we conduct experiments to prove the effectiveness of our metric from two viewpoints: how good an explanation method is and which explanation method is better than others.

4.1 Settings


Datasets

We carry out our experiments on three standard IWSLT translation tasks including IWSLT14 DeEn (167k sentence pairs), IWSLT17 ZhEn (237k sentence pairs) and IWSLT17 FrEn (229k sentence pairs). All these datasets are tokenized and segmented with BPE (Byte-Pair Encoding) following Ott et al. (2019). The target-side vocabulary sizes of the three datasets are 8876, 11632, and 9844 respectively. In addition, we carry out extended experiments on three large-scale WMT translation tasks including WMT14 DeEn (4.5m sentence pairs), WMT17 ZhEn (22m sentence pairs) and WMT14 FrEn (40.8m sentence pairs), with vocabulary sizes 22568, 29832, and 27168 respectively.

NMT Systems

To examine the generality of our evaluation method, we conduct experiments on two NMT systems, i.e. Rnn-Search (denoted by RNN) and Transformer (denoted by Trans.), both of which are implemented with fairseq Ott et al. (2019). For RNN, we adopt the 1-layer RNN with LSTM cells whose encoder (bi-directional) and decoder hidden units are 256 and 512 respectively. For Transformer on the IWSLT datasets, the number of layers and attention heads are 2 and 4 respectively. For both models, we set the embedding dimensions as 256. On WMT datasets, we simply use Transformer-base with 4 attention heads. The performances of our NMT models are comparable to those reported in recent literature Tan et al. (2019).

Explanation Methods

On both NMT systems, we implement four explanation methods, i.e. Attention (Attn), gradient norm (Ngrad), weighted gradient (Wgrad), and prediction difference (Pd) as mentioned in Section §2.

Our metric

We implement five instantiations of the proposed metric including FN, RN, SA, Comb, and Baseline (Base for brevity) as presented in §3.3. To configure them, we adopt the same settings as the NMT systems to train SA and RN. FN is implemented by feeding bag-of-words features through a 3-layer fully connected network. As given in Algorithm 1, the approximate fidelity is estimated through the proxy model $Q$ with the lowest PPL; therefore, the best metric is the one that achieves the lowest PPL, since it results in a closer approximation to the real fidelity.

4.2 Experiments on IWSLT tasks

In this subsection, we first conduct experiments and analysis on the IWSLT DeEn task to configure the fidelity-based metric and then extend the experiments to other IWSLT tasks.

Comparison of metric instantiations

NMT    Metric  Attn    Pd     Ngrad   Wgrad
Trans  Base    196.9   54.3   193.4   13400
Trans  FN      13.9    5.8    11.3    131.2
Trans  RN      13.8    5.7    10.7    126.7
Trans  SA      13.9    5.5    10.8    119.5
Trans  Comb    13.8    5.5    10.7    119.5
RNN    Base    -       54.2   90.3    28587
RNN    FN      -       6.7    8.3     170.8
RNN    RN      -       6.5    7.8     163.2
RNN    SA      -       6.5    8.1     154.9
RNN    Comb    -       6.5    7.8     154.9
Table 1: The PPL comparison for the five metric instantiations on the IWSLT DeEn dataset.

We calculate PPL on the IWSLT DeEn dataset for four metric instantiations (FN, RN, SA, Comb) and Baseline (Base) with $k = 1$ to extract the most relevant words. Table 1 summarizes the results for two translation systems (Transformer, annotated as Trans, and Rnn-Search, annotated as RNN). Note that since there is no target-side attention in Rnn-Search, we cannot extract the most relevant target word, so Table 1 does not include the results of the Attn method for Rnn-Search.

The baseline (Base) yields undesirably high PPL, which indicates that the relevant words identified by Pd fail to lead the baseline to the same decision as the NMT system. The main reason is the mismatch between training and testing, as presented in §3.3. In contrast, the other four metric instantiations attain much lower PPL than the Baseline. In addition, the PPLs of Pd, Ngrad, and Attn are much better than those of Wgrad. This finding shows that Pd, Ngrad, and Attn are good explanation methods in terms of fidelity, whereas Wgrad is not.

Density of generalizable rules

To understand possible reasons why one explanation method is better than another under our metric, we make a naive conjecture: when a better explanation method tries to reveal the patterns that the well-trained NMT model has captured, it extracts more concentrated patterns. In other words, a generalized rule from one sentence pair can often be observed among other examples.

To measure the density of the extracted rules, we first divide all extracted rules into five bins according to their frequencies. Then we collect the number of rules in each bin as well as the total number of rules. Table 2 shows the statistics to measure the density of rules obtained from different explanation methods. From this table, we can see that the density for Pd is the highest among all explanation methods, because it contains fewer infrequent rules in the lowest-frequency bin, whereas it has more frequent rules in the other bins. This might be one possible reason that Pd is better under our fidelity-based evaluation metric.

Method  Total   Bin 1   Bin 2   Bin 3   Bin 4   Bin 5
Attn    1.97M   1.65M   298K    23.7K   1.54K   104
Pd      1.62M   1.25M   328K    31.2K   2.11K   108
Ngrad   1.89M   1.54M   326K    27.6K   1.64K   83
Wgrad   2.62M   2.37M   278K    17.5K   0.86K   34
Table 2: Density of the extracted rules from Transformer on the IWSLT DeEn dataset. The density is measured by the total number of unique rules and the number of rules whose frequency falls in each of five intervals (Bin 1 to Bin 5, ordered from least to most frequent).
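The binning behind Table 2 can be sketched as follows. The bin edges used here are hypothetical placeholders chosen for this example, as are the function name `rule_density` and the toy rules.

```python
from collections import Counter

def rule_density(rules, edges=(1, 2, 10, 100, 1000)):
    """Count unique rules and how many fall into each frequency bin.

    edges[i] is the lowest frequency of bin i (hypothetical boundaries).
    """
    freq = Counter(rules)
    bins = [0] * len(edges)
    for count in freq.values():
        # Index of the last bin whose lower edge the count reaches.
        idx = max(i for i, edge in enumerate(edges) if count >= edge)
        bins[idx] += 1
    return len(freq), bins

# Each rule: (selected source word, selected target-history word, predicted word).
rules = (
    [("gut", "is", "good")] * 12
    + [("schlecht", "is", "bad")] * 3
    + [("haus", "the", "house")]
)
total, bins = rule_density(rules)
```

A method whose rules concentrate in the higher-frequency bins reuses the same patterns across many examples, which is the sense of density discussed above.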

Stability of ranking order

In Table 1, the ranking order is Pd > Ngrad > Attn > Wgrad (from best to worst) for all five metric instantiations. Generally, a good metric should preserve the ranking order of explanation methods independently of the test dataset. Regarding this order-preserving property, we analyze the stability of different fidelity-based metric instantiations. To this end, we draw one thousand random samples of the test data with replacement, with sample sizes varying from 1% to 100% of the test set, and then calculate the rate at which the ranking order is preserved on these sampled test sets. The results in Table 3 indicate that FN, RN, SA, and Comb are more stable than Base under changes in the distribution of test sets.

According to Table 1 and Table 3, SA performs similarly to the best metric Comb and is faster than Comb or RN for training and testing. Therefore, in the rest of the experiments, we mainly employ SA to measure explanation methods.

Sample size  Base    FN      SA      RN      Comb
1%           53.0%   97.1%   99.9%   99.8%   99.8%
5%           56.1%   100%    100%    100%    100%
20%          60.8%   100%    100%    100%    100%
50%          66.8%   100%    100%    100%    100%
100%         75.4%   100%    100%    100%    100%
Table 3: The rate (percentage) of sampled test datasets that have the same ranking as the full test set on the IWSLT ZhEn dataset.
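The stability test can be sketched as a resampling loop. This is an illustrative simplification that assumes per-example negative log-likelihoods are already available for each explanation method; the names `ranking`, `preservation_rate` and the constant toy values are hypothetical.

```python
import math
import random

def ranking(nll_by_method, idx):
    """Order methods from lowest to highest PPL on the examples in idx."""
    ppl = {
        method: math.exp(sum(values[i] for i in idx) / len(idx))
        for method, values in nll_by_method.items()
    }
    return sorted(ppl, key=ppl.get)

def preservation_rate(nll_by_method, fraction, trials=1000, seed=0):
    """Share of resampled test sets whose ranking matches the full test set."""
    rng = random.Random(seed)
    n = len(next(iter(nll_by_method.values())))
    reference = ranking(nll_by_method, range(n))
    size = max(1, int(fraction * n))
    hits = 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(size)]   # sample with replacement
        hits += ranking(nll_by_method, idx) == reference
    return hits / trials

# Toy per-example NLLs in which the order Pd < Ngrad < Attn never changes.
nll = {"Pd": [1.0] * 50, "Ngrad": [2.0] * 50, "Attn": [3.0] * 50}
rate = preservation_rate(nll, fraction=0.05)
```

With noisier per-example values, the rate would drop for small fractions, which is the effect Table 3 measures for the Base metric.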

Effects of different $k$

In this experiment, we examine the effects of explanation methods with larger $k$ with respect to SA. Figure 1 depicts the effects of $k$ for Transformer on the DeEn task. One can clearly observe three findings: 1) the ranking order of explanation methods is invariant for different $k$; 2) as $k$ grows, the PPL becomes much better for each explanation method; 3) the PPL improvement for Pd, Attn, and Ngrad diminishes beyond a small $k$, which further validates that they are powerful in explaining NMT using only a few words.

Figure 1: PPL for each explanation method on Transformer over the IWSLT DeEn dataset with different values of $k$.

NMT    Method  ZhEn Base  ZhEn SA  FrEn Base  FrEn SA
Trans  Attn    897.1      30.8     359.6      12.1
Trans  Pd      215.1      10.8     55.3       4.6
Trans  Ngrad   583.7      19       271.0      8.7
Trans  Wgrad   24126      180.9    44287      155.4
RNN    Attn    -          -        -          -
RNN    Pd      139.9      11.3     49.0       5.5
RNN    Ngrad   263.0      13.2     85.8       6.7
RNN    Wgrad   23068      243.1    50657      194.9
Table 4: The PPL comparison for two fidelity-based metric instantiations on two IWSLT datasets.

Testing on other scenarios

In the previous experiments, our metric instantiations are trained and evaluated under the same scenario, where the context $c_t$ used to extract relevant words is obtained from gold data and its label $f(c_t)$ is the prediction from the NMT model $P$, namely Teacher Forcing Decode. To examine the robustness of our metric, we apply the trained metric to two different scenarios: a real decoding scenario (Real-Decode), where both $c_t$ and its label are from the NMT output; and a golden data scenario (Golden-Data), where both $c_t$ and its label are from golden test data. The results for both scenarios are shown in Table 5.

From Table 5, we see that the ranking order for both scenarios is the same as before. To our surprise, the results in Real-Decode are even better than those in the matched Teacher Forcing Decode scenario. One possible reason is that the labels generated by an NMT system in Real-Decode tend to be high-frequency words, which leads to better PPL. In contrast, our metric instantiation in Golden-Data results in much higher PPL due to the mismatch between training and testing. Training and testing the metric in the same scenario, such as Golden-Data, can be explored in future work; however, it is not the focus of this paper.

Methods R-Dec Golden T-Dec
Attn 11.5 57.1 13.8
Pd 4.7 23.3 5.5
Ngrad 8.2 42.0 10.7
Wgrad 115.0 223.4 119.5
Table 5: Evaluating four explanation methods in three different scenarios (Real-Decode (R-Dec), Golden-Data (Golden) and Teacher-Forcing Decode (T-Dec)) for Transformer on the IWSLT DeEn task.

4.3 Scalability on WMT tasks

Since a metric instantiation such as SA requires extracting generalized rules for each explanation method from the entire training dataset, it is computationally expensive to run some explanation methods, such as the gradient methods, directly on WMT tasks with large-scale training data.

Effects on sample size

We randomly sample subsets of the WMT ZhEn training data, which includes 22 million sentence pairs, to form several new training sets. The sample sizes of the new training sets range up to 2 million, and the results are illustrated in Figure 2. The following facts are revealed. First, the ranking order of the four explanation methods remains unchanged across different sample sizes. Second, as the sample size increases, the metric score decreases more and more slowly, and there is no significant drop when moving from 1 million to 2 million sampled sentence pairs.

Figure 2: PPL for each explanation method on Transformer over the WMT ZhEn task with different sample sizes.

Dataset   Method  Base PPL  Base Rank  SA PPL  SA Rank
WMT ZhEn  Attn    336.4     2*         27.3    3*
WMT ZhEn  Pd      165.3     1          7.7     1
WMT ZhEn  Ngrad   435.2     3*         16.5    2*
WMT ZhEn  Wgrad   1615.5    4          263.5   4
WMT DeEn  Attn    1862.3    2*         17.0    3*
WMT DeEn  Pd      1118.2    1          5.4     1
WMT DeEn  Ngrad   2827.7    3*         15.1    2*
WMT DeEn  Wgrad   6678.1    4          197.4   4
WMT FrEn  Attn    4271.0    3          41.1    3
WMT FrEn  Pd      1646.6    1          4.1     1
WMT FrEn  Ngrad   2810.2    2          11.8    2
WMT FrEn  Wgrad   6703.8    4          163.7   4
Table 6: The PPL and ranking order comparison between two fidelity-based metric instantiations (Base and SA) on three WMT datasets. “*” denotes a mismatch of ranking order.

Results on WMT

Based on the analysis of various sample sizes, we choose a sample size of 1 million for the following scaling experiments. The PPL results for WMT DeEn, ZhEn, and FrEn are listed in Table 6. We can see that the order Pd > Ngrad > Attn > Wgrad evaluated by SA remains unchanged on these three datasets. One can observe that the ranking order under the baseline does not agree with SA on WMT DeEn and ZhEn. Since the baseline yields high PPL due to the mismatch mentioned in §3.3, in this case we tend to trust the evaluation results from SA, which achieves lower PPL and thus better fidelity.

4.4 Relation to Alignment Error Rate

Dataset     Method  SA PPL  SA Rank  AER    AER Rank
IWSLT ZhEn  Attn    30.8    3        55.0   3
IWSLT ZhEn  Pd      10.8    1        50.6   1
IWSLT ZhEn  Ngrad   19      2        52.9   2
IWSLT ZhEn  Wgrad   180.9   4        79.2   4
WMT ZhEn    Attn    27.3    3*       42.1   2*
WMT ZhEn    Pd      7.7     1        32.7   1
WMT ZhEn    Ngrad   16.5    2*       49.3   3*
WMT ZhEn    Wgrad   263.5   4        79.2   4
WMT DeEn    Attn    17.0    3        48.7   3
WMT DeEn    Pd      5.4     1        34.1   1
WMT DeEn    Ngrad   15.1    2        48.1   2
WMT DeEn    Wgrad   194.7   4        73.5   4
Table 7: Relation with word alignment. “*” denotes a mismatch of ranking order.

Figure 3: AER cannot evaluate explanation methods on target words such as “as a result of”, which are not aligned to any word in the source sentence according to human annotation.

Since the calculation of the Alignment Error Rate (AER) requires manually annotated test datasets with ground-truth word alignments, we select three different test datasets containing such alignments for experiments, namely, IWSLT ZhEn, NIST05 ZhEn and Zenkel DeEn Zenkel et al. (2019). Note that unaligned target words account for 7.8%, 4.7%, and 9.2% of these three test sets respectively, and they are skipped by AER when evaluating explanation methods. For example, in Figure 3, the target words “as a result of” cannot be covered by AER because they have no human-annotated alignment, but they can still be analyzed with a fidelity-based metric.

Footnote 5: https://www.ldc.upenn.edu/collaborations/evaluations/nist

Table 7 demonstrates that our fidelity-based metric does not agree very well with AER on the WMT ZhEn task: Ngrad is better than Attn in terms of SA, but the result is the opposite in terms of AER. Since the evaluation criteria of SA and AER are different, it is reasonable that their evaluation results differ. This finding is in line with the standpoint of Jacovi and Goldberg (2020): SA is an objective metric that reflects the fidelity of models, while AER is a subjective metric based on human evaluation. However, we observe that the ranking by SA is consistent on all three tasks, whereas the ranking by AER is highly dependent on the task.
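For reference, AER can be sketched as below, using the commonly cited definition with sure and possible gold links. Deriving an alignment by linking each target word to its highest-scoring source word is an assumption made for this example, as are the function names and toy scores.

```python
def alignment_from_scores(scores):
    """Link each target word t to the source word with the largest relevance.

    scores[t][i] is the relevance of source word i for target word t.
    Returns a set of (source index, target index) pairs.
    """
    return {
        (max(range(len(row)), key=row.__getitem__), t)
        for t, row in enumerate(scores)
    }

def aer(predicted, sure, possible=None):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with S a subset of P."""
    possible = sure | (possible or set())
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / (len(predicted) + len(sure))

# Three target words; the gold annotation leaves the third one unaligned.
scores = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]
predicted = alignment_from_scores(scores)
gold_sure = {(0, 0), (1, 1)}
error = aer(predicted, gold_sure)
```

In this toy case the third target word has no gold link, so any link predicted for it can only count as an error and never as a match, which illustrates why AER is uninformative for unaligned target words.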

5 Related Work

In recent years, explaining deep neural models has been a growing interest in the deep learning community, aiming at more comprehensible and trustworthy neural models. In this section, we mainly discuss two dominant approaches. One way is to develop explanation methods to interpret a target black-box neural network Bach et al. (2015); Zintgraf et al. (2017). For example, on classification tasks, Bach et al. (2015) propose layer-wise relevance propagation to visualize the relationship between a pair of neurons within networks, and Li et al. (2016) introduce a gradient-based approach to understanding the compositionality in neural networks for NLP. In particular, on structured prediction tasks, many research works design similar methods to understand NMT models Ding et al. (2017); Alvarez-Melis and Jaakkola (2017); Ding et al. (2019); He et al. (2019).

The other way is to construct an interpretable model for the target network and then indirectly interpret its behavior to understand the target network on classification tasks Lei et al. (2016); Murdoch and Szlam (2017); Arras et al. (2017); Wang et al. (2019). The interpretable model is defined on top of extracted rational evidence and learned by model distillation from the target network. To extract rational evidence from the entire inputs, one either leverages a particular explanation method Lei et al. (2016); Wang et al. (2019) or an auxiliary evidence extraction model Murdoch and Szlam (2017); Arras et al. (2017). Although our work focuses on evaluating explanation methods and does not aim to construct an interpretable model, we draw inspiration from their ideas to design the proxy model $Q$ in Eq.(6) for our evaluation metric.

Despite the increasing efforts on designing new explanation methods, only a few works have been proposed to evaluate them. Mohseni and Ragan (2018) propose a paradigm to evaluate explanation methods for document classification that involves human judgment. Poerner et al. (2018) conduct the first human-independent comprehensive evaluation of explanation methods for NLP tasks. However, their metrics are task-specific because they make some assumptions for a specific task. Our work proposes a principled metric to evaluate explanation methods for NMT, and our evaluation paradigm is independent of any such assumptions as well as of humans. It is worth noting that Arras et al. (2016); Denil et al. (2014) directly measure the performance of the target model $P$ on the extracted words without constructing a proxy model $Q$ to evaluate explanation methods for classification tasks. However, since translation is more complex than classification, $P$ trained on the entire context $c_t$ typically makes very poor predictions when tested on the compressed context $W^k_\phi(c_t)$. As a result, the poor prediction performance makes it difficult to discriminate one explanation method from others, as observed in our internal experiments. Concurrently, Jacovi and Goldberg (2020) propose evaluating the faithfulness of an explanation method separately from readability and plausibility (i.e., human-interpretability), which is similar to our definition of fidelity, but they do not formalize a metric or propose algorithms to measure it.

6 Conclusions

This paper has made an initial attempt to evaluate explanation methods from a new viewpoint. It has presented a principled metric based on fidelity in regard to the predictive behavior of the NMT model. Since exact calculation of this metric is intractable for a given explanation method, the paper proposes an approximate approach to the underlying minimization problem. The proposed approach does not rely on human annotation and can be used to evaluate explanation methods on all target words. On six standard translation tasks, the metric quantitatively evaluates and compares four different explanation methods for two popular translation models. Experiments reveal that Pd, Ngrad, and Attn are all good explanation methods that can reconstruct the NMT model’s predictions with relatively low perplexity, and that Pd shows the best fidelity among them.


We would like to thank all anonymous reviewers for their valuable suggestions. This research was supported by Tencent AI Lab.


  • D. Alvarez-Melis and T. Jaakkola (2017) A causal framework for explaining the predictions of black-box sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 412–421. Cited by: §1, §5.
  • L. Arras, F. Horn, G. Montavon, K. Müller, and W. Samek (2016) Explaining predictions of non-linear classifiers in NLP. In Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 1–7. Cited by: §3.3, §5.
  • L. Arras, F. Horn, G. Montavon, K. Müller, and W. Samek (2017) "What is relevant in a text document?": an interpretable machine learning approach. PloS one 12 (8), pp. e0181142. Cited by: §5.
  • S. Bach, A. Binder, G. Montavon, F. Klauschen, K. Müller, and W. Samek (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140. Cited by: §5.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §1, §2.1, §2.2.
  • O. Bastani, C. Kim, and H. Bastani (2017) Interpreting blackbox models via model extraction. arXiv preprint arXiv:1705.08504. Cited by: §3.1.
  • C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541. Cited by: §3.1.
  • M. Denil, A. Demiraj, and N. De Freitas (2014) Extraction of salient sentences from labelled documents. arXiv preprint arXiv:1412.6815. Cited by: §3.3, §5.
  • S. Ding, H. Xu, and P. Koehn (2019) Saliency-driven word alignment interpretation for neural machine translation. In Proceedings of WMT, pp. 1. Cited by: §1, §1, 2nd item, §5.
  • Y. Ding, Y. Liu, H. Luan, and M. Sun (2017) Visualizing and understanding neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1150–1159. Cited by: §1, §5.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1243–1252. Cited by: §1.
  • S. He, Z. Tu, X. Wang, L. Wang, M. Lyu, and S. Shi (2019) Towards understanding neural machine translation with word importance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 952–961. Cited by: §1, §5.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §3.1.
  • A. Jacovi and Y. Goldberg (2020) Towards faithfully interpretable NLP systems: how should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685. Cited by: §4.4, §5.
  • P. Koehn (2009) Statistical machine translation. Cambridge University Press. Cited by: footnote 2.
  • H. Lakkaraju, S. H. Bach, and J. Leskovec (2016) Interpretable decision sets: a joint framework for description and prediction. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1675–1684. Cited by: §3.1.
  • T. Lei, R. Barzilay, and T. Jaakkola (2016) Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 107–117. Cited by: §5.
  • J. Li, X. Chen, E. Hovy, and D. Jurafsky (2016) Visualizing and understanding neural models in NLP. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 681–691. Cited by: §1, 1st item, §5.
  • X. Li, G. Li, L. Liu, M. Meng, and S. Shi (2019) On the word alignment from neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1293–1303. Cited by: §1, §1, §2.2.
  • L. Liu, M. Utiyama, A. Finch, and E. Sumita (2016) Neural machine translation with supervised attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 3093–3102. Cited by: §2.2.
  • Y. Liu, W. Che, H. Zhao, B. Qin, and T. Liu (2018) Distilling knowledge for search-based structured prediction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1393–1402. Cited by: §3.1.
  • H. Mi, Z. Wang, and A. Ittycheriah (2016) Supervised attentions for neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2283–2288. Cited by: §2.2.
  • S. Mohseni and E. D. Ragan (2018) A human-grounded evaluation benchmark for local explanations of machine learning. arXiv preprint arXiv:1801.05075. Cited by: §5.
  • W. J. Murdoch and A. Szlam (2017) Automatic rule extraction from long short term memory networks. In International Conference on Learning Representations. Cited by: §5.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations. Cited by: §4.1, §4.1.
  • N. Poerner, H. Schütze, and B. Roth (2018) Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 340–350. Cited by: §2.2, §5.
  • A. Polino, R. Pascanu, and D. Alistarh (2018) Model compression via distillation and quantization. CoRR abs/1802.05668. Cited by: §3.1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112. Cited by: §1.
  • X. Tan, Y. Ren, D. He, T. Qin, and T. Liu (2019) Multilingual neural machine translation with knowledge distillation. In International Conference on Learning Representations. Cited by: §4.1.
  • Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 76–85. Cited by: §2.2.
  • V. N. Vapnik (1999) An overview of statistical learning theory. IEEE transactions on neural networks 10 (5), pp. 988–999. Cited by: §1, §3.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §1, §2.1, footnote 1.
  • Z. Wang, Y. Zhang, M. Yu, W. Zhang, L. Pan, L. Song, K. Xu, and Y. El-Kurdi (2019) Multi-granular text encoding for self-explaining categorization. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, pp. 41–45. Cited by: §5.
  • T. Zenkel, J. Wuebker, and J. DeNero (2019) Adding interpretable attention to neural translation models improves word alignment. arXiv preprint arXiv:1901.11359. Cited by: §2.2, §4.4.
  • L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling (2017) Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595. Cited by: §5.