Handling Homographs in Neural Machine Translation

08/22/2017 ∙ by Frederick Liu, et al. ∙ Carnegie Mellon University

Homographs, words with different meanings but the same surface form, have long caused difficulty for machine translation systems, as it is difficult to select the correct translation based on the context. However, with the advent of neural machine translation (NMT) systems, which can theoretically take into account global sentential context, one may hypothesize that this problem has been alleviated. In this paper, we first provide empirical evidence that existing NMT systems in fact still have significant problems in properly translating ambiguous words. We then proceed to describe methods, inspired by the word sense disambiguation literature, that model the context of the input word with context-aware word embeddings that help to differentiate the word sense before feeding it into the encoder. Experiments on three language pairs demonstrate that such models improve the performance of NMT systems both in terms of BLEU score and in the accuracy of translating homographs.




1 Introduction

Neural machine translation (NMT; Sutskever et al. (2014); Bahdanau et al. (2015), §2), a method for MT that performs translation in an end-to-end fashion using neural networks, is quickly becoming the de-facto standard in MT applications due to its impressive empirical results. One of the drivers behind these results is the ability of NMT to capture long-distance context using recurrent neural networks in both the encoder, which takes the input and turns it into a continuous-space representation, and the decoder, which tracks the target-sentence state, deciding which word to output next. As a result of this ability to capture long-distance dependencies, NMT has achieved great improvements in a number of areas that have bedeviled traditional methods such as phrase-based MT (PBMT; Koehn et al. (2003)), including agreement and long-distance syntactic dependencies Neubig et al. (2015); Bentivogli et al. (2016).

Figure 1: Homographs where the baseline system makes mistakes (red words) but our proposed system incorporating a more direct representation of context achieves the correct translation (blue words). Definitions of the corresponding blue and red words are in parentheses.

One other phenomenon that was poorly handled by PBMT was homographs – words that have the same surface form but multiple senses. As a result, PBMT systems required specific separate modules to incorporate long-term context, performing word-sense Carpuat and Wu (2007b); Pu et al. (2017) or phrase-sense Carpuat and Wu (2007a) disambiguation to improve their handling of these phenomena. Thus, we may wonder: do NMT systems suffer from the same problems when translating homographs? Or are the recurrent nets applied in the encoding step, and the strong language model in the decoding step enough to alleviate all problems of word sense ambiguity?

In §3 we first attempt to answer this question quantitatively by examining the word translation accuracy of a baseline NMT system as a function of the number of senses that each word has. Results demonstrate that standard NMT systems make a significant number of errors on homographs, a few of which are shown in Fig. 1.

With this result in hand, we propose a method for more directly capturing contextual information that may help disambiguate difficult-to-translate homographs. Specifically, we learn from neural models for word sense disambiguation Kalchbrenner et al. (2014); Iyyer et al. (2015); Kågebäck and Salomonsson (2016); Yuan et al. (2016); Šuster et al. (2016), examining three methods inspired by this literature (§4). In order to incorporate this information into NMT, we examine two methods: gating the word embeddings in the model (similarly to Choi et al. (2017)), and concatenating the context-aware representation to the word embedding (§5).

To evaluate the effectiveness of our method, we compare our context-aware models with a strong baseline Luong et al. (2015) on the English-German, English-French, and English-Chinese WMT datasets. We show that our proposed model outperforms the baseline in overall BLEU score across all three language pairs. Quantitative analysis demonstrates that our model performs better on translating homographs. Lastly, we show sample translations of the baseline system and our proposed model.

2 Neural Machine Translation

We follow the global-general-attention NMT architecture with input-feeding proposed by Luong et al. (2015), which we will briefly summarize here. The neural network models the conditional distribution over translations $Y$ given a sentence in the source language $X$ as $P(Y|X)$. An NMT system consists of an encoder that summarizes the source sentence $X$ as a vector representation $h$, and a decoder that generates a target word at each time step conditioned on both $h$ and previous words. The conditional distribution is optimized with cross-entropy loss at each decoder output.

The encoder is usually a uni-directional or bi-directional RNN that reads the input sentence word by word. In the more standard bi-directional case, before being read by the RNN unit, each word $x_t$ in $X$ is mapped to an embedding in continuous vector space by a function $f_e$.

$$f_e(x_t) = M_e^{\top} \mathbf{1}(x_t) \qquad (1)$$

$M_e \in \mathbb{R}^{|V_s| \times d}$ is a matrix that maps a one-hot representation of $x_t$, $\mathbf{1}(x_t)$, to a $d$-dimensional vector space, and $V_s$ is the source vocabulary. We call the word embedding computed this way Lookup embedding. The word embeddings are then read by a bi-directional RNN

$$\overrightarrow{h}_t = \overrightarrow{\mathrm{RNN}}_e(\overrightarrow{h}_{t-1}, f_e(x_t)) \qquad (2)$$

$$\overleftarrow{h}_t = \overleftarrow{\mathrm{RNN}}_e(\overleftarrow{h}_{t+1}, f_e(x_t)) \qquad (3)$$

After being read by both RNNs we can compute the actual hidden state at step $t$, $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$, and the encoder summarized representation $h = h_{|X|}$. The recurrent units $\overrightarrow{\mathrm{RNN}}_e$ and $\overleftarrow{\mathrm{RNN}}_e$ are usually either LSTMs Hochreiter and Schmidhuber (1997) or GRUs Chung et al. (2014).

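The decoder is a uni-directional RNN that decodes the $t$-th target word conditioned on (1) the previous decoder hidden state $g_{t-1}$, (2) the previous word $y_{t-1}$, and (3) the weighted sum of encoder hidden states $a_t$. The decoder maintains the $t$-th hidden state $g_t$ as follows,

$$g_t = \overrightarrow{\mathrm{RNN}}_d(g_{t-1}, f_d(y_{t-1}), a_t) \qquad (4)$$

Again, $\overrightarrow{\mathrm{RNN}}_d$ is either an LSTM or a GRU, and $f_d$ is a mapping function in target language space.

The general attention mechanism for computing the weighted encoder hidden states $a_t$ first computes the similarity between $g_{t-1}$ and $h_{t'}$ for $t' = 1, 2, \ldots, |X|$.

$$\mathrm{score}(g_{t-1}, h_{t'}) = g_{t-1}^{\top} W_{att} h_{t'} \qquad (5)$$

The similarities are then normalized through a softmax layer, which results in the weights for encoder hidden states.

$$\alpha_{t,t'} = \frac{\exp(\mathrm{score}(g_{t-1}, h_{t'}))}{\sum_{k=1}^{|X|} \exp(\mathrm{score}(g_{t-1}, h_k))} \qquad (6)$$

We can then compute $a_t$ as follows,

$$a_t = \sum_{k=1}^{|X|} \alpha_{t,k} h_k \qquad (7)$$

Finally, we compute the distribution over $y_t$ as,

$$\hat{g}_t = \tanh(W_1 [g_t; a_t]) \qquad (8)$$

$$p(y_t \mid y_{<t}, X) = \mathrm{softmax}(W_2 \hat{g}_t) \qquad (9)$$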

Figure 2: Translation performance of words with different numbers of senses.
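The general attention step summarized in §2 (a bilinear similarity between a decoder state and each encoder state, softmax normalization, then a weighted sum of encoder states) can be sketched in plain Python. This is a minimal illustration rather than the OpenNMT implementation; the function name `general_attention` and the list-of-lists matrix layout are assumptions made for the example.

```python
import math


def general_attention(g_prev, encoder_states, w_att):
    """One attention step: returns (weights, context).

    g_prev:         decoder hidden state, list of m floats
    encoder_states: list of encoder hidden states h_k, each a list of n floats
    w_att:          m x n attention matrix given as a list of rows (assumed layout)
    """
    # Bilinear score g^T W h_k, computed as (g^T W) . h_k for every k.
    projected = [
        sum(g_prev[i] * w_att[i][j] for i in range(len(g_prev)))
        for j in range(len(w_att[0]))
    ]
    scores = [sum(p * h for p, h in zip(projected, h_k)) for h_k in encoder_states]

    # Softmax over source positions (subtracting the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention context: weighted sum of the encoder hidden states.
    dim = len(encoder_states[0])
    context = [
        sum(w * h_k[j] for w, h_k in zip(weights, encoder_states))
        for j in range(dim)
    ]
    return weights, context
```

The weights always sum to one, so the context vector is a convex combination of the encoder states; source positions whose states score higher against the decoder state contribute more.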

3 NMT’s Problems with Homographs

As described in Eqs. (2) and (3), NMT models encode the words using recurrent encoders, theoretically endowing them with the ability to handle homographs through global sentential context. However, despite the fact that they have this ability, our qualitative observation of NMT results revealed a significant number of ambiguous words being translated incorrectly, casting doubt on whether the standard NMT setup is able to appropriately learn parameters that disambiguate these word choices.

To demonstrate this more concretely, in Fig. 2 we show the translation accuracy of an NMT system with respect to words of varying levels of ambiguity. Specifically, we use the best baseline NMT system to translate three different language pairs from the WMT test sets (detailed in §6) and plot the F1-score of word translations by the number of senses that they have. The number of senses for a word is acquired from the Cambridge English dictionary (http://dictionary.cambridge.org/us/dictionary/english/), after excluding stop words (we use the stop word list from NLTK Bird et al. (2009)).

We evaluate the translation performance of words on the source side by aligning them to the target side using fast-align Dyer et al. (2013). For each source word, the aligner outputs the set of target words to which it aligns, both for the reference translation and for the model translation. The F1 score is calculated between these two sets of words.
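As a minimal sketch of this word-level score, the function below computes F1 between the set of target words aligned to a source word in the reference and the set aligned to it in the system output. The function name `set_f1` and the choice to return 0.0 when there is no overlap are assumptions for illustration; obtaining the alignments themselves (with fast-align) is outside the sketch.

```python
def set_f1(reference_words, hypothesis_words):
    """F1 between two sets of aligned target words for one source word."""
    ref, hyp = set(reference_words), set(hypothesis_words)
    overlap = len(ref & hyp)
    if overlap == 0:
        # No shared word (or an empty set): treated as a score of zero here.
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, if the reference aligns a source word to two target words and the system output recovers only one of them, precision is 1, recall is 0.5, and F1 is 2/3.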

After acquiring the F1 score for each word, we bucket the F1 scores by the number of senses, and plot the average score of four consecutive buckets as shown in Fig. 2. As we can see from the results, the F1 score for words decreases as the number of senses increases for all three language pairs. This demonstrates that current NMT systems translate words with more senses substantially less accurately than words with fewer senses. From this result, it is evident that modern NMT architectures are not enough to resolve the problem of homographs on their own. The result corresponds to the findings in prior work Rios et al. (2017).
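The bucketing step can be sketched as follows. The exact grouping is not spelled out in the text, so this example assumes that "four consecutive buckets" means sense counts 1-4, 5-8, and so on, and that each group's value is the mean of its per-sense-count means; the function name and the dictionary inputs are hypothetical.

```python
from collections import defaultdict


def average_f1_by_senses(word_f1, word_senses, group_size=4):
    """Average word-level F1 over groups of consecutive sense counts.

    word_f1:     {word: F1 score}
    word_senses: {word: number of dictionary senses}
    Returns {(lowest_sense_count, highest_sense_count): mean F1}.
    """
    # Bucket F1 scores by the exact number of senses.
    buckets = defaultdict(list)
    for word, f1 in word_f1.items():
        if word in word_senses:
            buckets[word_senses[word]].append(f1)
    bucket_means = {n: sum(v) / len(v) for n, v in buckets.items()}

    # Merge every `group_size` consecutive sense counts into one plotted point.
    grouped = defaultdict(list)
    for n, mean in bucket_means.items():
        grouped[(n - 1) // group_size].append(mean)
    return {
        (g * group_size + 1, (g + 1) * group_size): sum(v) / len(v)
        for g, v in sorted(grouped.items())
    }
```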

4 Neural Word Sense Disambiguation

Word sense disambiguation (WSD) is the task of resolving the ambiguity of homographs Ng and Lee (1996); Mihalcea and Faruque (2004); Zhong and Ng (2010); Di Marco and Navigli (2013); Chen et al. (2014); Camacho-Collados et al. (2015), and we hypothesize that by learning from these models we can improve the ability of the NMT model to choose the correct translation for these ambiguous words. Recent research tackles this problem with neural models and has shown state-of-the-art results on WSD datasets Kågebäck and Salomonsson (2016); Yuan et al. (2016). In this section, we summarize three methods for WSD, which we will then use as three different context networks to improve NMT.

Neural bag-of-words (NBOW)

Kalchbrenner et al. (2014) and Iyyer et al. (2015) have shown success by representing full sentences with a context vector, which is the average of the Lookup embeddings of the input sequence

$$c_t = \frac{1}{|X|} \sum_{k=1}^{|X|} M_c^{\top} \mathbf{1}(x_k) \qquad (10)$$

where $M_c$ is the embedding matrix of the context network. This is a simple way to model sentences, but has the potential to capture the global topic of the sentence in a straightforward and coherent way. However, in this case, the context vector would be the same for every word in the input sequence.
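A minimal sketch of the NBOW context vector, assuming the word embeddings have already been looked up and are given as plain lists of floats (the function name `nbow_context` is hypothetical):

```python
def nbow_context(embeddings):
    """Return one context vector per input position.

    Every position receives the same vector: the element-wise average of
    all word embeddings in the sentence.
    """
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(vec[j] for vec in embeddings) / n for j in range(dim)]
    # Copy the mean for each position so callers can treat NBOW like the
    # position-specific context networks.
    return [list(mean) for _ in embeddings]
```

Returning one (identical) vector per position makes the limitation noted above explicit: nothing in the context distinguishes one word of the sentence from another.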

Bi-directional LSTM (BiLSTM)

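Kågebäck and Salomonsson (2016) leveraged a bi-directional LSTM that learns a context vector for the target word in the input sequence and predicts the word sense with a multi-layer perceptron. Specifically, we can compute the context vector $c_t$ for the $t$-th word similarly to the bi-directional encoder as follows,

$$\overrightarrow{c}_t = \overrightarrow{\mathrm{LSTM}}_c(\overrightarrow{c}_{t-1}, f_c(x_t)) \qquad (11)$$

$$\overleftarrow{c}_t = \overleftarrow{\mathrm{LSTM}}_c(\overleftarrow{c}_{t+1}, f_c(x_t)) \qquad (12)$$

$$c_t = [\overrightarrow{c}_t; \overleftarrow{c}_t] \qquad (13)$$

$\overrightarrow{\mathrm{LSTM}}_c$ and $\overleftarrow{\mathrm{LSTM}}_c$ are forward and backward LSTMs respectively, and $f_c(x_t) = M_c^{\top} \mathbf{1}(x_t)$ is a function that maps a word to continuous embedding space.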
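The data flow of the bi-directional context network can be sketched as below. To keep the example short and dependency-free, `toy_cell` is a deliberately simplified stand-in for an LSTM cell (it is not an LSTM and has no learned parameters); only the forward pass, backward pass, and per-position concatenation are meant to mirror the description above.

```python
import math


def toy_cell(state, x):
    """Hypothetical stand-in for an LSTM cell: mixes the previous state and input."""
    return [math.tanh(0.5 * s + 0.5 * v) for s, v in zip(state, x)]


def bidirectional_context(embeddings, cell=toy_cell):
    """Return one context vector per position: [forward state; backward state]."""
    dim = len(embeddings[0])

    # Left-to-right pass.
    forward, state = [], [0.0] * dim
    for x in embeddings:
        state = cell(state, x)
        forward.append(state)

    # Right-to-left pass, then restore the original order.
    backward, state = [], [0.0] * dim
    for x in reversed(embeddings):
        state = cell(state, x)
        backward.append(state)
    backward.reverse()

    # Concatenate both directions at each position.
    return [f + b for f, b in zip(forward, backward)]
```

Unlike the averaged context, each position here receives a different vector, because the forward state summarizes the words to its left and the backward state the words to its right.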

Held-out LSTM (HoLSTM)

Yuan et al. (2016) trained an LSTM language model, which predicts a held-out word given the surrounding context, with a large amount of unlabeled text as training data. Given the context vector from this language model, they predict the word sense with a WSD classifier. Specifically, we can compute the context vector $c_t$ for the $t$-th word by first replacing the $t$-th word with a special symbol (e.g. "$"), which yields a replaced sequence $\tilde{x}_1, \ldots, \tilde{x}_{|X|}$. We then feed the replaced sequence to a uni-directional LSTM:

$$\tilde{c}_i = \overrightarrow{\mathrm{LSTM}}_c(\tilde{c}_{i-1}, f_c(\tilde{x}_i)) \qquad (14)$$

Finally, we can get the context vector for the $t$-th word

$$c_t = \tilde{c}_{|X|} \qquad (15)$$

$\overrightarrow{\mathrm{LSTM}}_c$ and $f_c$ are defined in the BiLSTM paragraph, and $|X|$ is the length of the sequence. Although the context vector is always the last hidden state of the LSTM no matter which word we are targeting, the input sequence read by the HoLSTM is different every time.
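The held-out procedure can be sketched as follows: one modified copy of the sentence is built per target position, and the final state of a left-to-right pass over that copy serves as the context vector. The helper names, and the `embed` and `cell` callables passed in, are hypothetical placeholders for the embedding lookup and the LSTM cell.

```python
def held_out_sequences(tokens, placeholder="$"):
    """Build one copy of the sentence per position, with that position masked."""
    return [tokens[:t] + [placeholder] + tokens[t + 1:] for t in range(len(tokens))]


def held_out_context(tokens, embed, cell, dim, placeholder="$"):
    """Context vector for each position: last state of a pass over the masked copy.

    embed: maps a token to a list of `dim` floats (assumed to handle the placeholder)
    cell:  maps (state, input) to the next state; stands in for an LSTM cell
    """
    contexts = []
    for sequence in held_out_sequences(tokens, placeholder):
        state = [0.0] * dim
        for token in sequence:
            state = cell(state, embed(token))
        contexts.append(state)  # always the final hidden state
    return contexts
```

Because every position requires its own pass over the whole sentence, this sketch costs one full sequence encoding per word, in contrast to the single forward and backward pass of the bi-directional variant.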

5 Adding Context to NMT

Figure 3: Illustration of our proposed model. The context network is a differentiable network that computes the context vector $c_t$ for word $x_t$, taking the whole sequence as input. The combination node represents the operation that combines the original word embedding with the corresponding context vector to form context-aware word embeddings.

Now that we have several methods to incorporate global context regarding a single word, it is necessary to incorporate this context with NMT. Specifically, we propose two methods to either Gate or Concatenate a context vector with the Lookup embedding to form a context-aware word embedding before feeding it into the encoder as shown in Fig. 3. The detail of these methods is described below.


Gate

Inspired by Choi et al. (2017), as our first method for integration of context-aware word embeddings, we use a gating function as follows:

$$f'_e(x_t) = f_e(x_t) \odot \sigma(c_t) \qquad (16)$$

The symbol $\odot$ represents element-wise multiplication, and $\sigma$ is the element-wise sigmoid function. Choi et al. (2017) use this method in concert with averaged embeddings from words in the source language, like the NBOW model above, which naturally uses the same context vector for all time steps. In this paper, we additionally test this function with context vectors calculated using the BiLSTM and HoLSTM.


Concatenate

We also propose another way of incorporating context: concatenating the context vector with the word embeddings. This is expressed as below:

$$f'_e(x_t) = W_3 [f_e(x_t); c_t] \qquad (17)$$

$W_3 \in \mathbb{R}^{d \times 2d}$ is used to project the concatenated vector back to the original $d$-dimensional space.

For each method, we can compute the context vector $c_t$ with either the NBOW, BiLSTM, or HoLSTM described in §4. We share the parameters in $f_e$ with $f_c$ (i.e. $M_e = M_c$) since the vocabulary space is the same for the context network and the encoder. As a result, our context network only slightly increases the number of model parameters. Details about the number of parameters of each model we use in the experiments are shown in Table 1.
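The two integration methods can be sketched side by side. The gating form used here (the Lookup embedding multiplied element-wise by a sigmoid of the context vector) and the row-major projection matrix are assumptions consistent with the description above, not a transcription of the authors' code; the function names are hypothetical.

```python
import math


def gate(embedding, context):
    """Gate: scale each embedding dimension by sigmoid(context dimension)."""
    return [e * (1.0 / (1.0 + math.exp(-c))) for e, c in zip(embedding, context)]


def concat(embedding, context, w_proj):
    """Concatenate: join embedding and context, then project back to d dimensions.

    w_proj is a d x 2d matrix given as a list of d rows of length 2d.
    """
    joined = embedding + context
    return [sum(w * v for w, v in zip(row, joined)) for row in w_proj]
```

Gating requires the context vector to have the same dimensionality as the embedding and adds no parameters, whereas concatenation adds one projection matrix but lets the model mix embedding and context dimensions freely.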

Context   Integration   uni/bi   #layers   #params   Ppl    WMT14   WMT15
None      -             uni      2         85M       7.12   20.49   22.95
None      -             bi       2         83M       7.20   21.05   23.83   (base)
None      -             bi       3         86M       7.50   20.86   23.14
NBOW      Concat        uni      2         85M       7.23   20.44   22.83
NBOW      Concat        bi       2         83M       7.28   20.76   23.61
HoLSTM    Concat        uni      2         87M       7.19   20.67   23.05
HoLSTM    Concat        bi       2         86M       7.04   21.15   23.53
BiLSTM    Concat        uni      2         87M       6.88   21.80   24.52   (best)
BiLSTM    Concat        bi       2         85M       6.87   21.33   24.37
NBOW      Gating        uni      2         85M       7.14   20.20   22.94
NBOW      Gating        bi       2         83M       6.92   21.16   23.52
BiLSTM    Gating        uni      2         87M       7.07   20.94   23.58
BiLSTM    Gating        bi       2         85M       7.11   21.33   24.05
Table 1: WMT’14, WMT’15 English-German results - We show perplexities (Ppl) on the development set and tokenized BLEU on the WMT’14 and WMT’15 test sets for various NMT systems, together with the settings of each system. In the uni/bi column, uni denotes a uni-directional encoder and bi a bi-directional encoder. The best baseline model and the best proposed model are marked (base) and (best); they are referred to as base (or baseline) and best in further experiments.

6 Experiments

We evaluate our model on three different language pairs: English-French (WMT’14), English-German (WMT’15), and English-Chinese (WMT’17), with English as the source side. For German and French, we use a combination of Europarl v7, Common Crawl, and News Commentary as the training set. For the development set, newstest2013 is used for German and newstest2012 is used for French. For Chinese, we use a combination of News Commentary v12 and the CWMT Corpus as the training set and hold out 2357 sentences as the development set. Translation performance is reported in case-sensitive BLEU on newstest2014 (2737 sentences) and newstest2015 (2169 sentences) for German, newstest2013 (3000 sentences) and newstest2014 (3003 sentences) for French, and newsdev2017 (2002 sentences) for Chinese (we use the development set as testing data because the official test set had not been released). Details about tokenization are as follows. For German, we use the tokenized dataset from Luong et al. (2015); for French, we use the Moses Koehn et al. (2007) tokenization script with the “-a” flag; for Chinese, we split sequences of Chinese characters, but keep sequences of non-Chinese characters as they are, using the script from IWSLT Evaluation 2015 (https://sites.google.com/site/iwsltevaluation2015/mt-track).
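The Chinese tokenization rule (split runs of Chinese characters into single characters, leave everything else intact) can be approximated with a short regular expression. This is an illustrative sketch, not the IWSLT script itself: it assumes that "Chinese character" means the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and that other tokens are already whitespace-separated.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block counts as Chinese.
_CJK_CHAR = re.compile(r"([\u4e00-\u9fff])")


def split_chinese_characters(line):
    """Put spaces around every Chinese character and return the token list."""
    spaced = _CJK_CHAR.sub(r" \1 ", line)
    return spaced.split()
```

For instance, `split_chinese_characters("乌干达总统 meets FM")` yields five single-character tokens followed by `meets` and `FM`.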

We compare our context-aware NMT systems with strong baseline models on each dataset.

System        BLEU
en→de         WMT’14   WMT’15
  baseline    21.05    23.83
  best        21.80    24.52
en→fr         WMT’13   WMT’14
  baseline    28.21    31.55
  best        28.77    32.39
en→zh         WMT’17
  baseline    24.07
  best        24.81
Table 2: Results on three different language pairs - The best proposed models (BiLSTM+Concat+uni) are significantly better than the baseline models according to paired bootstrap resampling Koehn (2004).
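For readers unfamiliar with the significance test named in the caption, the idea of paired bootstrap resampling can be sketched as below. This is a simplified illustration: Koehn (2004) recomputes corpus-level BLEU on each resampled test set, whereas this sketch assumes a per-sentence quality score for each system and compares the sums; the function name and defaults are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, num_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    scores_a, scores_b: per-sentence scores for the same sentences, same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(num_samples):
        # Draw a test set of the same size, with replacement, using the
        # same sentence indices for both systems (hence "paired").
        sample = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in sample) > sum(scores_b[i] for i in sample):
            wins += 1
    return wins / num_samples
```

If system A wins on, say, 99.9% of the resampled sets, the difference is commonly read as significant at roughly the 0.001 level.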

6.1 Training Details

We limit our vocabularies to be the top 50K most frequent words for both source and target language. Words not in these shortlisted vocabularies are converted into an unk token.
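A minimal sketch of this vocabulary shortlisting, assuming tokenized sentences given as lists of strings; the helper names and the literal token string `<unk>` are illustrative choices rather than details taken from the experimental setup.

```python
from collections import Counter


def build_vocab(sentences, max_size=50000):
    """Keep the `max_size` most frequent tokens of a tokenized corpus."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, _ in counts.most_common(max_size)}


def apply_vocab(sentence, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with the unknown-word token."""
    return [token if token in vocab else unk for token in sentence]
```

The same two steps would be applied separately to the source and target sides, each with its own shortlist.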

When training our NMT systems, following Bahdanau et al. (2015), we filter out sentence pairs whose lengths exceed 50 words and shuffle mini-batches as we proceed. We train our model with the following settings using SGD as our optimization method. (1) We start with a learning rate of 1 and begin to halve the learning rate every epoch once the model overfits, where we define overfitting to be when perplexity on the dev set of the current epoch is worse than that of the previous epoch. (2) We train until the model converges (i.e. the difference between the perplexity for the current epoch and the previous epoch is less than 0.01). (3) We batch instances with the same length, and our maximum mini-batch size is 256. (4) The normalized gradient is rescaled whenever its norm exceeds 5. (5) Dropout is applied between vertical RNN stacks with probability 0.3. Additionally, the context network is trained jointly with the encoder-decoder architecture. Our model is built upon OpenNMT Klein et al. (2017) with the default settings unless otherwise noted.
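The learning-rate schedule, stopping criterion, and gradient rescaling described above can be sketched independently of any toolkit. The functions below are hypothetical helpers, not OpenNMT code: `simulate_schedule` takes the sequence of development-set perplexities observed after each epoch and returns the learning rate used in each epoch, and `rescale_gradient` treats the gradient as a flat list of floats.

```python
import math


def simulate_schedule(dev_perplexities, initial_lr=1.0, tolerance=0.01):
    """Return the learning rate used in each epoch until convergence."""
    lr = initial_lr
    halving = False
    previous = None
    rates = []
    for ppl in dev_perplexities:
        rates.append(lr)  # rate used for the epoch that produced `ppl`
        if previous is not None:
            if abs(previous - ppl) < tolerance:
                break  # converged: perplexity changed by less than the tolerance
            if ppl > previous:
                halving = True  # overfitting: dev perplexity got worse
        if halving:
            lr /= 2.0  # once triggered, halve after every epoch
        previous = ppl
    return rates


def rescale_gradient(gradient, max_norm=5.0):
    """Scale the gradient down so that its L2 norm never exceeds `max_norm`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > max_norm:
        return [g * max_norm / norm for g in gradient]
    return list(gradient)
```

For dev perplexities of 10.0, 8.0, 8.5, 8.2, 8.195, the schedule trains three epochs at rate 1, then one each at 0.5 and 0.25, and stops because the last change is below 0.01.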

language   System     Homograph                                          All Words
                      F1               Precision        Recall           F1               Precision        Recall
en→de      baseline   0.401            0.422            0.382            0.547            0.569            0.526
           best       0.426 (+0.025)   0.449 (+0.027)   0.405 (+0.023)   0.553 (+0.006)   0.576 (+0.007)   0.532 (+0.006)
en→fr      baseline   0.467            0.484            0.451            0.605            0.623            0.587
           best       0.480 (+0.013)   0.496 (+0.012)   0.465 (+0.014)   0.613 (+0.008)   0.630 (+0.007)   0.596 (+0.009)
en→zh      baseline   0.578            0.587            0.570            0.573            0.605            0.544
           best       0.590 (+0.012)   0.599 (+0.012)   0.581 (+0.011)   0.581 (+0.008)   0.612 (+0.007)   0.552 (+0.008)
Table 3: Translation results for homographs and all words in our NMT vocabulary. We compare scores for the baseline and our best proposed model on three different language pairs. Improvements over the baseline are shown in parentheses. We performed bootstrap resampling 1000 times: our best model improved more on homographs than on all words in terms of F1, precision, and recall, and the difference is statistically significant across all measures.

6.2 Experimental Results

In this section, we compare our proposed context-aware NMT models with baseline models on the English-German dataset. Our baseline models are encoder-decoder models using global-general attention and input feeding on the decoder side as described in §2, varying the settings on the encoder side. Our proposed model builds upon the baseline models by concatenating or gating different types of context vectors. We use LSTMs for the encoder, decoder, and context network. The decoder is the same across baseline models and proposed models, having 500 hidden units. During testing, we use beam search with a beam size of 5. The dimension of the input word embedding is set to 500 across the encoder, decoder, and context network. Settings for the three different baselines are listed below.

Baseline 1:

A uni-directional LSTM with 500 hidden units and 2 layers of stacking LSTM.

Baseline 2:

A bi-directional LSTM with 250 hidden units and 2 layers of stacking LSTM. Each state is summarized by concatenating the hidden states of forward and backward encoder into 500 hidden units.

Baseline 3:

A bi-directional LSTM with 250 hidden units and 3 layers of stacking LSTM. This can be compared with the proposed method, which adds an extra layer of computation before the word embeddings, essentially adding an extra layer.

The context network uses the settings below.

NBOW:

Average word embedding of the input sequence.

BiLSTM:

A single-layer bi-directional LSTM with 250 hidden units. The context vector is represented by concatenating the hidden states of the forward and backward LSTMs into a 500-dimensional vector.

HoLSTM:

A single-layer uni-directional LSTM with 500 hidden units.

The results are shown in Table 1. The first thing we observe is that the best context-aware model (results in bold in the table) achieved improvements of around 0.7 BLEU on both WMT14 and WMT15 over the respective baseline methods with 2 layers. This is in contrast to simply using a 3-layer network, which actually degrades performance, perhaps because the vanishing gradient problem makes the deeper network more difficult to train.

Next, comparing different methods for incorporating context, we can see that BiLSTM performs best across all settings. HoLSTM performs slightly better than NBOW, and NBOW evidently suffers from having the same context vector for every word in the input sequence, failing to outperform the corresponding baselines. Comparing the two integration methods that incorporate context into word embeddings, both improve over the baseline with BiLSTM as the context network, and concatenating the context vector and the word embedding performs better than gating. Finally, in contrast to the baseline, it is not obvious whether a uni-directional or bi-directional encoder is better for our proposed models, particularly when BiLSTM is used as the context network. This is likely due to the fact that bi-directional information is already captured by the context network, and may not be necessary in the encoder itself.

We further compared the two systems on two different languages, French and Chinese. We achieved 0.5-0.8 BLEU improvement, showing our proposed models are stable and consistent across different language pairs. The results are shown in Table  2.

To show that our 3-layer models are properly trained, we ran a 3-layer bi-directional encoder with residual connections on En-Fr and got 27.45 for WMT13 and 30.60 for WMT14, which is similarly lower than the two-layer result. It should be noted that previous work such as Britz et al. (2017) has also noted that the gains for encoders beyond two layers are minimal.

6.3 Targeted Analysis

In order to examine whether our proposed model can better translate words with multiple senses, we evaluate our context-aware model on a list of homographs extracted from Wikipedia (https://en.wikipedia.org/wiki/List_of_English_homographs), compared to the baseline model on three different language pairs. For the baseline model, we choose the best-performing model, as described in §6.2.

To do so, we first acquire the translation of homographs in the source language using fast-align Dyer et al. (2013). We run fast-align on all the parallel corpora, including training data and testing data (the reference translation and all the system-generated translations), because the unsupervised nature of the algorithm requires a large amount of training data to obtain accurate alignments. The settings follow the default command on the fast-align GitHub page, including the heuristics for combining forward and backward alignments. Since there might be multiple aligned words in the target language for a given word in the source language, we treat a match between the aligned translation of a targeted word in the reference and its aligned translation in a given model output as a true positive, use F1, precision, and recall as our metrics, and take the micro-average across all the sentence pairs.
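Micro-averaging here means pooling true positives, false positives, and false negatives over all occurrences before computing the metrics, rather than averaging per-occurrence scores. A minimal sketch, assuming each occurrence is given as a pair of aligned-word lists (reference side, system side); the function name is hypothetical and this is not the authors' evaluation script.

```python
def micro_prf(instances):
    """Micro-averaged (precision, recall, F1) over (reference, hypothesis) pairs."""
    tp = fp = fn = 0
    for reference_words, hypothesis_words in instances:
        ref, hyp = set(reference_words), set(hypothesis_words)
        tp += len(ref & hyp)   # aligned words the system got right
        fp += len(hyp - ref)   # system words absent from the reference
        fn += len(ref - hyp)   # reference words the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```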

(The link to the evaluation script –)
We calculated the scores for the 50000 words/characters from our source vocabulary using only English words. The results are shown in Table 3. The table shows two interesting results: (1) The score for the homographs is lower than the score obtained from all the words in the vocabulary. This shows that words with more meanings are harder to translate, with Chinese as the only exception (one potential explanation for Chinese is that, because the Chinese results are generated on the character level, the automatic alignment process was less accurate). (2) The improvement of our proposed model over the baseline model is larger on the homographs than on all the words in the vocabulary. This shows that although our context-aware model is better overall, the improvements are particularly focused on words with multiple senses, which matches the intuition behind the design of the model.

6.4 Qualitative Analysis

We show sample translations on the English-Chinese WMT’17 dataset in Table 4 with three kinds of examples. We highlight the English homograph in bold, correctly translated words in blue, and wrongly translated words in red. (1) Target homographs are translated into the correct sense with the help of the context network. For the first sample translation, “meets” is correctly translated to “会见” by our model, and wrongly translated to “符合” by the baseline model. In fact, “会见” is closer to the definition “come together intentionally” and “符合” is closer to “satisfy” in the English dictionary. (2) Target homographs are translated into different but similar senses by both models in the fourth example. The models translate the word “believed” to the common translations “被认为” and “相信”, and both of these meanings are close to the reference translation “据信”. (3) The target homograph is translated into the wrong sense by the baseline model, but is not translated at all by our model in the fifth example.

English-Chinese Translations
src Ugandan president meets Chinese FM , anticipates closer cooperation
ref 乌 干 达 总 统 会 见 中 国 外 长 , 期 待 增 进 合 作 (come together intentionally)
best 乌 干 达 总 统 会 见 中 国 调 频 , 预 期 更 密 切 合 作 (come together intentionally)
base 乌 干 达 总 统 符 合 中 国 调 频 , 预 期 更 加 合 作 (satisfy)
src Investigators are trying to establish whether Kermiche and Petitjean had accomplices in France and
whether they had links with Islamic State , which has claimed responsibility for the attack .
ref 调 查 人 员 正 试 图 确 定 克 尔 米 奇 和 帕 迪 让 在 法 国 是 否 有 同 谋 , 以 及 是 否 与 伊 斯 兰 国 武 装 分 子 有 联 系,
伊 斯 兰 国 武 装 分 子 声 称 对 此 次 袭 击 负 责 。 (get proof of something)
best 调 查 人 员 正 试 图 确 定 Kermiche 和 Petitjean 在 法 国 是 否 有 同 谋 , 他 们 是 否 与 伊 斯 兰 国 有 联 系 ,
声 称 对 这次 袭 击 负 责 。 (get proof of something)
base 调 查 人 员 正 在 努 力 建 立 法 国 的 同 谋 和 他 们 是 否 与 伊 斯 兰 国 有 联 系 , 该 国 声 称 对 这 次 袭 击 负 有 责 任 。 (to start)
src The decrease of transaction settlement fund balance in the securities market in July was smaller than that in June ,
while the net bank @-@ securities transfers stood at negative RMB 66.6 billion .
ref 7 月 证 券 市 场 交 易 结 算 资 金 余 额 减 少 额 较 6 月 大 幅 降 低 , 银 证 转 账 变 动 净 额 为 - 666 亿 元 。 (money left)
best 7 月 份 证 券 市 场 交 易 结 算 资 金 余 额 的 减 少 小 于 6 月 份 , 而 银 行 证 券 转 让 净 额 为 negative 亿 元 。 (money left)
base 七 月 证 券 市 场 交 易 结 算 基 金 平 衡 的 减 少 比 六 月 份 小 , 而 净 银 行 证 券 转 让 则 为 负 元 。 (equal weight or force)
src Initial reports suggest that the gunman may have shot a woman , believed to be his ex @-@ partner .
ref 据 初 步 报 告 显 示 , 开 枪 者 可 能 击 中 一 名 妇 女 , 据 信 是 他 的 前 搭 档 。 (been accepted as truth)
best 初 步 的 报 道 表 明 , 枪 手 可 能 已 经 射 杀 了 一 个 女 人 , 被 认 为 是 他 的 前 伙 伴 。 (been known as)
base 最 初 的 报 道 显 示 , 枪 手 可 能 已 经 射 杀 了 一 名 妇 女 , 相 信 他 是 他 的 前 伙 伴 。 (accept as truth)
src When the game came to the last 3 ’ 49 ’ ’ , Nigeria closed to 79 @-@ 81 after Aminu added a layup .
ref 比 赛 还 有 3 分 49 秒 时 , 阿 米 努 上 篮 得 手 后 , 尼 日 利 亚 将 比 分 追 成 了 79-81 。 (narrow)
best 当 这 场 比 赛 到 了 最 后 三 个 “ 49 ” 时 , 尼 日 利 亚 在 Aminu 增 加 了 一 个 layup 之 后 MISSING TRANSLATION
base 当 游 戏 到 达 最 后 3 “ 49 ” 时 , 尼 日 利 亚 已 经 关 闭 了 Aminu 。 (end)
Table 4: Sample translations - for each example, we show the sentence in the source language (src), the human-translated reference (ref), the translation generated by our best context-aware model (best), and the translation generated by the baseline model (base). We also highlight the word with multiple senses in the source language in bold, the corresponding correctly translated words in blue, and wrongly translated words in red. The definitions of words in blue or red are in parentheses.

7 Related Work

Word sense disambiguation (WSD), the task of determining the correct meaning or sense of a word in context, is a long-standing task in NLP Yarowsky (1995); Ng and Lee (1996); Mihalcea and Faruque (2004); Navigli (2009); Zhong and Ng (2010); Di Marco and Navigli (2013); Chen et al. (2014); Camacho-Collados et al. (2015). Recent research on tackling WSD and capturing multiple senses includes work leveraging LSTMs Kågebäck and Salomonsson (2016); Yuan et al. (2016), which we extended as a context network in our paper, and work predicting senses with word embeddings that capture context. Šuster et al. (2016) and Kawakami and Dyer (2016) also showed that bilingual data improves WSD. In contrast to the standard WSD formulation, Vickrey et al. (2005) reformulated the task of WSD for Statistical Machine Translation (SMT) as predicting possible target translations, which directly improves the accuracy of machine translation. Following this reformulation, Chan et al. (2007) and Carpuat and Wu (2007a, 2007b) integrated WSD systems into phrase-based systems. Xiong and Zhang (2014) break the process into two stages: the first predicts the sense of the ambiguous source word, and the predicted word senses, together with other context features, are then used to predict possible target translations. Within the framework of neural MT, there is work with a similar motivation to ours. Choi et al. (2017) leverage the NBOW as context and gate the word embedding on both the encoder and decoder side. However, their work does not distinguish context vectors for words in the same sequence, in contrast to the method in this paper, and our results demonstrate that this is an important feature of methods that handle homographs in NMT. In addition, our quantitative analysis of the problems that homographs pose to NMT and our evaluation of how context-aware models fix them were not covered in this previous work. Rios et al. (2017) tackled the problem by adding sense embeddings learned from an additional corpus and evaluated performance at the sentence level with contrastive translations.

8 Conclusion

Theoretically, NMT systems should be able to handle homographs if the encoder captures the clues to translate them correctly. In this paper, we empirically show that this may not be the case; the performance of word-level translation degrades as the number of senses for each word increases. We hypothesize that this is due to the fact that each word is mapped to a single word vector regardless of the context in which it appears, and propose to integrate methods from neural WSD systems into an NMT system to alleviate this problem. We concatenated the context vector computed from the context network with the word embedding to form a context-aware word embedding, successfully improving the NMT system. We evaluated our model on three different language pairs and outperformed a strong baseline model according to BLEU score in all of them. We further evaluated our results targeting the translation of homographs, and our model performed better in terms of F1 score.

While the architectures proposed in this work do not solve the problem of homographs, our empirical results in Table 3 demonstrate that they do yield improvements (larger than those on other varieties of words). We hope that this paper will spark discussion on the topic, and future work will propose even more focused architectures.


  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • Bentivogli et al. (2016) Luisa Bentivogli, Arianna Bisazza, Mauro Cettolo, and Marcello Federico. 2016. Neural versus phrase-based machine translation quality: a case study. In EMNLP. pages 257–267.
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
  • Britz et al. (2017) Denny Britz, Anna Goldie, Thang Luong, and Quoc Le. 2017. Massive exploration of neural machine translation architectures. arXiv:1703.03906 .
  • Camacho-Collados et al. (2015) José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. A unified multilingual semantic representation of concepts. In ACL. pages 741–751.
  • Carpuat and Wu (2007a) Marine Carpuat and Dekai Wu. 2007a. How phrase sense disambiguation outperforms word sense disambiguation for statistical machine translation. TMI pages 43–52.
  • Carpuat and Wu (2007b) Marine Carpuat and Dekai Wu. 2007b. Improving statistical machine translation using word sense disambiguation. In EMNLP-CoNLL. pages 61–72.
  • Chan et al. (2007) Yee Seng Chan, Hwee Tou Ng, and David Chiang. 2007. Word sense disambiguation improves statistical machine translation. In ACL. volume 45, pages 33–40.
  • Chen et al. (2014) Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In EMNLP. pages 1025–1035.
  • Choi et al. (2017) Heeyoul Choi, Kyunghyun Cho, and Yoshua Bengio. 2017. Context-dependent word representation for neural machine translation. arXiv:1607.00578.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555.
  • Di Marco and Navigli (2013) Antonio Di Marco and Roberto Navigli. 2013. Clustering and diversifying web search results with graph-based word sense induction. Computational Linguistics 39(3):709–754.
  • Dyer et al. (2013) Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In NAACL-HLT. pages 644–648.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • Iyyer et al. (2015) Mohit Iyyer, Varun Manjunatha, Jordan L Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In ACL. pages 1681–1691.
  • Kågebäck and Salomonsson (2016) Mikael Kågebäck and Hans Salomonsson. 2016. Word sense disambiguation using a bidirectional LSTM. COLING 2016 page 51.
  • Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. ACL pages 212–217.
  • Kawakami and Dyer (2016) Kazuya Kawakami and Chris Dyer. 2016. Learning to represent words in context with multilingual supervision. ICLR workshop.
  • Klein et al. (2017) G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. arXiv:1701.02810.
  • Koehn (2004) Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In EMNLP. pages 388–395.
  • Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In ACL. pages 177–180.
  • Koehn et al. (2003) Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In NAACL. pages 48–54.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. EMNLP pages 1412–1421.
  • Mihalcea and Faruque (2004) Rada Mihalcea and Ehsanul Faruque. 2004. Senselearner: Minimally supervised word sense disambiguation for all words in open text. In ACL/SIGLEX. volume 3, pages 155–158.
  • Navigli (2009) Roberto Navigli. 2009. Word sense disambiguation: A survey. ACM Computing Surveys (CSUR) 41(2):10.
  • Neubig et al. (2015) Graham Neubig, Makoto Morishita, and Satoshi Nakamura. 2015. Neural reranking improves subjective quality of machine translation: NAIST at WAT2015. WAT.
  • Ng and Lee (1996) Hwee Tou Ng and Hian Beng Lee. 1996. Integrating multiple knowledge sources to disambiguate word sense: An exemplar-based approach. In ACL. pages 40–47.
  • Pu et al. (2017) Xiao Pu, Nikolaos Pappas, and Andrei Popescu-Belis. 2017. Sense-aware statistical machine translation using adaptive context-dependent clustering. In WMT. pages 1–10.
  • Rios et al. (2017) Annette Rios, Laura Mascarell, and Rico Sennrich. 2017. Improving word sense disambiguation in neural machine translation with sense embeddings. WMT 2017 page 11.
  • Šuster et al. (2016) Simon Šuster, Ivan Titov, and Gertjan van Noord. 2016. Bilingual learning of multi-sense embeddings with discrete autoencoders. NAACL-HLT pages 1346–1356.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NIPS. pages 3104–3112.
  • Vickrey et al. (2005) David Vickrey, Luke Biewald, Marc Teyssier, and Daphne Koller. 2005. Word-sense disambiguation for machine translation. In HLT-EMNLP. pages 771–778.
  • Xiong and Zhang (2014) Deyi Xiong and Min Zhang. 2014. A sense-based translation model for statistical machine translation. In ACL. pages 1459–1469.
  • Yarowsky (1995) David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In ACL. pages 189–196.
  • Yuan et al. (2016) Dayu Yuan, Julian Richardson, Ryan Doherty, Colin Evans, and Eric Altendorf. 2016. Semi-supervised word sense disambiguation with neural models. arXiv:1603.07012.
  • Zhong and Ng (2010) Zhi Zhong and Hwee Tou Ng. 2010. It makes sense: A wide-coverage word sense disambiguation system for free text. In ACL. pages 78–83.