With the introduction of deep neural networks to applications in machine translation, more fluent outputs have been achieved with neural machine translation (NMT) than with statistical machine translation. However, a fluent NMT output requires a large parallel corpus, which is difficult to prepare. Therefore, several studies have attempted to improve fluency in NMT without the use of a large parallel corpus.
To overcome the data-acquisition bottleneck, the use of a monolingual corpus has been explored. A monolingual corpus can be collected relatively easily and is known to improve statistical machine translation [2]. Various attempts to employ a monolingual corpus include the following: pre-training of a translation model [12], initialization of distributed word representations [4, 11], and construction of a pseudo-parallel corpus by back-translation [14].
Here, we focus on a language modeling approach [3, 16]. Although recent NMT models tend to output fluent sentences, it is difficult for them to reflect the linguistic properties of the target language, as only the source information is taken into consideration when performing translation. Additionally, language models are useful in that they contain target information that results in fluent output, and they can make predictions even without access to the source sentence. In previous works utilizing a language model for NMT, both the language model and the conventional translation model are prepared, and the final translation is performed by weighting both models. In the Shallow Fusion mechanism [3], the outputs of the translation and language models are weighted at a fixed ratio. In the Cold Fusion mechanism [15], a gate function is created that dynamically determines the weight of the language model in consideration of the translation model. In the Simple Fusion mechanism [16], the outputs of both models are treated equally, while the normalization steps vary.
In this research, we propose a “Dynamic Fusion” mechanism that predicts output words by attending to the language model. We hypothesize that each model should make predictions according to only the information available to the model itself; the information available to the translation model should not be referenced before prediction. In the proposed mechanism, a translation model is fused with a language model through the incorporation of word-prediction probability according to the attention. However, the models retain predictions independent of one another. Based on the weight of the attention, we analyze the predictivity of the language model and its influence on translation.
The main contributions of this paper are as follows:
We propose an attentional language model that effectively introduces a language model to NMT.
We show that fluent and adequate output can be achieved with a language model in English–Japanese translation.
We show that Dynamic Fusion significantly improves translation accuracy in a realistic setting.
We analyze Dynamic Fusion's ability to improve translation with respect to the weight of the attention.
2 Previous works
2.1 Shallow Fusion
Gulcehre et al. [3] proposed Shallow Fusion, which translates a source sentence according to the predictions of both a translation model and a language model. In this mechanism, a monolingual corpus is used to learn the language model in advance. The translation model is improved through the introduction of the knowledge of the target language.
In Shallow Fusion, a target word is predicted as follows:

$$\hat{y}_t = \operatorname*{argmax}_{y_t} \left( \log p_{TM}(y_t \mid x, y_{<t}) + \beta \log p_{LM}(y_t \mid y_{<t}) \right)$$

where $x$ is an input sentence of the source language, $p_{TM}(y_t \mid x, y_{<t})$ is the word-prediction probability according to the translation model, and $p_{LM}(y_t \mid y_{<t})$ is the word-prediction probability according to the language model. Here, $\beta$ is a manually determined hyper-parameter that determines the rate at which the language model is considered.
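As a minimal sketch of this log-linear combination (the toy distributions, word strings, and dict-based vocabulary below are illustrative, not from the paper):

```python
import math

def shallow_fusion(p_tm, p_lm, beta):
    """Pick the word maximizing log p_TM(w) + beta * log p_LM(w).

    p_tm, p_lm: word -> probability dicts over the same vocabulary.
    beta: fixed hyper-parameter weighting the language model.
    """
    scores = {w: math.log(p_tm[w]) + beta * math.log(p_lm[w]) for w in p_tm}
    return max(scores, key=scores.get)

# Toy next-word distributions: the TM slightly prefers "dose", the LM "rate".
p_tm = {"dose": 0.5, "rate": 0.3, "the": 0.2}
p_lm = {"dose": 0.1, "rate": 0.7, "the": 0.2}
```

With `beta = 0` the language model is ignored and the translation model's choice wins; increasing `beta` lets the language model override it at a fixed, global rate.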
2.2 Cold Fusion
In addition to Shallow Fusion, Gulcehre et al. [3] proposed Deep Fusion as a mechanism that could simultaneously learn a translation model and a language model. Sriram et al. [15] extended Deep Fusion to Cold Fusion, which passes information from the translation model into the language model's prediction.
In this mechanism, a gating function is introduced that dynamically determines the weight, taking into consideration both a translation model and a language model. Therein, the language model predicts target words by using information from the translation model. Accuracy and fluency are improved through the joint learning of both models.
In Cold Fusion, a target word is predicted as follows:

$$h^{LM}_t = \mathrm{DNN}\left(l^{LM}_t\right)$$
$$g_t = \sigma\left(W \left[l^{TM}_t ; h^{LM}_t\right] + b\right)$$
$$s^{CF}_t = \left[l^{TM}_t ; g_t \circ h^{LM}_t\right]$$
$$p(y_t \mid x, y_{<t}) = \operatorname{softmax}\left(\mathrm{DNN}\left(s^{CF}_t\right)\right)$$

where $l^{TM}_t$ and $l^{LM}_t$ are the word-prediction logits (a logit is the output of the probability projection layer before the softmax) of the translation model and the language model, respectively; $g_t$ is a gating function that determines the rate at which the language model is considered; $W$, $b$, and the DNN parameters are the weights of the neural networks; and $[a ; b]$ is the concatenation of vectors $a$ and $b$.
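A minimal sketch of the gating idea, assuming a scalar gate for brevity (the actual Cold Fusion gate is a learned, fine-grained vector gate; `w` and `b` below are toy weights):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_lm_feature(l_tm, h_lm, w, b):
    """Compute g = sigmoid(w . [l_tm; h_lm] + b) and return the gated
    LM feature g * h_lm, i.e. how much of the LM the decoder sees."""
    concat = l_tm + h_lm  # list concatenation plays the role of [a; b]
    g = sigmoid(sum(wi * xi for wi, xi in zip(w, concat)) + b)
    return [g * x for x in h_lm]
```

The key property is that the gate is conditioned on the translation model's state, so the mix-in ratio changes from token to token rather than being fixed.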
2.3 Simple Fusion
Stahlberg et al. [16] proposed Simple Fusion, which simplifies Cold Fusion. Unlike Cold Fusion, Simple Fusion does not use the translation model to predict the words output by the language model. Simple Fusion comes in two variants, PostNorm and PreNorm:

$$\text{PostNorm:}\quad p(y_t \mid x, y_{<t}) = \operatorname{softmax}\left( \operatorname{softmax}(S_t) \cdot p_{LM}(y_t \mid y_{<t}) \right)$$
$$\text{PreNorm:}\quad p(y_t \mid x, y_{<t}) = \operatorname{softmax}\left( S_t + \log p_{LM}(y_t \mid y_{<t}) \right)$$

where $S_t$ denotes the word-prediction logits of the translation model and $p_{LM}(y_t \mid y_{<t})$ denotes the word-prediction probability according to the language model.
In PostNorm, the output probability of the language model is multiplied by the output probability of the translation model, wherein both models are treated according to the same scale.
In PreNorm, the log probability of the language model and the unnormalized prediction of the translation model are summed, wherein the language and translation models are treated with different scales.
Though the Simple Fusion model is relatively simple, it achieves a higher BLEU score compared to other methods that utilize language models.
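The two normalization schemes can be sketched as follows (toy logits and probabilities; note that, viewed as pure functions of fixed model outputs, both variants define the same product-of-experts distribution, and the practical difference lies in how the translation model is trained):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def post_norm(tm_logits, lm_probs):
    """PostNorm: normalize the TM first, then renormalize the product of
    both models' probabilities (both models on the same scale)."""
    tm_probs = softmax(tm_logits)
    prod = [p * q for p, q in zip(tm_probs, lm_probs)]
    z = sum(prod)
    return [p / z for p in prod]

def pre_norm(tm_logits, lm_probs):
    """PreNorm: add log LM probabilities to the unnormalized TM logits,
    then apply a single softmax (the two models on different scales)."""
    return softmax([s + math.log(q) for s, q in zip(tm_logits, lm_probs)])
```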
3 Dynamic Fusion
An attentional language model called “Dynamic Fusion” is proposed in this paper. In the Shallow Fusion and Simple Fusion mechanisms, information from the language model is considered with fixed weights. However, translation requires that source information be retained, so the consideration ratio should be adjusted from token to token; both models should not be mixed with fixed weights. The Cold Fusion mechanism dynamically determines the weights of the mix-in; however, Cold Fusion passes information from the translation model to the language model before prediction, and the language model thus does not make its own prediction.
Furthermore, in previous research, the vocabularies of the translation model and the language model had to be identical because the final softmax operation is performed over the word-vocabulary dimension. However, since the proposed mechanism mixes in the language model through attention, the two vocabularies do not have to be fully consistent, and different word-segmentation strategies and subword units can be used. Therefore, the proposed mechanism allows the use of a language model prepared in advance.
In the proposed mechanism, the language model serves as auxiliary information for prediction. Thus, the language model is utilized independently of the translation model. Unlike Cold Fusion, this method uses a language model’s prediction score multiplied by word attention.
First, the word-prediction probability of the language model is represented as follows:

$$p_{LM}(y_t \mid y_{<t}) = \operatorname{softmax}\left(l^{LM}_t\right)$$

Next, the hidden layers of the translation model attending to the language model are represented as follows:

$$a_t = \sum_{w \in V_{LM}} p_{LM}(w \mid y_{<t})\, e(w)$$
$$\tilde{s}_t = \tanh\left(W_c \left[s_t ; a_t\right]\right)$$

where $e(w)$ is the embedding of word $w$, $a_t$ is the word attention over the language model's vocabulary $V_{LM}$, $\tilde{s}_t$ is the word attention's hidden state for the proposed Dynamic Fusion, $s_t$ is the decoder's hidden state, and $W_c$ is a weight matrix of the neural networks. In the computation of $\tilde{s}_t$, the language model is considered by multiplying each word embedding by its attention weight. In this mechanism, the prediction of the language model only has access to the target information up to the word currently being predicted. Additionally, the language model and the translation model can be kept independent by using the conventional attention mechanism.

Finally, a target word is predicted as follows:

$$p(y_t \mid x, y_{<t}) = \operatorname{softmax}\left(W_s \tilde{s}_t\right)$$

where $W_s$ is the output projection matrix.
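The expected-embedding attention described above can be sketched as follows (all names, dimensions, and weights are illustrative toy values; a real implementation would use a tensor library):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def dynamic_fusion_state(dec_state, lm_logits, embeddings, W):
    """Attend to the LM: weight each vocabulary word's embedding by the
    LM's predicted probability, then fuse the resulting vector with the
    decoder state through a learned projection and tanh."""
    p_lm = softmax(lm_logits)
    dim = len(embeddings[0])
    # expected embedding under the LM distribution (the word attention)
    a_t = [sum(p * e[d] for p, e in zip(p_lm, embeddings)) for d in range(dim)]
    concat = dec_state + a_t  # [s_t ; a_t]
    fused = [sum(wr[i] * concat[i] for i in range(len(concat))) for wr in W]
    return [math.tanh(x) for x in fused]
```

Because the language model enters only through this attention vector, its vocabulary and segmentation need not match the translation model's.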
A diagram of this mechanism is shown in Figure 1, wherein the language model is used for the translation mechanism by considering the attention obtained from both the translation model and language model.
The training procedure of the proposed mechanism follows that of Simple Fusion and is performed as follows:
A language model is trained with a monolingual corpus.
The translation model and word attention to the language model are learned by fixing the parameters of the language model.
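The second stage amounts to excluding the language model's parameters from the update. A toy sketch of one such update step (parameter names and values are illustrative):

```python
def sgd_step(params, grads, lr, frozen):
    """One SGD update that skips parameters whose names are frozen.
    Mirrors stage 2 of the training procedure: the LM is pre-trained,
    then only TM and attention weights receive gradient updates."""
    return {
        name: (value if name in frozen else value - lr * grads[name])
        for name, value in params.items()
    }

params = {"tm.weight": 1.0, "attn.weight": 0.5, "lm.weight": 2.0}
grads = {"tm.weight": 0.2, "attn.weight": 0.4, "lm.weight": 0.6}
updated = sgd_step(params, grads, lr=0.1, frozen={"lm.weight"})
```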
|Pre-training epoch|| |
|Maximum training epoch||100 epochs|
|Vocabulary size (w/o BPE)||30,000|
|# BPE operations||16,000|
Here, the conventional attentional NMT [1, 6] and Simple Fusion models (PostNorm, PreNorm) were prepared as baselines for comparison with the proposed Dynamic Fusion model. We performed English-to-Japanese translation. Translation performance was evaluated by taking the average of two runs with BLEU [10] and the Rank-based Intuitive Bilingual Evaluation Score (RIBES) [5]. In addition, a statistical significance test was performed using Travatar (http://www.phontron.com/travatar/evaluation.html) with 10,000 bootstrap resamples. We performed an additional experiment on Japanese-to-English translation; the settings are the same as in English-to-Japanese translation, except that we conducted the experiment only once and did not perform a statistical significance test.
The experiment uses two types of corpora: one for the translation model and the other for the language model. Training data of the Asian Scientific Paper Excerpt Corpus (ASPEC) [9] are divided into two parts: a parallel corpus and a monolingual corpus. The parallel corpus, for translation, is composed of one million sentence pairs with high confidence of sentence alignment from the training data. The monolingual corpus, for the language model, is composed of two million sentences from the target side of the training data that are not used in the parallel corpus. Japanese sentences were tokenized by the morphological analyzer MeCab (https://github.com/taku910/mecab, IPADic), and English sentences were preprocessed by Moses (http://www.statmt.org/moses/; tokenizer, truecaser). We used the development and evaluation sets from the official partitioning of ASPEC, as summarized in Table 2; sentences with more than 60 tokens were excluded from training. The vocabulary is determined using only the parallel corpus; consequently, words appearing only in the monolingual corpus are treated as unknown words at test time, even if they frequently appear in the monolingual corpus used to train the language model. Additionally, experiments were conducted with and without Byte Pair Encoding (BPE), performed on the source side and target side separately.
The in-house implementation [8] of the NMT model proposed by Bahdanau et al. [1] and Luong et al. [6] is used as the baseline model; all other methods were built on this baseline. For comparison, settings are unified across all experiments (Table 2). In the pre-training process, only the language model is learned; the baseline performs no pre-training, as it does not use a language model.
|Vocabulary||TM||w/o BPE||w/ BPE||w/ BPE|
|LM||w/o BPE||w/ BPE||w/o BPE|
5.1 Quantitative analysis
The BLEU and RIBES results are listed in Table 3 (English–Japanese) and Table 4 (Japanese–English). For both metrics, we observed similar tendencies with and without BPE. Compared with the baseline and the Simple Fusion models, Dynamic Fusion yielded improved BLEU and RIBES scores. Between the baseline and Simple Fusion, PreNorm improved over the baseline, but PostNorm was equal or worse. Compared with PreNorm, Dynamic Fusion improved both BLEU and RIBES. Accordingly, the improvement of the proposed method is notable, and the use of attention yields better scores.
In the English–Japanese translation, it was also confirmed that BLEU and RIBES were improved by using a language model. RIBES was improved for the translation with Dynamic Fusion, suggesting that the proposed approach outputs adequate sentences.
The proposed method shows statistically significant differences (p < 0.05) in BLEU and RIBES compared to the baseline. There was no significant difference between the baseline and Simple Fusion, nor between Simple Fusion and the proposed method.
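The bootstrap test can be sketched as follows. For simplicity, this version resamples per-sentence scores and compares their sums; tools such as Travatar instead recompute the corpus-level metric on each resample:

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10000, seed=0):
    """Paired bootstrap resampling over per-sentence scores: the fraction
    of resamples in which system A fails to beat system B approximates
    the p-value for the claim 'A is better than B'."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return 1.0 - wins / n_resamples
```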
In addition, we conducted additional experiments in a more realistic setting: the translation model used BPE, whereas the language model was trained on a raw corpus without BPE. (We did not run Simple Fusion here because it requires the vocabularies of the language model and the translation model to be identical.) The translation scores improved compared to the baseline model with BPE.
5.2 Qualitative analysis
In Table 5, compared with the baseline, PreNorm and Dynamic Fusion produce more fluent translations. Additionally, it can be seen that the attentional language model provides a more natural translation of the inanimate subject in the source sentence. Unlike in English, inanimate subjects are not often used in Japanese; thus, literal translations of an inanimate subject sound unnatural to native Japanese speakers. However, PostNorm translates “線量 (dose)” into “用量 (capacity)”, which reduces adequacy.
PreNorm in Table 6 appears to be a plain and fluent output. However, neither of the Simple Fusion models translates the source sentence as correctly as the baseline. In contrast, with Dynamic Fusion, the content of the source sentence is translated more accurately than in the reference translation; thus, Dynamic Fusion maintains the same level of fluency without loss of adequacy.
This shows that the use of a language model contributes to the improvement of output fluency. Additionally, Dynamic Fusion maintains relatively superior adequacy.
In Japanese–English translation, not only the proposed method but also the other language-model-based methods can cope with voice changes and inversion, as in Table 7. Using the active voice in Japanese where the English counterpart uses the passive voice is a common style in Japanese papers [18], and this example shows an improvement from using a language model.
5.3 Influence of language model
Table 8 shows an example wherein the language model compensates for adequacy. In general, if a spelling error exists in the source sentence, a proper translation may not be produced owing to the resulting unknown word. In this example, the word “temperature” is misspelled as “temperture.” The baseline model thus translates the surrounding part but ignores the misspelled word. However, PreNorm and Dynamic Fusion complete the corresponding part appropriately thanks to the language model; the proposed method was able to translate without losing adequacy. This result is attributed to the language model's ability to predict a fluent sentence.
|Source||responding to these changes DERS can compute new dose rate .|
|Reference||DERS は これら の 変化 に 対応 し て 新た な 線量 率 を 計算 できる 。|
|Baseline||これら の 変化 に 対応 する 応答 は , 新しい 線量 率 を 計算 できる 。|
|(Responses corresponding to these changes can calculate new dose rates.)|
|Simple Fusion (PostNorm)||これら の 変化 に 対応 する 応答 は 新しい 用量 率 を 計算 できる 。|
|(Responses corresponding to these changes can calculate new capacity rates.)|
|Simple Fusion (PreNorm)||これら の 変化 に 対応 する と , 新しい 線量 率 を 計算 できる 。|
|(In response to these changes, new dose rates can be calculated.)|
|Dynamic Fusion||これら の 変化 に 対応 する こと により , 新しい 線量 率 を 計算 できる 。|
|(By responding to these changes, new dose rates can be calculated.)|
|Reference||磁場 は 管 軸 に 直角 か 平行 逆 方向 に 加え た 。|
|Baseline||磁場 は 右 角 または 平行 ( 流れ ) の 方向 に 与え られ , 管 軸 に 平行 で ある 。|
|Simple Fusion (PostNorm)||磁場 は 右 角度 または 平行 ( 流れ に 逆 に 逆 ) 方向 に 与え られ た 。|
|Simple Fusion (PreNorm)||磁場 は 右 角 または 平行 ( 流れ に 逆 方向 ) の 方向 に 与え られ た 。|
|Dynamic Fusion||磁場 は , 管 軸 に 直角 または 平行 ( 流れ に 逆 方向 ) の 方向 に 与え られる 。|
5.4 Influence of Dynamic Fusion
Excerpts from the output of Dynamic Fusion and word attention (top 5 words) are presented in Table 9.
Except for the first token (the language model cannot predict the first token correctly because it only sees <BOS>), the word attention includes the most likely outputs. For example, if a start bracket (「) is present in the sentence, there is a tendency to close it with an end bracket (」). Additionally, it is not desirable to close the brackets immediately after “発電 (power generation)”; therefore, the model predicts that the subsequent word is “所 (plant)”. This indicates that the attentional language model can improve fluency while maintaining the source information.
Regarding attention weights, there are cases in which only certain words have highly skewed attention weights, among other cases in which multiple words have uniform attention weights. The latter occurs when there are many translation options, such as the generation of function words on the target side. This topic requires further investigation.
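One way to quantify this skewness is the entropy of the attention distribution; this is a hypothetical diagnostic, not an analysis performed in the paper:

```python
import math

def entropy(probs):
    """Shannon entropy in nats: low for skewed attention concentrated on
    one word, high for uniform attention over many translation options."""
    return -sum(p * math.log(p) for p in probs if p > 0)

skewed = [0.97, 0.01, 0.01, 0.01]    # one word dominates
uniform = [0.25, 0.25, 0.25, 0.25]   # many options equally weighted
```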
In contrast, it is extremely rare for Dynamic Fusion to gain fluency at the expense of adequacy. Even if a particular word has a significantly higher attention weight than the other words, the prediction of the translation model is likely to be used for the output if the language model's suggestion would change the meaning of the source sentence. In fact, the example in Table 9 contains many tokens for which the output of the language model is not considered, including at the beginning of the sentence.
One reason for this is the difference in contributions between the translation model and the language model. We decomposed the transformation weight matrix that combines the decoder's hidden state with the word attention into its translation-model and language-model parts, and we calculated the Frobenius norm of each part. The result reveals that the translation model contributes about twice as much as the language model.
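Such a decomposition can be sketched as follows (the matrix values and the split point are toy choices; the real matrix comes from the trained model):

```python
import math

def frobenius(matrix):
    """Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))

# A fused weight matrix whose columns act on [decoder state; word attention]:
# split it column-wise into a TM part and an LM (attention) part.
W = [[3.0, 0.0, 1.0, 0.0],
     [0.0, 4.0, 0.0, 2.0]]
split = 2  # first `split` columns multiply the decoder state
W_tm = [row[:split] for row in W]
W_lm = [row[split:] for row in W]
ratio = frobenius(W_tm) / frobenius(W_lm)  # relative contribution of the TM
```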
|Source||details of dose rate of ” Fugen Power Plant ” can be calculated by using <unk> software .|
|Reference||<unk> ソフトウエア を 用い て 「 ふ げん 発電 所 」 の 線量 率 を 詳細 に 計算 できる 。|
|Dynamic Fusion||「 ふ げん 発電 所 」 の 線量 率 の 詳細 を , <unk> ソフトウェア を 用い て 計算 できる 。|
|(The details of the dose rate of ”Fugen power plant”, can be calculated by using the <unk> software.)|
|「||ふ (Fu)||げん (gen)||発電 (Power)||所 (Plant)||」||の (of)|
5.4.3 Role of language model
Currently, most existing language models do not utilize the source information. Accordingly, to keep the language model's fluent predictions from introducing noise, language models should make their predictions independently of translation models and be incorporated through attention from the translation model. However, language models are useful in that they have target information that results in fluent output; they can thus make a prediction even without knowing the source sentence.
Ultimately, the role of the language model in the proposed mechanism is to augment the target information in order for the translation model to improve the fluency of the output sentence. Consequently, the fusion mechanism takes translation options from the language model only when it improves fluency and does not harm adequacy. It can be regarded as a regularization method to help disambiguate stylistic subtleness such as in the successful example in Table 5.
We proposed Dynamic Fusion for machine translation. Experimental results demonstrated the effectiveness of using an attention mechanism in conjunction with a language model for NMT. Rather than combining the language model and translation model with a fixed weight, an attention mechanism was utilized with the language model to improve fluency without reducing adequacy. This further improved BLEU and RIBES scores.
The proposed mechanism fuses the existing language and translation models by utilizing an attention mechanism at a static ratio. In the future, we would like to consider a mechanism that can dynamically weight the mix-in ratio, as in Cold Fusion.
-  Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: Proc. of ICLR (2015)
-  Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J.: Large language models in machine translation. In: Proc. of EMNLP-CoNLL. pp. 858–867 (2007)
-  Gulcehre, C., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H.C., Bougares, F., Schwenk, H., Bengio, Y.: On using monolingual corpora in neural machine translation. arXiv (2015)
-  Hirasawa, T., Yamagishi, H., Matsumura, Y., Komachi, M.: Multimodal machine translation with embedding prediction. In: Proc. of NAACL. pp. 86–91 (Jun 2019)
-  Isozaki, H., Hirao, T., Duh, K., Sudoh, K., Tsukada, H.: Automatic evaluation of translation quality for distant language pairs. In: Proc. of EMNLP. pp. 944–952 (2010)
-  Luong, T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: Proc. of EMNLP. pp. 1412–1421 (2015)
-  Luong, T., Sutskever, I., Le, Q., Vinyals, O., Zaremba, W.: Addressing the rare word problem in neural machine translation. In: Proc. of ACL. pp. 11–19 (2015)
-  Matsumura, Y., Komachi, M.: Tokyo Metropolitan University neural machine translation system for WAT 2017. In: Proc. of WAT. pp. 160–166 (2017)
-  Nakazawa, T., Yaguchi, M., Uchimoto, K., Utiyama, M., Sumita, E., Kurohashi, S., Isahara, H.: ASPEC: Asian scientific paper excerpt corpus. In: Proc. of LREC. pp. 2204–2208 (2016)
-  Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proc. of ACL. pp. 311–318 (2002)
-  Qi, Y., Sachan, D., Felix, M., Padmanabhan, S., Neubig, G.: When and why are pre-trained word embeddings useful for neural machine translation? In: Proc. of NAACL. pp. 529–535 (2018). https://doi.org/10.18653/v1/N18-2084
-  Ramachandran, P., Liu, P., Le, Q.: Unsupervised pretraining for sequence to sequence learning. In: Proc. of EMNLP. pp. 383–391 (2017). https://doi.org/10.18653/v1/D17-1039
-  Sennrich, R., Haddow, B.: Linguistic input features improve neural machine translation. In: Proc. of WMT. pp. 83–91 (2016). https://doi.org/10.18653/v1/W16-2209
-  Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. In: Proc. of ACL. pp. 86–96 (2016). https://doi.org/10.18653/v1/P16-1009
-  Sriram, A., Jun, H., Satheesh, S., Coates, A.: Cold Fusion: Training seq2seq models together with language models. arXiv (2017)
-  Stahlberg, F., Cross, J., Stoyanov, V.: Simple Fusion: Return of the language model. In: Proc. of WMT. pp. 204–211 (2018)
-  Tu, Z., Lu, Z., Liu, Y., Liu, X., Li, H.: Modeling coverage for neural machine translation. In: Proc. of ACL. pp. 76–85 (2016). https://doi.org/10.18653/v1/P16-1008
-  Yamagishi, H., Kanouchi, S., Sato, T., Komachi, M.: Improving Japanese-to-English neural machine translation by voice prediction. In: Proc. of IJCNLP. pp. 277–282 (Nov 2017)