The encoder-decoder framework has been widely used in the task of neural machine translation (NMT) (Luong et al., 2015; He et al., 2017) and has gradually been adopted by industry in the past several years (Zhou et al., 2016; Wu et al., 2016). In such a framework, the encoder encodes a source language sentence x = (x_1, x_2, ..., x_n) into a sequence of vectors (h_1, h_2, ..., h_n), where n is the length of the input sentence. The decoder generates a target language sentence y = (y_1, y_2, ..., y_m) word by word based on the source-side vector representations and the previously generated words, where m is the length of the output sentence. To allow the NMT model to decide which source words should take part in predicting the next target word, the attention mechanism (Luong et al., 2015; Bahdanau et al., 2015) has been widely applied in training neural networks, enabling models to learn alignments between different modalities.
Recently, transforming text into a language-independent semantic space has gained significant popularity and is a coveted goal in the field of natural language processing. Many methods have been proposed to learn cross-lingual word embeddings by independently training the embeddings in different languages with monolingual corpora, and then learning a linear transformation that maps them to a shared space based on a bilingual dictionary (Artetxe et al., 2016; Smith et al., 2017). In this way, the encoder is given language-independent word-level embeddings, and it only needs to learn how to compose them to build representations of larger phrases. However, those methods only learn representations at the low layers (i.e., the layers closer to the input words, such as word embeddings), while higher layers (farther from the input), which focus on high-level semantic meanings (similar to findings in the computer vision community for image features), are usually ignored (Belinkov et al., 2017; Guo et al., 2018). Moreover, as the cross-lingual word embeddings are initially trained on the corpus of each language, we have to retrain and update them whenever a new language is encountered, which is very time-consuming.
For the encoder-decoder structure, there is another problem to be solved. As the model first encodes the source sentence into a high-dimensional vector and then decodes it into a single target sentence, it is hard to interpret what is going on inside such a procedure (Shi et al., 2016). In reality, when we translate a sentence into another language, we first comprehend it and summarize it into a language-independent semantic space (sometimes even an image), and then use the target language to express it (Robinson, 2012). The quality of this semantic space directly correlates with the performance of the translation and is one of the core problems of natural language understanding.
In this paper, to avoid the tight dependency on specific language pairs, we propose a novel Bi-Decoder Augmented Network for the task of neural machine translation. Given the source input x, the encoder E first encodes it into high-level space vectors and the decoder D1 generates the target language sequence based on them. In addition to D1, we also design an auxiliary decoder D2, which composes an autoencoder with E to reconstruct the input sentence at training time. By simultaneously optimizing the model from two linguistic perspectives, the shared encoder can benefit from additional information embedded in a common semantic space across languages. At test time, we only use the well-trained D1 to output the target language. Moreover, because the reference sentence of D2 is the input sentence itself, we don't need any additional labor or training data. To ensure that the training procedure of the bi-decoder framework captures real knowledge of the language, we use strategies such as reinforcement learning and denoising to alternately optimize different objective functions, which we call multi-objective learning. The contributions of this paper can be summarized as follows.
Unlike the previous studies, we study the problem of neural machine translation from the viewpoint of the augmentation of the high-level semantic space. We propose a bi-decoder structure to help the model generate a language-independent representation of the text without any additional training data.
We incorporate reinforcement learning and denoising for multi-objective learning in our training process to enable the autoencoder structure to learn deep semantic knowledge of the language.
We conduct extensive experiments on several high-quality datasets to show that our method significantly improves the performance of the baselines.
The rest of this paper is organized as follows. In Section 2, we introduce our proposed bi-decoder framework. In Section 3, we present the training details of the model. A variety of experimental results are presented in Section 4. We provide a brief review of the related work in Section 5. Finally, we provide some concluding remarks in Section 6.
2 The Framework
In this section, as shown in Figure 1, we introduce the framework of our Bi-Decoder Augmented Network. The encoder-decoder models are typically implemented with a Recurrent Neural Network (RNN) based sequence-to-sequence structure. Such a structure directly models the probability P(y|x) of a target sentence y = (y_1, y_2, ..., y_m) conditioned on the source sentence x = (x_1, x_2, ..., x_n), where n and m are the lengths of x and y. For the auxiliary decoder D2, although the real target is still the source sentence x, we will uniformly denote it as y.
2.1 Shared Encoder
Given an input sentence x, the shared encoder E reads x word by word and generates a hidden representation of each word:

h_t = f(h_{t-1}, e(x_t))

where f is the recurrent unit such as the Long Short-Term Memory (LSTM) (Sutskever et al., 2014) unit or Gated Recurrent Unit (GRU) (Cho et al., 2014), x_t is the t-th word of the sentence x, e(x_t) is the word embedding vector of x_t, and h_t is the hidden state. In this paper, we use the bi-directional LSTM as the recurrent unit. Compared to the word embeddings, the hidden states generated by the encoder are a higher-level semantic representation of the text, which is the source knowledge of the decoder. Therefore, a language-independent representation is very important to the comprehension of the text and directly determines the quality of the neural machine translation.
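The bidirectional recurrence above can be sketched in plain Python. This is a toy illustration, not the paper's implementation: the real model uses bidirectional LSTMs with 1024 units, while here a scalar update h_t = tanh(w_e * e_t + w_h * h_{t-1}) with made-up weights stands in for the recurrent unit, to show how forward and backward states are produced and paired per word.

```python
import math

def rnn_pass(embeddings, w_e=0.5, w_h=0.8):
    # Toy recurrent unit: h_t = tanh(w_e * e_t + w_h * h_{t-1}), h_0 = 0.
    h, states = 0.0, []
    for e in embeddings:
        h = math.tanh(w_e * e + w_h * h)
        states.append(h)
    return states

def bidirectional_encode(embeddings):
    # Run the recurrence left-to-right and right-to-left, then pair the
    # two states for each position, as a bidirectional encoder does.
    fwd = rnn_pass(embeddings)
    bwd = rnn_pass(embeddings[::-1])[::-1]
    return list(zip(fwd, bwd))
```

In the real model each h_t is the concatenation of two 1024-dimensional LSTM states rather than a pair of scalars.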
Initialized by the representations obtained from the encoder, the decoders with an attention mechanism receive the word embedding of the previous word (during training, the previous word of the reference sentence; during testing, the previously generated word) at each step and generate the next word.
Although the two decoders D1 and D2 output different languages and fulfill different responsibilities, their structures are the same, so we will use uniform symbols to denote them. Specifically, at step t, the decoder takes the previous hidden state s_{t-1} generated by itself, the previously decided word y_{t-1}, and the source-side contextual vector c_t as inputs. The hidden states of the decoder are computed via:

s_t = f(s_{t-1}, y_{t-1}, c_t)

where f is a unidirectional LSTM, y_t is the t-th generated word, and s_t is the hidden state. For most attention mechanisms of the encoder-decoder models, the attention steps can be summarized by the equations below:
α_{ts} = exp(score(s_t, h_s)) / Σ_{s'} exp(score(s_t, h_{s'}))
c_t = Σ_s α_{ts} h_s
a_t = tanh(W_a [c_t; s_t])

Here, c_t is the source-side context vector, the attention vector a_t is used to derive the softmax logit and loss, W_a is a trainable parameter, and the function score can also take other forms. score is referred to as a content-based function, usually implemented among different choices:
score(s_t, h_s) = s_t^T h_s (dot)
score(s_t, h_s) = s_t^T W h_s (general)
score(s_t, h_s) = v^T tanh(W [s_t; h_s]) (concat)

where W and v are trainable parameters, and we choose the third one as our content-based function based on the experimental results. The attention vector a_t is then fed through the softmax layer to produce the predictive distribution, formulated as:

p(y_t | y_{<t}, x) = softmax(W_s a_t)

where W_s are trainable parameters. We do not share the parameters of the attention mechanism between the two decoders D1 and D2, and in this paper we regard them as part of the decoder.
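As a concrete illustration of the content-based score functions of Luong et al. (2015), the sketch below implements the dot, general, and concat forms in plain Python over small vectors. The weight values are toy assumptions chosen for demonstration; in the model they are trained.

```python
import math

def dot_score(s, h):
    # score(s_t, h_s) = s_t . h_s
    return sum(a * b for a, b in zip(s, h))

def general_score(s, h, W):
    # score(s_t, h_s) = s_t . (W h_s)
    Wh = [sum(W[i][j] * h[j] for j in range(len(h))) for i in range(len(W))]
    return dot_score(s, Wh)

def concat_score(s, h, W, v):
    # score(s_t, h_s) = v . tanh(W [s_t; h_s])
    sh = list(s) + list(h)
    hidden = [math.tanh(sum(W[i][j] * sh[j] for j in range(len(sh))))
              for i in range(len(W))]
    return dot_score(v, hidden)
```

The paper adopts the third (concat) form; the scores are then normalized with a softmax over source positions to give the attention weights α_{ts}.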
Denote θ_{D1} as the parameters of the original decoder D1, θ_{D2} as the parameters of the auxiliary decoder D2, and θ_E as the parameters of the shared encoder E, and let (X, Y) be the source-target language pairs of the training dataset. The training process of the encoder-decoder framework usually aims at seeking the optimal parameters that encode the source sequence and decode a sentence as close as possible to the reference target sentence. Formally, let θ_1 = {θ_E, θ_{D1}}; the objective function for the decoder D1 is the maximum log likelihood estimation:

L_1(θ_1) = Σ_{(x, y) ∈ (X, Y)} log P(y | x; θ_1)
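The log-likelihood term for one sentence is just the sum of the log-probabilities the decoder's softmax assigns to each reference word. The sketch below shows this computation on toy distributions; `step_probs` and `reference_ids` are illustrative names, not the paper's.

```python
import math

def sentence_log_likelihood(step_probs, reference_ids):
    # step_probs[t] is the predictive distribution at decoding step t;
    # reference_ids[t] is the vocabulary index of the t-th reference word.
    return sum(math.log(step_probs[t][reference_ids[t]])
               for t in range(len(reference_ids)))
```

Training maximizes the sum of this quantity over all sentence pairs (equivalently, minimizes the per-word cross-entropy).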
Nevertheless, in our model we have two decoders, so there are two different ends for the corresponding reference target sentences, which means that the final result of the optimization is a combination of the two decoders. For the decoder D2, the most intuitive objective is similar to the one above:

L_2(θ_2) = Σ_{x ∈ X} log P(x | x; θ_2)

where θ_2 = {θ_E, θ_{D2}}. However, this objective function may not be the optimal one because the training procedure with the same input and output sentences essentially tends to become a trivial copy task. In this way, the learned strategy would not need to capture any real knowledge of the languages involved, as there would be many degenerated solutions that blindly copy all the elements of the input sequence. To help the model learn deep semantic knowledge of the language, we train our system using the following strategies.
3.1 Denoising

From the perspective of D2, the model takes an input sentence in a given language, encodes it using the shared encoder E, then reconstructs it with the decoder D2. In order to make the encoder truly learn the compositionality of its input words in a language-independent manner, we propose to introduce random noise into the sentences. Inspired by denoising autoencoders (Vincent et al., 2010; Hill et al., 2016), where the model is trained to reconstruct the original version of a corrupted input sentence, we alter the word order of x on the output side by making random swaps between contiguous words. More concretely, for the reference sentence (which is also the input) whose length is n, we make random swaps of this kind. Denoting the disordered sentence as x̃, the objective function can be defined as:

L_den(θ_2) = Σ_{x ∈ X} log P(x̃ | x; θ_2)
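The corruption step described above can be sketched as a small utility that performs random swaps of contiguous words. The number of swaps is left as a free parameter here (the paper ties it to the sentence length), and the seeded generator is only for reproducibility of the illustration.

```python
import random

def swap_noise(words, n_swaps, rng=None):
    # Randomly swap adjacent word pairs n_swaps times, preserving the
    # multiset of words while perturbing their order.
    rng = rng or random.Random(0)
    words = list(words)
    for _ in range(n_swaps):
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return words
```

Because only the order is perturbed, the reconstruction target carries the same content as the input, so the decoder cannot succeed by position-wise copying alone.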
In this way, the model needs to learn about the internal knowledge of the languages without the information of the correct word order. At the same time, by discouraging the model to rely too much on the word order of the input sequence, we can better account for the actual word order divergences across languages.
3.2 Reinforcement Learning
Different from translating the source language to the target language, our autoencoder architecture plays more of a subsidiary role, aiming to capture the main information of the text rather than transforming each exact word. Many words have synonyms, which may be distributed closely in the word embedding space but are regarded as errors by the cross-entropy objective. To tackle this problem, inspired by previous works that leverage reinforcement learning (Pan et al., 2018b, 2019), we use the REINFORCE (Williams, 1992) algorithm to maximize the expected reward, which is defined as:

L_rl(θ_2) = E_{x̂ ~ P(· | x; θ_2)} [R(x, x̂)]

where P(· | x; θ_2) is the action policy that reconstructs the sentence, x̂ is obtained by sampling from the predicted probability distribution P(· | x; θ_2) of D2, and:

R(x, x̂) = cos(emb(x), emb(x̂))

is the reward function, defined to use the cosine function to measure the overlap in the embedding space between the predicted text and the input text. To this end, we maximize the expectation of the similarity of x and the reconstructed x̂ so as to ignore the grammatical structure and effectively pay more attention to the key information.
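One plausible instantiation of this reward is the cosine similarity between mean-pooled word embeddings of the input and the sampled reconstruction. The mean-pooling choice is an assumption for illustration; the paper only states that the reward compares the two texts in embedding space.

```python
import math

def mean_vector(vectors):
    # Bag-of-embeddings representation: component-wise mean of word vectors.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_reward(input_embs, sampled_embs):
    # R(x, x_hat) = cos(emb(x), emb(x_hat)) over the pooled representations.
    a, b = mean_vector(input_embs), mean_vector(sampled_embs)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)
```

Because the reward depends only on the pooled embeddings, a reconstruction that swaps a word for a close synonym or reorders words is penalized far less than under cross-entropy.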
3.3 Multi-Objective Learning
During training, we alternate the mini-batch optimization of the three objective functions based on a tunable "mixing ratio" r_1 : r_2 : r_3, meaning that optimizing L_1 for r_1 mini-batches is followed by optimizing L_den for r_2 mini-batches, followed by optimizing L_rl for r_3 mini-batches.
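The alternation can be sketched as a simple cyclic schedule that emits which objective the next mini-batch should optimize (the objective labels are illustrative names for L_1, L_den, and L_rl):

```python
from itertools import islice

def objective_schedule(r1, r2, r3):
    # Yields the objective to optimize for each successive mini-batch,
    # cycling through r1 MLE batches, r2 denoising batches, r3 RL batches.
    while True:
        for _ in range(r1):
            yield "mle"        # original translation objective L_1
        for _ in range(r2):
            yield "denoising"  # reconstruction with word-order noise, L_den
        for _ in range(r3):
            yield "reinforce"  # expected cosine reward, L_rl
```

With the paper's setting of 5:2:2, every cycle of nine mini-batches spends five on translation and two each on the auxiliary objectives.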
More importantly, instead of directly training the model until all of its parts converge, we first jointly train the parameters of the whole model until the decoder D1 reaches 90% convergence, and then fix θ_{D1} (the parameters of D1) and train θ_2 using L_den and L_rl until the model fully converges. This is because the goal of neural machine translation is to accurately transform the source language text into the target language text, so optimizing the decoder D1 is always the most important objective of the training procedure. By starting from a 90% convergence baseline, we can prevent the model from getting stuck in a local minimum. Based on this, we present our training procedure in Algorithm 1.
4.1 Implementation Details
We evaluate our proposed algorithms and the baselines on three pairs of languages: English-to-German (En→De), English-to-French (En→Fr), and English-to-Vietnamese (En→Vi). In detail, for En→De and En→Fr, we employ the standard filtered WMT'14 dataset (http://statmt.org/wmt14/translation-task.html), which is widely used in NMT evaluations (Luong et al., 2015; Bahdanau et al., 2015) and contains 1.9M and 2.0M training sentence pairs respectively. We test our models on newstest2014 in both directions for En→De. For En→Vi, we use IWSLT 2015 (https://nlp.stanford.edu/projects/nmt/data/), a smaller-scale dataset containing a 133k training set and a 1.2k test set.
We use the architecture from (Luong et al., 2015) as our baseline framework to construct our Bi-Decoder Augmented Network (BiDAN). We also employ the GNMT (Wu et al., 2016) attention to parallelize the decoder's computation. When training our NMT systems, we split the data into subword units using BPE (Sennrich et al., 2016). We train 4-layer LSTMs of 1024 units with a bidirectional encoder and 4-layer unidirectional LSTMs of 1024 units for both D1 and D2; the embedding dimension is 1024. The mixing ratio r_1 : r_2 : r_3 is set as 5:2:2. The model is trained with stochastic gradient descent with a learning rate that begins at 1.0. We train for 680K steps; after 340K steps, we start halving the learning rate every 34K steps. The batch size is set as 128 for the baseline model and 64 for BiDAN, and the dropout rate is 0.2. For both the baseline and our BiDAN network, we use beam search with beam size 10 to generate sentences.
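One plausible reading of the learning-rate schedule above (constant at 1.0 for the first 340K steps, then halved every further 34K steps) can be written as a small step-to-rate function; the exact boundary handling is an assumption, as the paper does not spell it out.

```python
def learning_rate(step, base=1.0, warm_steps=340_000, halve_every=34_000):
    # Constant learning rate until warm_steps, then halve every
    # halve_every steps thereafter.
    if step < warm_steps:
        return base
    halvings = (step - warm_steps) // halve_every + 1
    return base * (0.5 ** halvings)
```

Under this reading the rate decays by a factor of 2^10 ≈ 1000 over the final 340K of the 680K training steps.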
| Model | En→De | De→En | En→Fr | En→Vi |
|---|---|---|---|---|
| Baseline + AD | 24.0 | 28.2 | 33.6 | 26.2 |
| Baseline + AD (Denoising) | 24.3 | 28.4 | 34.0 | 26.6 |
| Baseline + AD (RL) | 24.4 | 28.5 | 33.9 | 26.8 |
| BiDAN (All modules converge) | 24.6 | 28.7 | 34.1 | 27.0 |
| Model | En→De | De→En |
|---|---|---|
| Transformer + AD | 28.1 | 32.0 |
| Transformer + AD (Denoising) | 28.0 | 31.9 |
| Transformer + AD (RL) | 28.1 | 32.1 |
As shown in Table 1, our BiDAN model performs much better than the baseline model on all of the datasets. In the middle part, we conduct an ablation experiment to evaluate the individual contribution of each component of our model. First, we only add the auxiliary decoder to the baseline model, and the BLEU scores on all the test sets rise by about 1.4 points, which shows the effectiveness of our bi-decoder architecture and the significance of a language-independent representation of the text. We then train our model with the denoising process, in other words, we take out the objective function L_rl when optimizing our BiDAN model. We observe that the performance rises by around 0.3 points, which suggests that focusing on the internal structure of the languages is helpful to the task. Next, we use reinforcement learning with the original loss to train our auxiliary decoder, which means we do not use the objective function L_den. The results show that this leads to about a 0.4 point improvement, indicating that relaxing the grammatical limitation and capturing the keyword information is useful in our bi-decoder architecture. Finally, instead of first jointly training the parameters of the whole model until the decoder D1 reaches 90% convergence and then fixing θ_{D1} and training θ_2 until full convergence, we directly train the model until all of its parts converge. The resulting drop in performance suggests that fully joint training can leave the optimization trapped in a local minimum.
4.3 Transformer Ablation
We also conduct experiments on the Transformer (Vaswani et al., 2017), another state-of-the-art architecture for NMT. We adopt the base model and settings in the official implementation (https://github.com/tensorflow/models/tree/master/official/transformer). As depicted in Table 2, our auxiliary decoder improves the performance of the Transformer as well. However, the improvement from reinforcement learning is quite modest, while the denoising part even lowers the scores. We conjecture that this is because the positional encoding of the Transformer reduces the model's dependence on word order, thus counteracting the effects of these approaches.
| Encoder Source | BLEU-1 | BLEU-2 |
|---|---|---|
| En→De Encoder (BiDAN) | 27.8 | 2.5 |
| En→Fr Encoder (Original) | 62.4 | 39.1 |
4.4 Language Independence
In Table 3, we use encoders trained on different sources to evaluate the extent of language dependence of the models' high-level text representations. We train the NMT model on the WMT14 English-to-French dataset, and then replace the well-trained encoder with encoders from different sources at testing time. Since the modified 3-gram and 4-gram precisions (Papineni et al., 2002) are all 0 except for the original well-trained encoder, we use BLEU-1 and BLEU-2 to evaluate the difference among the models. We first replace the encoder with random parameters, and the results drop substantially, which is not surprising. Then we use the encoder trained on the WMT14 English-to-German dataset to test the cross-language performance. The results improve modestly compared to the random encoder but still show a huge gap to the original model. This indicates that coarse cross-language training does not work even though the languages are quite similar. Finally, we replace the original encoder with the encoder from our BiDAN framework, which is also trained on the WMT14 English-to-German dataset. Although the performance is still far from the level achieved by the original encoder, it is much better than the previous two methods. We point out that the only difference among these models is the parameters of the encoder, which determine the text representation. Thus, the different performances demonstrate that our model provides a more language-independent text representation, and this may also explain the improvement of our structure over the general NMT model.
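The modified n-gram precision used in this comparison clips each candidate n-gram count by its maximum count in the reference (Papineni et al., 2002). A minimal single-reference sketch:

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n):
    # Clip each candidate n-gram's count by its count in the reference,
    # then divide the clipped total by the number of candidate n-grams.
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    clipped = sum(min(count, ref[g]) for g, count in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0
```

For example, the degenerate candidate "the the the the the the the" against the reference "the cat is on the mat" gets a 1-gram precision of only 2/7, since "the" appears at most twice in the reference.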
|Source: Die Täter hatten Masken getragen und waren nicht erkannt worden.|
|Reference: The assailants had worn masks and had not been recognised.|
|Baseline: The perpetrators had borne masks and were not recognized.|
|BiDAN: The perpetrators had worn masks and had not been recognized.|
|Source: Zu einem Grillfest bringt Proctor beispielsweise Auberginen, Champignons und auch ein paar Veganwürste mit.|
|Reference: For example, when attending a barbecue party, Proctor brings aubergines, mushrooms and also a few vegan sausages.|
|Baseline: Proctor, for example, brings Aubergins, Champignons and a few Vegans.|
|BiDAN: For example, Proctor brings aubergines, mushrooms and also a few vegan sausages at a barbecue.|
We present our model on all of the datasets with different values of r_1 to show how multi-objective learning affects the performance, as depicted in Figure 2. In other words, we keep r_2 and r_3 at 2 and change the ratio r_1 of the original objective L_1. As we can see, on all of the datasets the model drops sharply, performing even worse than the baseline model, when we set r_1 to 0, which means we only use the objective functions L_den and L_rl without the original one. This indicates that totally ignoring the grammatical structure of the input text is not helpful to the task. We also observe that the performance rises with the increase of r_1 until 5 or 6. Afterwards, the results get worse as we raise r_1 further, which shows that multi-objective learning can improve the performance. We did not conduct more experiments with larger values of r_1 because the ablation experiments show that the final results would converge to the BLEU values in the second row of Table 1, which is about 0.7 points lower than the best performance.
Figure 3 shows the performance for different lengths of the source sentences on WMT14 English→German. As we can see, our method performs better than the baseline for all of the lengths. We also find that the extent of the improvement on long sentences is larger than that on short sentences. We conjecture this is because our method is able to deeply understand the logical relations of the sentences via the language-independent semantic space, and thus performs better on long and complex sentences.
4.6 Text Analysis
In order to better understand the behavior of the proposed model, in Table 4 we present some examples comparing the sentences generated by our model and the baseline with the reference sentences from the WMT14 German→English dataset.
In the first example, the reference translates the word “getragen” as “worn” and the word “Täter” as “assailants”. Both our method and the baseline model translate “Täter” as “perpetrators”, a synonym of “assailants”. However, the baseline model translates the word “getragen” as “borne”, the past participle of “bear”. Interestingly, the German word “getragen” in isolation translates as “carry”, which is very similar to “bear”, but neither fits the object “mask” in English. This example shows that our model alleviates a weakness of the original unidirectional source-to-target architecture, which tends to degenerate into simple n-gram matching and pays less attention to organizing natural language.
For the second example, we can see that the translation given by the baseline model does not make sense in terms of semantic relations, while our model accurately translates the source text. It seems that the baseline is not good at recognizing words whose first letters are capitalized. However, by generating a language-independent semantic space via the bi-decoder structure, our model can effectively understand the meanings of the sentences, much like the thinking process of a human being.
5 Related Works
Research on language-independent representation of the text has attracted a lot of attention in recent times. Many significant works have been proposed to learn cross-lingual word embeddings (Artetxe et al., 2016; Smith et al., 2017), which have a direct application in inherently cross-lingual tasks like machine translation (Zou et al., 2013), cross-lingual entity linking (Tsai and Dan, 2016), and part-of-speech tagging (Gaddy et al., 2016; Pan et al., 2017), etc. However, very few works pay attention to the language-independent representations of the text for the higher layers of the neural networks, which may contain higher level semantic meanings and significantly affect the performance of the model. In this work, focusing on the representations generated from the encoder, we augment the natural language understanding of the NMT model by introducing an auxiliary decoder into the original encoder-decoder framework.
There have been several proposals to improve the source-to-target dependency of sequence-to-sequence models (Cheng et al., 2016; Xia et al., 2017; Tu et al., 2017; Artetxe et al., 2018; Pan et al., 2018a, 2019). Closely related to our work, (Tu et al., 2017) proposed to reconstruct the input source sentence from the hidden layer of the output target sentence to ensure that the information on the source side is transferred to the target side as much as possible. However, their method uses the decoder states as the input of the reconstructor and deeply relies on the hidden states of the decoder, which contributes less to learning language-independent representations. (Artetxe et al., 2018) use monolingual corpora to train a shared encoder based on fixed cross-lingual embeddings. However, that work learns an NMT system in a completely unsupervised manner that removes the need for parallel data, while our method aims to augment the performance of a source-to-target model based on supervised learning.
In this paper, we propose the Bi-Decoder Augmented Network for the task of neural machine translation. We design an architecture that contains two decoders based on the encoder-decoder model, where one generates the target language and the other reconstructs the source sentence. While being trained to transform the sequence into two languages, the model has the potential to generate a language-independent semantic space of the text. The experimental evaluation shows that our model achieves significant improvements over the baselines on several standard datasets. For future work, we would like to explore whether we can extend our idea to different target media such as images and text, or to deeper levels of the neural networks.
This work was supported in part by The National Key Research and Development Program of China (Grant Nos: 2018AAA0101400), in part by The National Nature Science Foundation of China (Grant Nos: 61936006), in part by the Alibaba-Zhejiang University Joint Institute of Frontier Technologies.
- Unsupervised neural machine translation. In ICLR, Cited by: §5.
- Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In EMNLP, Cited by: §1, §5.
- Neural machine translation by jointly learning to align and translate. In ICLR, Cited by: §1, §4.1.
- What do neural machine translation models learn about morphology?. In ACL, Cited by: §1.
- Semi-supervised learning for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Cited by: §5.
- Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP, pp. 1724–1734. Cited by: §2.1.
- Ten pairs to tag - multilingual pos tagging via coarse mapping between embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1307–1317. Cited by: §5.
- Soft layer-specific multi-task summarization with entailment and question generation. In arXiv preprint arXiv:1805.11004, Cited by: §1.
- Decoding with value networks for neural machine translation. In NIPS, pp. 178–187. Cited by: §1.
- Learning distributed representations of sentences from unlabelled data. In Proceedings of NAACL-HLT, pp. 1367–1377. Cited by: §3.1.
- Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412–1421. Cited by: §1, §4.1, §4.1.
- Reinforced dynamic reasoning for conversational question generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2114–2124. Cited by: §3.2, §5.
- Memen: multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098. Cited by: §5.
- Macnet: transferring knowledge from machine comprehension to sequence-to-sequence models. In Advances in Neural Information Processing Systems, pp. 6092–6102. Cited by: §5.
- Discourse marker augmented network with reinforcement learning for natural language inference. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 989–999. Cited by: §3.2.
- BLEU: a method for automatic evaluation of machine translation. In ACL, pp. 311–318. Cited by: §4.4.
- Becoming a translator: an introduction to the theory and practice of translation. Routledge. Cited by: §1.
- Neural machine translation of rare words with subword units. In ACL, Cited by: §4.1.
- Does string-based neural mt learn source syntax?. In EMNLP, pp. 1526–1534. Cited by: §1.
- Offline bilingual word vectors, orthogonal transformations and the inverted softmax. In ICLR, Cited by: §1, §5.
- Sequence to sequence learning with neural networks. In NIPS, pp. 3104–3112. Cited by: §2.1.
- Cross-lingual wikification using multilingual embeddings. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 589–598. Cited by: §5.
- Neural machine translation with reconstruction.. In AAAI, Cited by: §5.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §4.3.
- Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, Vol. 11, pp. 3371–3408. Cited by: §3.1.
- Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Machine learning, Vol. 8, pp. 229–256. Cited by: §3.2.
- Google’s neural machine translation system: bridging the gap between human and machine translation. In arXiv preprint arXiv:1609.08144, Cited by: §1, §4.1.
- Deliberation networks: sequence generation beyond one-pass decoding. In Advances in Neural Information Processing Systems, pp. 1784–1794. Cited by: §5.
- Deep recurrent models with fast-forward connections for neural machine translation. In Transactions of the Association of Computational Linguistics, Cited by: §1.
- Bilingual word embeddings for phrase-based machine translation. In EMNLP, Cited by: §5.