Neural machine translation (NMT) has recently achieved state-of-the-art results in several translation tasks [Bojar et al.2016] and for various language pairs. Its conceptual simplicity has attracted many researchers as well as a growing number of private entities that have begun to include NMT engines in their production systems.
NMT networks are learned directly from parallel bi-texts, consisting of large amounts of human sentences together with their corresponding translations. Even if all translations in a bi-text are considered suitable, i.e. the meaning is preserved and the target language is fully correct, there is large variability among these translations: in some cases the translations follow a more or less word-for-word pattern (literal translations), while in many others they show greater latitude of expression (free translations). A good human translation is often judged by this latitude of expression. In contrast, machine translations are usually "closer", in terms of syntactic structure and even word choices, to the input sentences. Hence, even when the translation output is very good, these translations generally remain closer to literal translations, because free translations are by definition more complex and harder to learn and model. It is also a rather intuitive idea that feeding more literal translations to a neural translation engine during training should facilitate the training process, compared to the same training with less literal translations.
We report preliminary results of experiments where we automatically simplify a human translation bi-text, which is then used to train neural translation engines, thus boosting the learning ability of neural translation models. We show that the resulting models perform even better than a neural translation engine trained on the reference dataset. The remainder of this paper is structured as follows. Section 2 briefly surveys previous work. Section 3 outlines our neural MT engine. Section 4 details the simplification approach presented in this paper. Section 5 reports experimental results and Section 6 draws conclusions and proposes further work.
2 Related Work
A neural encoder-decoder model performing text simplification at the lexical and syntactic levels is reported in [Wang et al.2016]. That work introduces a model for text simplification; it differs from ours in that we use a neural MT engine to simplify translations, which are then used to boost translation performance, while their end goal is text simplification itself. In [de Gispert et al.2015], a phrase-based SMT system is presented that employs as a preprocessing module a neural network modelling source-side preordering, aiming at finding a permutation of the source sentence that matches the target sentence word order, also with the objective of simplifying the translation task. The work by [Niehues et al.2016] presents a technique to combine phrase-based and neural MT. The phrase-based system is first used to produce an initial hypothesis, which is then considered together with the input sentence by a neural MT engine to produce the final hypothesis. The authors claim that the combined approach shows the strengths of both approaches, namely fluent translations and the ability to translate rare words.
In this work we have used one of the knowledge distillation techniques detailed in [Kim and Rush2016], where the authors train a smaller student network to perform better by learning from a larger teacher network, allowing more compact neural MT models to be built. With a similar objective, [Hinton et al.2015] claim that distillation works well for transferring knowledge from an ensemble or from a large, highly regularised model into a smaller, distilled model.
3 Neural MT
Our NMT system follows the architecture presented in [Bahdanau et al.2014]. It is implemented as an encoder-decoder network with multiple layers of an RNN with Long Short-Term Memory hidden units [Zaremba et al.2014].
The encoder is a bidirectional neural network that reads an input sequence $x = (x_1, \ldots, x_J)$ and calculates a forward sequence of hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_J)$ and a backward sequence $(\overleftarrow{h}_1, \ldots, \overleftarrow{h}_J)$. The decoder is an RNN that predicts a target sequence $y = (y_1, \ldots, y_I)$, with $J$ and $I$ being respectively the source and target sentence lengths. Each word $y_i$ is predicted based on a recurrent hidden state $s_i$, the previously predicted word $y_{i-1}$, and a context vector $c_i$. We employ the attentional architecture from [Luong et al.2015]. The framework is available on the open-source project seq2seq-attn (http://nlp.seas.harvard.edu). Additional details are given in [Crego et al.2016].
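As an illustration of the attentional component, the sketch below computes a Luong-style global attention context vector with dot-product scoring. It is a minimal NumPy sketch under assumed shapes, not the seq2seq-attn implementation; the function name and dimensions are illustrative.

```python
import numpy as np

def luong_attention(decoder_state, encoder_states):
    """Global (dot) attention: the context vector is a softmax-weighted
    sum of the encoder hidden states [Luong et al.2015].

    decoder_state:  vector of shape (d,)   -- current decoder hidden state
    encoder_states: matrix of shape (J, d) -- one hidden state per source word
    """
    scores = encoder_states @ decoder_state      # (J,) dot-product scores
    weights = np.exp(scores - scores.max())      # numerically stable softmax
    weights /= weights.sum()
    context = weights @ encoder_states           # (d,) context vector
    return context, weights
```

Each target word prediction then conditions on the decoder state together with this context vector.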
4 Translation Simplification
Translation simplification is based on the idea that any sentence may have multiple translations, all equally suitable. Following this idea, and despite the fact that deep neural networks have achieved excellent performance on many difficult tasks, we are interested in keeping the translation task as simple as possible. Hence, for a training bi-text we prefer translations that have a structure similar to that of the source sentences. The following example shows an English sentence translated into two distinct French sentences:
Both French translations are suitable. However, the last French translation is closer in terms of sentence structure to its English counterpart.
Producing "close" translations is the natural behaviour of Machine Translation systems. Hence, we use a neural MT system to simplify a translation bi-text. Similar to knowledge distillation, target language simplification is performed in 3 steps:
(1) train a teacher model on the reference translations,
(2) run beam search over the source side of the training set with the teacher model,
(3) train the student network on this new dataset.
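The three steps can be sketched as a small pipeline. `train` and `beam_search` below are hypothetical stand-ins (a toy memorising model) for a real NMT toolkit, intended only to make the control flow concrete:

```python
def train(src_sentences, tgt_sentences):
    """Stand-in for NMT training: here the 'model' just memorises pairs."""
    return dict(zip(src_sentences, tgt_sentences))

def beam_search(model, src_sentences):
    """Stand-in for beam-search decoding with a trained model."""
    return [model[s] for s in src_sentences]

def distill(src_train, ref_train):
    # (1) train a teacher model on the reference bi-text
    teacher = train(src_train, ref_train)
    # (2) re-translate the source side of the training set with the teacher
    hyp_train = beam_search(teacher, src_train)
    # (3) train the student network on the simplified (source, Hyp) bi-text
    return train(src_train, hyp_train)
```

In practice the student is trained on the teacher's beam-search output rather than the references, which is what simplifies the target side.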
In the next Section, we analyse the training data translated by beam search following step (2) using the models built in step (1).
4.1 Translated Language Analysis
Based on the NMT system outlined in Section 3 and following the language simplification method previously outlined, we train English-to-French and English-to-German teacher networks as detailed in Section 5.1. Using these teacher models we translate the English side of both training sets, producing respectively French and German translation hypotheses. Aiming for a better understanding of the translated languages, we conduct an elementary human analysis of the French and German hypotheses. We mainly observe that, in many cases, the translation hypotheses produced by the teacher systems are paraphrases of the reference translations, and that these hypotheses are closer in terms of syntactic structure to the source sentences than the reference translations. The examples in Table 1 illustrate this fact. While both the Ref and Hyp translations can be considered equally good, the Hyp translations are syntactically closer to the source sentence. In the first example, the reference translation replaces the verb receiving with the action of communicating, so subject and indirect object are switched. In the second example several rephrasings are observed: [Si cette ratification n’a pas lieu ↔ En l’absence d’une telle ratification] and [la commission devrait être invitée ↔ il y aurait lieu d’inviter la commission]. In both examples the meaning is fully preserved and both sentences are natural.
Table 1. Examples of source sentences (Src) with their reference translations (Ref) and teacher-model hypotheses (Hyp).

Src: The Secretary-General has received views from Denmark and Kazakhstan.
Ref: Le Danemark et le Kazakhstan ont communiqué leurs vues au Secrétaire général.
Hyp: Le Secrétaire général a reçu les vues du Danemark et du Kazakhstan.

Src: If this ratification does not take place, the Commission should be called.
Ref: En l’absence d’une telle ratification, il y aurait lieu d’inviter la Commission.
Hyp: Si cette ratification n’a pas lieu, la Commission devrait être invitée.
We conduct several experiments in order to confirm that translated hypotheses are closer to the input sentences than the reference translations. We first measure the difference in length of Hyp and Ref translations with respect to the original source sentences. Figure 1 shows the histogram for the English-to-French train set. The number of target sentences with a length similar to that of the source sentence is higher for translated hypotheses than for reference translations.
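The length comparison above can be reproduced with a short histogram over token-count differences; whitespace tokenisation is an assumption of this sketch:

```python
from collections import Counter

def length_diff_histogram(src_sents, tgt_sents):
    """Histogram of target-minus-source sentence lengths, in tokens.
    A peak at 0 means many translations match the source length."""
    diffs = [len(t.split()) - len(s.split())
             for s, t in zip(src_sents, tgt_sents)]
    return Counter(diffs)
```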
Additionally, we compare the number of crossed alignments (word alignments computed using https://github.com/clab/fast_align) on both language pairs (source-to-Hyp and source-to-Ref) in order to validate the closeness (similarity) of their syntactic structures. Given a sentence pair with its set of alignments, we compute for each source word the number of alignment crossings between the given source word and the rest of the source words. We consider that two alignment links $(s_1, t_1)$ and $(s_2, t_2)$ are crossed if $(s_1 - s_2)(t_1 - t_2) < 0$. Figure 2 plots the difference in the number of crossed alignments between source-to-Hyp and source-to-Ref. As can be seen, the source-to-Hyp pair has a higher number of non-crossed alignments (near 4%), while the number of words with crossed alignments is higher for the source-to-Ref pair. Statistics were computed over the same number of source words for both train pairs. Notice that the translated hypotheses Hyp are automatically generated and hence contain a number of translation errors that cannot be neglected. The next section evaluates the suitability of source-to-Hyp translations as a train set for our neural MT systems, compared to source-to-Ref translations.
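The crossing count described above can be sketched as follows, assuming alignments are given as (source index, target index) pairs as produced by fast_align:

```python
def count_crossings(alignments):
    """Count pairs of alignment links that cross each other.
    Two links (s1, t1) and (s2, t2) cross when their source and
    target orders disagree, i.e. (s1 - s2) * (t1 - t2) < 0."""
    crossings = 0
    for a in range(len(alignments)):
        for b in range(a + 1, len(alignments)):
            (s1, t1), (s2, t2) = alignments[a], alignments[b]
            if (s1 - s2) * (t1 - t2) < 0:
                crossings += 1
    return crossings
```

A monotone, word-for-word alignment yields zero crossings, while reordered translations, such as the Ref of the first example in Table 1, produce several.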