Machine Learning Knowledge Exchange
In this paper, we propose a novel finetuning algorithm for the recently introduced multi-way, multilingual neural machine translation model that enables zero-resource machine translation. When used together with novel many-to-one translation strategies, we empirically show that this finetuning algorithm allows the multi-way, multilingual model to translate a zero-resource language pair (1) as well as a single-pair neural translation model trained with up to 1M direct parallel sentences of the same language pair and (2) better than a pivot-based translation strategy, while keeping only one additional copy of attention-related parameters.
The recently introduced neural machine translation [Forcada and Ñeco1997, Kalchbrenner and Blunsom2013, Sutskever et al.2014, Cho et al.2014] has proven to be a platform for new opportunities in machine translation research. Rather than word-level translation with language-specific preprocessing, neural machine translation has been found to work well with statistically segmented subword sequences as well as sequences of characters [Chung et al.2016, Luong and Manning2016, Sennrich et al.2015b, Ling et al.2015]. Also, recent works show that neural machine translation provides a seamless way to incorporate multiple modalities other than natural language text in translation [Luong et al.2015a, Caglayan et al.2016]. Furthermore, neural machine translation has been found to translate between multiple languages, achieving better translation quality by exploiting positive language transfer [Dong et al.2015, Firat et al.2016, Zoph and Knight2016].
In this paper, we conduct an in-depth investigation into the recently proposed multi-way, multilingual neural machine translation [Firat et al.2016]. Specifically, we are interested in its potential for zero-resource machine translation, in which there do not exist any direct parallel examples between a target language pair. Zero-resource translation has been addressed by pivot-based translation in traditional machine translation research [Wu and Wang2007, Utiyama and Isahara2007], but we explore a way to use the multi-way, multilingual neural model to translate directly from a source to a target language.
In doing so, we begin by studying different translation strategies available in the multi-way, multilingual model in Sec. 3–4. The strategies include a usual one-to-one translation as well as variants of many-to-one translation for multi-source translation [Zoph and Knight2016]. We empirically show that the many-to-one strategies significantly outperform the one-to-one strategy.
We move on to zero-resource translation by first evaluating a vanilla multi-way, multilingual model on a zero-resource language pair, which reveals that the vanilla model cannot do zero-resource translation (Sec. 6.1). Based on the many-to-one strategies we proposed earlier, we design a novel finetuning strategy that does not require any direct parallel corpus between a target, zero-resource language pair (Sec. 5.2) and that uses the idea of generating a pseudo-parallel corpus [Sennrich et al.2015a]. This strategy makes an additional copy of the attention mechanism and finetunes only this small set of parameters.
Large-scale experiments with Spanish, French and English show that the proposed finetuning strategy allows the multi-way, multilingual neural translation model to perform zero-resource translation as well as a single-pair neural translation model trained with up to 1M true parallel sentences. This result re-confirms the potential of the multi-way, multilingual model for low/zero-resource language translation, which was earlier argued by [Firat et al.2016].
Recently, [Firat et al.2016] proposed an extension of attention-based neural machine translation [Bahdanau et al.2015] that can handle multi-way, multilingual translation with a shared attention mechanism. This model was designed to handle multiple source and target languages. In this section, we briefly overview this multi-way, multilingual model. For a more detailed exposition, we refer the reader to [Firat et al.2016].
The goal of the multi-way, multilingual model is to build a neural translation model that can translate a source sentence given in one of $N$ source languages into one of $M$ target languages. Thus, to handle those $N$ source and $M$ target languages, the model consists of $N$ encoders and $M$ decoders. Unlike these language-specific encoders and decoders, only a single attention mechanism is shared across all language pairs.
An encoder for the $n$-th source language reads a source sentence $X = (x_1, \ldots, x_{T_x})$ as a sequence of linguistic symbols and returns a set of context vectors $C^n = \{h_1^n, \ldots, h_{T_x}^n\}$. The encoder is usually implemented as a bidirectional recurrent network [Schuster and Paliwal1997], and each context vector $h_t^n$ is a concatenation of the forward and reverse recurrent networks’ hidden states at time $t$. Without loss of generality, we assume that the dimensionalities of the context vectors for all source languages are the same.
A decoder for the $m$-th target language is a conditional recurrent language model [Mikolov et al.2010]. At each time step $t'$, it updates its hidden state by
$$z_{t'}^m = \varphi^m\left(z_{t'-1}^m, \tilde{y}_{t'-1}^m, c_{t'}^m\right),$$
based on the previous hidden state $z_{t'-1}^m$, the previous target symbol $\tilde{y}_{t'-1}^m$ and the time-dependent context vector $c_{t'}^m$. $\varphi^m$ is a gated recurrent unit (GRU, [Cho et al.2014]).
The time-dependent context vector is computed by the shared attention mechanism as a weighted sum of the context vectors from the encoder $C^n$:
$$c_{t'}^m = \sum_{t=1}^{T_x} \alpha_{t,t'}^{m,n} h_t^n, \tag{1}$$
where
$$\alpha_{t,t'}^{m,n} \propto \exp\left(f_{\text{score}}\left(h_t^n, z_{t'-1}^m, \tilde{y}_{t'-1}^m\right)\right). \tag{2}$$
The scoring function $f_{\text{score}}$ returns a scalar and is implemented as a feedforward neural network with a single hidden layer. For more variants of the attention mechanism for machine translation, see [Luong et al.2015b].
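The weighted sum performed by the attention mechanism can be illustrated with a minimal, pure-Python sketch. It assumes the raw scores have already been produced by the scoring network; the function names `softmax` and `attention_context` and the toy vectors are hypothetical and not taken from the released code.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(context_vectors, scores):
    """Return the attention-weighted sum of the encoder context vectors.

    context_vectors: one vector (list of floats) per source position.
    scores: one raw scalar per source position, standing in for the
            output of the scoring network (an assumption of this sketch).
    """
    weights = softmax(scores)
    dim = len(context_vectors[0])
    return [
        sum(w * h[d] for w, h in zip(weights, context_vectors))
        for d in range(dim)
    ]

# Toy example: three source positions, two-dimensional context vectors.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = attention_context(H, scores=[0.0, 0.0, 0.0])   # equal scores give the plain mean
peaked = attention_context(H, scores=[50.0, 0.0, 0.0])   # nearly all weight on position 0
```

With equal scores the context vector is the plain mean of the inputs, while a single dominant score makes it approach the corresponding encoder state.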
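The initial hidden state of the decoder is initialized as
$$z_0^m = \phi_{\text{init}}^m\left(C^n\right). \tag{3}$$
With the new hidden state $z_{t'}^m$, the probability distribution over the next symbol is computed by
$$p\left(y_{t'} = w \mid \tilde{y}_{<t'}, X\right) \propto \exp\left(g_w^m\left(z_{t'}^m, c_{t'}^m, \tilde{y}_{t'-1}^m\right)\right), \tag{4}$$
where $g_w^m$ is a decoder-specific parametric function that returns the unnormalized probability for the next target symbol being $w$.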
Training this multi-way, multilingual model does not require multi-way parallel corpora but only a set of bilingual corpora. For each bilingual pair, the conditional log-probability of a ground-truth translation given a source sentence is maximized by adjusting the relevant parameters following the gradient of the log-probability.
In the original paper by [Firat et al.2016], only one translation strategy was evaluated, that is, one-to-one translation. This one-to-one strategy works on a source sentence given in one language by taking the encoder of that source language, the decoder of a target language and the shared attention mechanism. These three components are glued together as if they form a single-pair neural translation model and translate the source sentence into a target language.
We notice, however, that this is not the only translation strategy available with the multi-way, multilingual model. As we end up with multiple encoders, multiple decoders and a shared attention mechanism, this model naturally enables us to exploit a source sentence given in multiple languages, leading to a many-to-one translation strategy, which was proposed recently by [Zoph and Knight2016] in the context of neural machine translation.
Unlike [Zoph and Knight2016], the multi-way, multilingual model is not trained with multi-way parallel corpora. This however does not necessarily imply that the model cannot be used in this way. In the remainder of this section, we propose two alternatives for doing multi-source translation with the multi-way, multilingual model.
In this section, we consider a case where a source sentence is given in two languages, $X_1$ and $X_2$. However, any of the approaches described below applies to more than two source languages trivially.
In this multi-way, multilingual model, multi-source translation can be thought of as averaging two separate translation paths. For instance, in the case of Es+Fr to En, we want to combine Es→En and Fr→En so as to get a better English translation. We notice that there are two points in the multi-way, multilingual model where this averaging may happen.
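The first candidate is to average two translation paths when computing the time-dependent context vector (see Eq. (1)). At each time $t'$ in the decoder, we compute a time-dependent context vector for each source language, $c_{t'}^1$ and $c_{t'}^2$ respectively for the two source languages. In this early averaging strategy, we simply take the average of these two context vectors:
$$c_{t'} = \frac{c_{t'}^1 + c_{t'}^2}{2}.$$
Similarly, we initialize the decoder’s hidden state to be the average of the initializers of the two encoders:
$$z_0 = \frac{1}{2}\left(\phi_{\text{init}}\left(C^1\right) + \phi_{\text{init}}\left(C^2\right)\right),$$
where $\phi_{\text{init}}$ is the decoder’s initializer (see Eq. (3)) and $C^1$ and $C^2$ are the sets of context vectors returned by the two encoders.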
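The early averaging strategy amounts to an element-wise mean, which the following minimal sketch illustrates. The helper `average_vectors` and all numeric values are made up for illustration, and applying the same mean to the two initial states is one reading of the description above rather than a confirmed detail of the implementation.

```python
def average_vectors(u, v):
    """Element-wise mean of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must share the same dimensionality")
    return [(a + b) / 2.0 for a, b in zip(u, v)]

# Hypothetical time-dependent context vectors from the Es and Fr paths
# at one decoding step (toy values).
c_es = [0.25, 0.75, -0.5]
c_fr = [0.75, 0.25, 0.5]
c_early = average_vectors(c_es, c_fr)

# Hypothetical initial decoder states computed from each encoder.
z0_es = [1.0, 0.0]
z0_fr = [0.0, 1.0]
z0_early = average_vectors(z0_es, z0_fr)
```

The dimensionality check mirrors the assumption stated earlier that context vectors from all source languages have the same size; without it, the two paths could not be averaged.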
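Alternatively, we can average those two translation paths (e.g., Es→En and Fr→En) at the output level. At each time $t'$, each translation path computes the distribution over the target vocabulary, i.e., $p(y_{t'} = w \mid \tilde{y}_{<t'}, X_1)$ and $p(y_{t'} = w \mid \tilde{y}_{<t'}, X_2)$. We then average them to get the multi-source output distribution:
$$p(y_{t'} = w \mid \tilde{y}_{<t'}, X_1, X_2) = \frac{1}{2}\left(p(y_{t'} = w \mid \tilde{y}_{<t'}, X_1) + p(y_{t'} = w \mid \tilde{y}_{<t'}, X_2)\right).$$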
An advantage of this late averaging strategy over the early averaging one is that it can work even when the two translation paths do not come from a single multilingual model. They can be two separately trained single-pair models. In fact, if $X_1$ and $X_2$ are the same and the two translation paths are simply two different models trained on the same language pair and direction, this is equivalent to constructing an ensemble, which was found to greatly improve translation quality [Sutskever et al.2014].
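Late averaging operates on output distributions only, so it can be sketched without reference to any model internals. In the minimal example below, each distribution is a dictionary over a tiny target vocabulary; the function name `late_average` and the probabilities are hypothetical.

```python
def late_average(p1, p2):
    """Average two next-symbol distributions over the same target vocabulary."""
    vocab = set(p1) | set(p2)
    return {w: 0.5 * (p1.get(w, 0.0) + p2.get(w, 0.0)) for w in vocab}

# Toy next-symbol distributions from two translation paths (made-up numbers).
p_es_en = {"the": 0.5, "a": 0.25, "cat": 0.25}
p_fr_en = {"the": 0.25, "a": 0.25, "cat": 0.5}

p_multi = late_average(p_es_en, p_fr_en)
total = sum(p_multi.values())  # an average of two distributions is a distribution
```

Because the function only sees distributions, the two inputs could equally come from two separately trained single-pair models, which is the ensemble case mentioned above.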
The two strategies above can be further combined by late-averaging the output distributions from the early averaged model and the late averaged one. We empirically evaluate this early+late average strategy as well.
Before continuing on with zero-resource machine translation, we first evaluate the translation strategies described in the previous section on multi-source translation, as these translation strategies form a basic foundation on which we extend the multi-way, multilingual model for zero-resource machine translation.
When evaluating the multi-source translation strategies, we use English, Spanish and French, and focus on a scenario where only En-Es and En-Fr parallel corpora are available.
We combine the following corpora to form 34.71m parallel Es-En sentence pairs: UN (8.8m), Europarl-v7 (1.8m), news-commentary-v7 (150k), LDC2011T07-T12 (2.9m) and internal technical-domain data (21.7m).
We combine the following corpora to form 65.77m parallel En-Fr sentence pairs: UN (9.7m), Europarl-v7 (1.9m), news-commentary-v7 (1.2m), LDC2011T07-T10 (1.6m), ReutersUN (4.5m), internal technical-domain data (23.5m) and Gigaword R2 (20.66m).
We use newstest-2012 and newstest-2013 from WMT as development and test sets, respectively.
We do not use any additional monolingual corpus.
All the sentences are tokenized using the tokenizer script from Moses [Koehn et al.2007]. We then replace special tokens, such as numbers, dates and URLs, with predefined markers, which are replaced back with the original tokens after decoding. After using byte pair encoding (BPE, [Sennrich et al.2015b]) to get subword symbols, we end up with 37k, 43k and 45k unique tokens for English, Spanish and French, respectively. For training, we only use sentence pairs in which both sentences are at most 50 symbols long.
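The marker replacement and the length filter can be sketched as follows. This is only an illustration under stated assumptions: the marker strings `<url>` and `<num>`, the regular expressions and the helper names are hypothetical (the actual markers are not specified), and dates are not handled.

```python
import re

# Hypothetical markers and patterns; the real ones are not given in the text.
# URLs are masked first so that digits inside a URL are not masked separately.
PATTERNS = [
    ("<url>", re.compile(r"https?://\S+")),
    ("<num>", re.compile(r"\d+(?:[.,]\d+)*")),
]

def mask_special_tokens(sentence):
    """Replace URLs and numbers with markers and remember the originals."""
    saved = {}
    for marker, pattern in PATTERNS:
        saved[marker] = pattern.findall(sentence)
        sentence = pattern.sub(marker, sentence)
    return sentence, saved

def restore_special_tokens(sentence, saved):
    """Put the originals back, one marker occurrence at a time, left to right."""
    for marker, originals in saved.items():
        for original in originals:
            sentence = sentence.replace(marker, original, 1)
    return sentence

def within_length(src_symbols, tgt_symbols, max_len=50):
    """Keep a training pair only if both sides have at most max_len symbols."""
    return len(src_symbols) <= max_len and len(tgt_symbols) <= max_len

text = "See https://example.com/2016 for 3.5 million pairs"
masked, saved = mask_special_tokens(text)
restored = restore_special_tokens(masked, saved)
```

Restoring left to right is only correct when the markers keep their relative order through translation; a real system would need some alignment between source and target markers.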
See Table 1 for the detailed statistics.
We start from the code made publicly available as a part of [Firat et al.2016] (https://github.com/nyu-dl/dl4mt-multi). We made two changes to the original code. First, we replaced the decoder with the conditional gated recurrent network with the attention mechanism as outlined in [Firat and Cho2016]. Second, we feed a binary indicator vector of which encoder(s) the source sentence was processed by to the output layer of each decoder (the output function in Eq. (4)). Each dimension of the indicator vector corresponds to one source language, and in the case of multi-source translation, more than one dimension may be set to 1.
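The binary indicator vector can be illustrated with a short sketch. The language ordering and the function name `encoder_indicator` are assumptions made for this example and are not taken from the released code.

```python
# Assumed, fixed ordering of source languages; one dimension per language.
SOURCE_LANGUAGES = ["En", "Es", "Fr"]

def encoder_indicator(active_languages):
    """Return a 0/1 vector marking which encoders processed the source."""
    active = set(active_languages)
    unknown = active - set(SOURCE_LANGUAGES)
    if unknown:
        raise ValueError("unknown source language(s): %s" % sorted(unknown))
    return [1 if lang in active else 0 for lang in SOURCE_LANGUAGES]

single_source = encoder_indicator(["Es"])        # one-to-one translation
multi_source = encoder_indicator(["Es", "Fr"])   # many-to-one translation
```

In the multi-source case two dimensions are set to 1, matching the description above.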
We train the following models: four single-pair models (Es↔En and Fr↔En) and one multi-way, multilingual model ({Es, Fr, En}→{Es, Fr, En}). As proposed by [Firat et al.2016], we share one attention mechanism for the latter case.
We closely follow the setup from [Firat et al.2016]. Each symbol is represented as a 620-dimensional vector. Any recurrent layer, be it in the encoder or decoder, consists of 1000 gated recurrent units (GRU, [Cho et al.2014]), and the attention mechanism has a hidden layer of 1200 units (the scoring function in Eq. (2)). We use Adam [Kingma and Ba2015] to train a model, and the gradient at each update is computed using a minibatch of at most 80 sentence pairs. The gradient is clipped to have a norm of at most 1 [Pascanu et al.2012]. We early-stop any training using the T-B score on a development set. (The T-B score is defined as (TER − BLEU)/2, which we found to be more stable than either TER or BLEU alone for the purpose of early-stopping [Zhao and Chen2009].)
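Clipping the gradient to a norm of at most 1 can be sketched in a few lines. The example below treats the gradient as a single flat list of floats, which is a simplification; the function name `clip_by_norm` is hypothetical.

```python
import math

def clip_by_norm(gradient, max_norm=1.0):
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)  # already small enough, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]

large = clip_by_norm([3.0, 4.0])   # norm 5 is rescaled to norm 1
small = clip_by_norm([0.3, 0.4])   # norm 0.5 is left untouched
```

Rescaling preserves the direction of the update and only limits its length, which is why it is commonly used to stabilize the training of recurrent networks.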
We first confirm that the multi-way, multilingual translation model indeed works as well as single-pair models on the translation paths that were considered during training, which was the major claim in [Firat et al.2016]. In Table 2, we present the results on four language pair-directions (Es↔En and Fr↔En).
It is clear that the multi-way, multilingual model indeed performs comparably in all four cases with fewer parameters (due to the shared attention mechanism). As observed earlier in [Firat et al.2016], we also see that the multilingual model performs better when the target language is English.
We consider translating from a pair of source sentences in Spanish (Es) and French (Fr) to English (En). It is important to note that the multilingual model was not trained with any multi-way parallel corpus. Despite this, we observe that the early averaging strategy improves the translation quality (measured in BLEU) by 3 points in the case of the test set (compare Table 2 (a–b) and Table 3 (a).) We conjecture that this happens as training the multilingual model has implicitly encouraged the model to find a common context vector space across multiple source languages.
The late averaging strategy however outperforms the early averaging in both cases of multilingual model and a pair of single-pair models (see Table 3 (b)) albeit marginally. The best quality was observed when the early and late averaging strategies were combined at the output level, achieving up to +3.5 BLEU (compare Table 2 (a) and Table 3 (c).)
We emphasize again that there was no multi-way parallel corpus consisting of Spanish, French and English during training. The result presented in this section shows that the multi-way, multilingual model can exploit multiple sources effectively without requiring any multi-way parallel corpus, and we will rely on this property together with the proposed many-to-one translation strategies in the later sections where we propose and investigate zero-resource translation.
The network architecture of the multi-way, multilingual model suggests the potential for translating between two languages without any direct parallel corpus available. In the setting considered in this paper (see Sec. 4.1), these translation paths correspond to Es↔Fr, as the only parallel corpora used for training were Es↔En and Fr↔En.
The most naive approach for translating along a zero-resource path is to simply treat it as any other path that was included as a part of training. This corresponds to the one-to-one strategy from Sec. 3.1. In our experiments, it however turned out that this naive approach does not work at all, as can be seen in Table 4 (a).
In this section, we investigate this potential of zero-resource translation with the multi-way, multilingual model in depth. More specifically, we propose a number of approaches that enable zero-resource translation without requiring any additional bilingual corpus.
The first set of approaches exploits the fact that the target zero-resource translation path can be decomposed into a sequence of high-resource translation paths [Wu and Wang2007, Utiyama and Isahara2007]. For instance, in our case, Es→Fr can be decomposed into a sequence of Es→En and En→Fr. In other words, we translate a source sentence (Es) into a pivot language (En) and then translate the English translation into a target language (Fr).
The most basic approach here is to perform each translation path in the decomposed sequence independently of the others. This one-to-one approach introduces only minimal computational overhead (a multiplicative factor of two). We can further improve this one-to-one pivot-based translation by maintaining a set of $k$-best translations from the first stage (Es→En), but this increases the overall computational complexity by a factor of $k$, making it impractical. We therefore focus only on the former approach of keeping the best pivot translation in this paper.
With the multi-way, multilingual model considered in this paper, we can extend the naive one-to-one pivot-based strategy by replacing the second stage (En→Fr) with many-to-one translation from Sec. 4.4, using both the original source language and the pivot language as a pair of source languages. We first translate the source sentence (Es) into English, and use both the original source sentence and the English translation (Es+En) to translate into the final target language (Fr).
Both approaches described and proposed above do not require any additional action on an already-trained multilingual model. They are simply different translation strategies specifically aimed at zero-resource translation.
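The two pivot-based strategies differ only in what the second stage receives, which the following sketch makes explicit. The translation paths are passed in as plain functions; the toy word-by-word translators below are hypothetical stand-ins for trained models and exist only to make the example runnable.

```python
def pivot_one_to_one(src_es, es_to_en, en_to_fr):
    """Translate Es->En, keep the single best pivot, then translate En->Fr."""
    pivot_en = es_to_en(src_es)
    return en_to_fr(pivot_en)

def pivot_many_to_one(src_es, es_to_en, es_en_to_fr):
    """Translate Es->En, then feed both Es and En to a two-source second stage."""
    pivot_en = es_to_en(src_es)
    return es_en_to_fr(src_es, pivot_en)

# Toy stand-ins for trained translation paths (hypothetical lookups).
ES_EN = {"el": "the", "gato": "cat"}
EN_FR = {"the": "le", "cat": "chat"}

def toy_es_to_en(sentence):
    return " ".join(ES_EN.get(word, word) for word in sentence.split())

def toy_en_to_fr(sentence):
    return " ".join(EN_FR.get(word, word) for word in sentence.split())

def toy_es_en_to_fr(src_es, pivot_en):
    # A real many-to-one decoder would combine both inputs (for example by
    # early or late averaging); this stand-in only uses the pivot side.
    return toy_en_to_fr(pivot_en)

one_to_one = pivot_one_to_one("el gato", toy_es_to_en, toy_en_to_fr)
many_to_one = pivot_many_to_one("el gato", toy_es_to_en, toy_es_en_to_fr)
```

Both functions run the first stage exactly once, so the many-to-one variant adds the cost of an extra encoder pass in the second stage but no additional decoding pass.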
The failure of the naive zero-resource translation earlier (see Table 4 (a)) suggests that the context vectors returned by the encoder are not compatible with the decoder when the combination was not included during training. The good translation quality of the translation paths included in training, however, implies that the representations learned by the encoders and decoders are good. Based on these two observations, we conjecture that all that is needed for a zero-resource translation path is a simple adjustment that makes the context vectors from the encoder compatible with the target decoder. Thus, we propose to adjust this zero-resource translation path, but without any additional parallel corpus.
First, we generate a small set of pseudo bilingual pairs of sentences for the zero-resource language pair (Es→Fr) of interest. We randomly select sentence pairs from a parallel corpus between the target language (Fr) and a pivot language (En) and translate the pivot side (En) into the source language (Es). Then, the pivot side is discarded, and we construct a pseudo parallel corpus consisting of sentence pairs of the source and target languages (Es-Fr).
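The construction of the pseudo parallel corpus can be sketched as follows. The En→Es translation path is passed in as a function; `toy_en_to_es`, the tiny corpus and the fixed seed are hypothetical and only serve to make the example self-contained.

```python
import random

def make_pseudo_parallel(en_fr_pairs, en_to_es, size, seed=0):
    """Sample (En, Fr) pairs, translate En into Es, and drop the pivot side."""
    rng = random.Random(seed)
    sampled = rng.sample(en_fr_pairs, size)
    return [(en_to_es(en), fr) for en, fr in sampled]

# Toy En-Fr corpus and a stand-in for the trained En->Es path (hypothetical).
EN_FR_CORPUS = [
    ("the cat", "le chat"),
    ("the dog", "le chien"),
    ("a house", "une maison"),
]

def toy_en_to_es(sentence):
    return "es:" + sentence  # placeholder for a real machine translation

pseudo_corpus = make_pseudo_parallel(EN_FR_CORPUS, toy_en_to_es, size=2)
```

Only the source side of each resulting pair is machine-generated; the target side remains a genuine French sentence, so the decoder is still trained towards human-written output.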
We make a copy of the existing attention mechanism, which we refer to as the target-specific attention mechanism. We then finetune only this target-specific attention mechanism using the generated pseudo parallel corpus, while keeping all the other parameters of the encoder and decoder intact. We do not update any other parameters in the encoder and decoder, because they are already well-trained (as evidenced by the high translation quality in Table 2) and we want to avoid disrupting the well-captured structures underlying each language.
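The idea of copying the attention parameters and updating only the copy can be sketched with plain dictionaries of parameter groups. This is a simplified illustration: the group names are hypothetical, the gradients are made up, and a plain gradient step is used here in place of the Adam optimizer used for the actual training.

```python
import copy

def add_target_specific_attention(params, name="attention_es_fr"):
    """Return parameters extended with an independent copy of the attention."""
    extended = dict(params)
    extended[name] = copy.deepcopy(params["attention"])
    return extended

def finetune_step(params, grads, trainable, lr=0.1):
    """Apply one gradient step to the trainable groups only (plain SGD)."""
    updated = {}
    for group, weights in params.items():
        if group in trainable:
            updated[group] = [w - lr * g for w, g in zip(weights, grads[group])]
        else:
            updated[group] = list(weights)  # frozen: copied unchanged
    return updated

# Toy parameter groups (hypothetical names and values).
params = {
    "encoder_es": [0.5, -0.25],
    "decoder_fr": [0.125, 0.25],
    "attention": [1.0, 2.0],
}
params = add_target_specific_attention(params)
grads = {group: [1.0] * len(weights) for group, weights in params.items()}
finetuned = finetune_step(params, grads, trainable={"attention_es_fr"})
```

Because the shared attention mechanism itself is left untouched, the translation paths seen during training behave exactly as before finetuning.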
Once the model has been finetuned with the pseudo parallel corpus, we can use any of the translation strategies described earlier in Sec. 3 for the finetuned zero-resource translation path. We expect a similar gain by using many-to-one translation, which we empirically confirm in the next section.
We use the same multi-way, multilingual model trained earlier in Sec. 4.2 to evaluate the zero-resource translation strategies. We emphasize here that this model was trained only using Es-En and Fr-En bilingual parallel corpora without any Es-Fr parallel corpus.
We evaluate the proposed approaches to zero-resource translation with the same multi-way, multilingual model from Sec. 4.1. We specifically select the path from Spanish to French (Es→Fr) as a target zero-resource translation path.
As mentioned earlier, we observed that the multi-way, multilingual model cannot directly translate between two languages when the translation path between those two languages was not included in training (Table 4 (a)). On the other hand, the model was able to translate decently with the pivot-based one-to-one translation strategy, as can be seen in Table 4 (b). Unsurprisingly, all the many-to-one strategies resulted in worse translation quality, which is due to the inclusion of the useless translation path (the direct path between the zero-resource pair, Es-Fr). These results clearly indicate that the multi-way, multilingual model trained with only bilingual parallel corpora is not capable of direct zero-resource translation as it is.
| | | Pseudo Parallel Corpus | True Parallel Corpus |
|---|---|---|---|
| (b) | No Finetuning | Dev: 20.64, Test: 20.4 | – |
The proposed finetuning strategy raises a number of questions. First, it is unclear how many pseudo sentence pairs are needed to achieve a decent translation quality. Because the purpose of this finetuning stage is simply to adjust the shared attention mechanism so that it can properly bridge from the source-side encoder to the target-side decoder, we expect it to work with only a small number of pseudo pairs. We validate this by creating pseudo corpora of different sizes: 1k, 10k, 100k and 1m.
Second, we want to know how detrimental it is to use the generated pseudo sentence pairs compared to using true sentence pairs between the target language pair. In order to answer this question, we compiled a true multi-way parallel corpus by combining the subsets of UN (7.8m), Europarl-v7 (1.8m), OpenSubtitles-2013 (1m), news-commentary-v7 (174k), LDC2011T07 (335k) and news-crawl (310k), and use it to finetune the model (see the last row of Table 1). This allows us to evaluate the effect of the pseudo and true parallel corpora on finetuning for zero-resource translation.
Lastly, we train single-pair models translating directly from Spanish to French by using the true parallel corpora. These models work as a baseline against which we compare the multi-way, multilingual models.
Unlike the usual training procedure described in Sec. 4.2, we compute the gradient for each update using 60 sentence pairs only, when finetuning the model with the multi-way parallel corpus (either pseudo or true.)
Table 5 summarizes all the results. The most important observation is that the proposed finetuning strategy with pseudo-parallel sentence pairs outperforms the pivot-based approach (using the early averaging strategy from Sec. 4.4) even when we used only 1,000 such pairs (compare (b) and (d)). As we increase the size of the pseudo-parallel corpus, we observe a clear improvement. Furthermore, these models perform comparably to or better than the single-pair model trained with 1M true parallel sentence pairs, although they never saw a single true bilingual sentence pair of Spanish and French (compare (a) and (d)). Even a single-pair model trained with 11m true parallel pairs, which achieved a translation quality of 24.26 BLEU on the test set, could not match the multilingual model finetuned with 1m true parallel pairs.
Another interesting finding is that it is only beneficial to use true parallel pairs for finetuning the multi-way, multilingual models when there are enough of them (1m or more). When there are only a small number of true parallel sentence pairs, we even found using pseudo pairs to be more beneficial than true ones. This effect is more apparent when the direct one-to-one translation of the zero-resource pair is considered (see (c) in Table 5). This implies that the misalignment between the encoder and decoder can be largely fixed by using pseudo-parallel pairs only, and we conjecture that it is easier to learn from pseudo-parallel pairs as they better reflect the inductive bias of the trained model. When there is a large number of true parallel sentence pairs available, however, our results indicate that it is better to exploit them.
Unlike what we observed with the multi-source translation in Sec. 3.2, we were not able to see any improvement by further averaging the early-averaged and late-averaged decoding schemes (compare (d) and (e)). This may be explained by the fact that the context vectors computed when creating a pseudo source (e.g., En from Es when translating Es→Fr) already contain all the information about the pseudo source. It is simply enough to take those context vectors into account via the early averaging scheme.
These results clearly indicate and verify the potential of the multi-way, multilingual neural translation model in performing zero-resource machine translation. More specifically, it has been shown that the translation quality can be improved even without any direct parallel corpus available, and if there is a small amount of direct parallel pairs available, the quality may improve even further.
There are two main results in this paper. First, we showed that the multi-way, multilingual neural translation model by [Firat et al.2016] is able to exploit common, underlying structures across many languages in order to better translate when a source sentence is given in multiple languages. This confirms the usefulness of positive language transfer, which has been believed to be an important factor in human language learning [Odlin1989, Ringbom2007], in machine translation. Furthermore, our result significantly expands the applicability of multi-source translation [Zoph and Knight2016], as it does not assume the availability of multi-way parallel corpora for training.
Second, the experiments on zero-resource translation revealed that it is not necessary to have a direct parallel corpus, or deep linguistic knowledge, between two languages in order to build a machine translation system. Importantly, we observed that the proposed approach to zero-resource translation is better both in terms of translation quality and data efficiency than the more traditional pivot-based translation [Wu and Wang2007, Utiyama and Isahara2007]. Considering that this is the first attempt at such zero-resource, or extremely low-resource, translation using neural machine translation, we expect substantial progress in the near future.
Despite the promising empirical results presented in this paper, there are a number of shortcomings that need to be addressed in follow-up research. First, our experiments have been done only with three European languages: Spanish, French and English. More investigation with a diverse set of languages needs to be done in order to draw a more solid conclusion, such as was done in [Firat et al.2016, Chung et al.2016]. Furthermore, the effect of varying sizes of available parallel corpora on the performance of zero-resource translation must be studied further in the future.
Second, although the proposed many-to-one translation is indeed generally applicable to any number of source languages, we have only tested a source sentence in two languages. We expect even higher improvement with more languages, but it must be tested thoroughly in the future.
Lastly, the proposed finetuning strategy requires the model to have an additional set of parameters relevant to the attention mechanism for a target, zero-resource pair. This implies that the number of parameters may grow linearly with respect to the number of target language pairs. We expect future research to address this issue by, for instance, mixing in the parallel corpora of high-resource language pairs during finetuning as well.
OF thanks Georgiana Dinu and Iulian Vlad Serban for insightful discussions. KC thanks the support by Facebook, Google (Google Faculty Award 2016) and NVidia (GPU Center of Excellence 2015-2016).