Multilingual NMT with a language-independent attention bridge

11/01/2018 ∙ by Raúl Vázquez, et al. ∙ 0

In this paper, we propose a multilingual encoder-decoder architecture capable of obtaining multilingual sentence representations by means of incorporating an intermediate attention bridge that is shared across all languages. That is, we train the model with language-specific encoders and decoders that are connected via self-attention with a shared layer that we call attention bridge. This layer exploits the semantics from each language for performing translation and develops into a language-independent meaning representation that can efficiently be used for transfer learning. We present a new framework for the efficient development of multilingual NMT using this model and scheduled training. We have tested the approach in a systematic way with a multi-parallel data set. We show that the model achieves substantial improvements over strong bilingual models and that it also works well for zero-shot translation, which demonstrates its ability of abstraction and transfer learning.




1 Introduction

Neural Machine Translation (NMT) provides a powerful approach to MT that achieves high translation accuracy and fluency due to its ability to capture long-distance dependencies and internal abstractions. Multilingual Machine Translation addresses the task of building a system that can translate between multiple languages, either by using a many-to-one approach with several source languages to translate into a single target language; a one-to-many, with a single source language to translate into different target languages; or many-to-many models that allow multiple languages on both sides (see e.g. Dong et al. (2015), Zoph and Knight (2016), Schwenk and Douze (2017)).

NMT certainly provides an ideal setting for multilingual MT because it can efficiently share the parameters of the model and take advantage of the various similarities found by the model in the hidden layers and embeddings Firat et al. (2016a); Johnson et al. (2016); Blackwood et al. (2018). Besides, multilingual NMT has the advantage of considerably improving the performance of neural translation systems for low-resourced languages Lakew et al. (2017) and it provides the possibility of translating between language pairs that were not seen during training Firat et al. (2016b), commonly called zero-shot translation Johnson et al. (2016).

NMT as first proposed with an encoder-decoder architecture Sutskever et al. (2014) allows for learning fixed-size sentence representations embedded in continuous vector spaces. Such representations are useful since they provide a means for testing downstream tasks, also enabling a deeper linguistic analysis and understanding of what the neural models are learning Conneau et al. (2018); Conneau and Kiela (2018); Augenstein et al. (2018). However, as those basic models have strong limitations and NMT advanced in complexity, becoming the de facto standard in machine translation, the aforementioned sentence representations were replaced by the use of attention mechanisms, i.e., context vectors attending to different parts of the input sentence while generating the output sentence Bahdanau et al. (2014), and self-attention that replaces recurrent layers in the encoder and decoder Vaswani et al. (2017). Nevertheless, it is also possible to create a fixed-size vector representation while still making use of the advantages of attention mechanisms by including a compound attention layer between encoder and decoder Cífka and Bojar (2018). This is an architecture that we adapt in our approach for a multilingual setting of translation.

In this paper we focus on models that allow the translation between many languages, where we outline the development of a language-independent representation based on an attention bridge that is shared across all languages. This is in contrast with previous attempts to obtain such a "neural interlingua" Lu et al. (2018), where the authors tested their model only in one-to-many and many-to-one scenarios. In order to do this, we propose an architecture based on shared self-attention for multilingual NMT with language-specific encoders and decoders that achieves results comparable to the current state-of-the-art architectures and can also address the task of obtaining language-independent sentence embeddings. Those embeddings are created from the encoder's self-attention and connect to the language-specific decoders that attend to them, hence the name bridge. We also add a penalty term to avoid redundancy in the shared layer. More details of the architecture are given in section 3.

2 Related Work

Multilingual NMT has been widely studied and developed along different pathways in recent years Luong et al. (2015); Dong et al. (2015); Chen et al. (2017); Johnson et al. (2016). Work has been done with networks that use language-specific encoders and decoders, such as Dong et al. (2015), who used a separate attention mechanism for each decoder on one-to-many translation. Zoph and Knight (2016) exploited a multi-way parallel corpus in a many-to-one multilingual scenario, while Firat et al. (2016a) used language-specific encoders and decoders that share a traditional attention mechanism in a many-to-many scheme. Another approach is the use of universal encoder-decoder networks that share embedding spaces to improve the performance of the model, like the one proposed by Gu et al. (2018) for improving translation on low-resourced languages and the one from Johnson et al. (2016), where the term zero-shot translation was coined.

Sentence meaning representation has also been widely studied in NMT settings. When introducing the encoder-decoder architectures for MT, Sutskever et al. (2014) showed that seq2seq models are better at encoding the meaning of sentences into vector spaces than the bag-of-words model. Recent work includes that of Schwenk and Douze (2017), who use multiple encoders and decoders that are connected through a shared layer, albeit with a different purpose than performing translation. In Platanios et al. (2018) the authors show an intermediate representation that can be decoded to any target language while describing a parameter generation method for universal NMT. Cífka and Bojar (2018) introduced an architecture with a self-attentive layer to extract sentence meaning representations of fixed size. Here we use a similar architecture in a multilingual setting.

Our work on multilingual MT and sentence representations is closely related to the recently published paper by Lu et al. (2018). There, the authors attempt to build a neural interlingua by using language independent encoders and decoders which share an attentive long short-term memory (LSTM) layer. Our approach differs because our model is able to encode any sequence of variable length into a fixed-size representation, without suffering from long-term dependency problems Lin et al. (2017) and without the need for padding in downstream task testing. Additionally, we also experiment in a multilingual many-to-many setting, instead of only one-to-many or many-to-one.

3 Model Architecture

In this section we introduce the proposed architecture. Given that we apply some simple modifications to the traditional attention mechanisms Bahdanau et al. (2014), we will first start by introducing it in its original formulation. After that, we proceed to introduce our architecture by building upon this theory.

In the following, it should be noted that the architecture is not restricted to RNN-based encoders and hence one could make use of CNN- or Transformer-based encoders. We made this choice for the sake of clarity in the formulation.

3.1 Background: Attention Mechanism

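Given an input $X = (x_1, \ldots, x_n)$, a sequence of embedded tokens into the vector space $\mathbb{R}^{d_x}$, our goal is to generate a translation $Y = (y_1, \ldots, y_m)$. The encoder is a recurrent neural network (RNN) that sequentially reads each element in $X$ to generate a context vector $c$. Generally, for each token $x_t$ the RNN generates a hidden state $h_t \in \mathbb{R}^{d_h}$, where the last hidden state of the RNN often defines $c$:

$$h_t = f(x_t, h_{t-1}) \tag{1}$$

$$c = h_n \tag{2}$$

where $f$ is a non-linear activation function. We use bidirectional LSTM units Hochreiter and Schmidhuber (1997) as $f$ in this paper.

Then, the decoder network sequentially computes $Y = (y_1, \ldots, y_m)$ by optimizing

$$p(Y \mid X) = \prod_{t=1}^{m} p(y_t \mid c, Y_{t-1}) \tag{3}$$

where $Y_{t-1} = (y_1, \ldots, y_{t-1})$. Each distribution $p_t = p(y_t \mid c, Y_{t-1}) \in \mathbb{R}^{d_v}$ is usually computed with a softmax function over all the words in the vocabulary, taking into account the current hidden state of the decoder $s_t$:

$$p_t = \mathrm{softmax}(y_{t-1}, s_t) \tag{4}$$

$$s_t = \varphi(c, y_{t-1}, s_{t-1}) \tag{5}$$

where $\varphi$ is another non-linear activation function and $d_v$ is the size of the vocabulary.

Including an attention mechanism in the decoder implies that a different context vector $c_t$ will be computed at each step $t$, instead of fixing $c$ as in equation (2) for generating all output words. This alignment method allows the decoder to assign different weights to each part of the input at every decoding step Bahdanau et al. (2014) by defining $c_t$ as the weighted sum of hidden states of the encoder $h_i$,

$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h_i \tag{6}$$

where $\alpha_{t,i}$ indicates how much the $i$-th input word contributes to generating the $t$-th output word, and is usually defined as

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})}, \qquad e_{t,i} = g(s_{t-1}, h_i) \tag{7}$$

and $g$ is a feedforward neural network.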
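As a rough illustration of the attention step just described, the following NumPy sketch computes one context vector as a softmax-weighted sum of encoder states. It assumes the additive scoring form of Bahdanau et al. (2014) for the feedforward network; the parameter names `W_h`, `W_s`, `v` and the toy dimensions are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_context(H, s_prev, W_h, W_s, v):
    """Return the context vector and the attention weights for one decoding step.

    H      : (n, d_h) encoder hidden states, one row per source token
    s_prev : (d_s,)   previous decoder hidden state
    W_h, W_s, v : parameters of a small feedforward scorer (assumed form)
    """
    # Score of every source position: v^T tanh(W_h h_i + W_s s_prev)
    scores = np.tanh(H @ W_h.T + s_prev @ W_s.T) @ v   # shape (n,)
    alpha = softmax(scores)                            # weights sum to 1
    context = alpha @ H                                # weighted sum, shape (d_h,)
    return context, alpha

# Toy example with random values (dimensions chosen only for illustration).
rng = np.random.default_rng(0)
n, d_h, d_s, d_a = 5, 8, 6, 4
H = rng.normal(size=(n, d_h))
s_prev = rng.normal(size=d_s)
W_h = rng.normal(size=(d_a, d_h))
W_s = rng.normal(size=(d_a, d_s))
v = rng.normal(size=d_a)

context, alpha = attention_context(H, s_prev, W_h, W_s, v)
```

The weights in `alpha` form a probability distribution over the source positions, so `context` always lies in the convex hull of the encoder states.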

3.2 Multilingual Model

The multilingual model we propose is a simple extension of the attention-based model previously described, with three major modifications. Namely, (i) the incorporation of a self-attention layer (attention bridge), shared among all language pairs, that serves as a neural interlingua; (ii) the use of language-specific encoders and decoders for each language pair, trainable with a language-rotating scheduler; and (iii) the introduction of a penalty term to avoid redundancy in the attention heads. We now describe these features formally.

(i) Attention bridge: The attention bridge must serve as an intermediate layer that encodes, as much as possible, language-independent sentence representations. For this we use the concept of self-attention that has recently been applied in similar ways Lin et al. (2017); Vaswani et al. (2017); Cífka and Bojar (2018); Tao et al. (2018). For simplicity, let us assume that we have computed some encoder states as in equation (1) so that we have a matrix

$$H = (h_1, \ldots, h_n) \in \mathbb{R}^{d_h \times n} \tag{8}$$

Similar to Lin et al. (2017), we encode this variable-length sentence-embedding matrix into a fixed-size matrix $M \in \mathbb{R}^{d_h \times r}$ capable of focusing on $r$ different components of the sentence, defined as follows:

$$A = \mathrm{softmax}\left(W_2 \, \mathrm{ReLU}(W_1 H)\right) \tag{9}$$

$$M = H A^\top \tag{10}$$

where $W_1 \in \mathbb{R}^{d_w \times d_h}$ and $W_2 \in \mathbb{R}^{r \times d_w}$ are weight matrices, $r$ is the number of attention heads (column vectors) in the attention bridge (matrix $M$), and $d_w$ is the dimension of the linear transformations needed to compute $A$. The softmax is applied row-wise, so each row of $A \in \mathbb{R}^{r \times n}$ is a distribution over the $n$ input positions. Notice that the attention bridge matrix $M$ has a fixed size that does not depend on the length $n$ of the input sentence $X$.
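The fixed-size property can be checked with a small NumPy sketch. It assumes the structured self-attention of Lin et al. (2017), with a ReLU hidden layer followed by a row-wise softmax, and it stores the encoder states as columns; the function name `attention_bridge` and the toy dimensions are hypothetical.

```python
import numpy as np

def softmax_rows(Z):
    """Row-wise softmax: every row of the result sums to 1."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def attention_bridge(H, W1, W2):
    """Map a variable-length state matrix to a fixed-size bridge matrix.

    H  : (d_h, n) encoder states as columns, n = sentence length
    W1 : (d_w, d_h) first linear transformation
    W2 : (r, d_w)   second linear transformation, r = number of heads
    Returns M of shape (d_h, r) and the attention matrix A of shape (r, n).
    """
    hidden = np.maximum(W1 @ H, 0.0)   # ReLU, shape (d_w, n)
    A = softmax_rows(W2 @ hidden)      # one distribution over positions per head
    M = H @ A.T                        # each column is one attention head
    return M, A

rng = np.random.default_rng(0)
d_h, d_w, r = 8, 16, 3
W1 = rng.normal(size=(d_w, d_h))
W2 = rng.normal(size=(r, d_w))

# Two sentences of different length give bridge matrices of identical shape.
shapes = []
for n in (4, 11):
    H = rng.normal(size=(d_h, n))
    M, A = attention_bridge(H, W1, W2)
    shapes.append(M.shape)
```

Because the sentence length only appears as the inner dimension of the final matrix product, the output shape depends on the state size and the number of heads alone.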

We then leverage the obtained sentence embedding by using an attention-based decoder over its components similar to Cífka and Bojar (2018). In Figure 1 we present a diagram from the same work adapted to our formulation. Formally, we only need to compute equations (6) and (7) using the columns of $M$ instead of the encoder states $h_i$.

Figure 1: An overview of the proposed model for one language pair, generating the $t$-th target word $y_t$ given a source sentence $X$. Adapted from Cífka and Bojar (2018).

Some ways of initializing the states of the decoder consist in using the last state of the encoder Bahdanau et al. (2014) or the average of the encoder states Sennrich et al. (2017), among others. Nevertheless, after comparing the results attained with them, we instead propose to use the average of the attention heads $m_i$, as follows

$$\bar{m} = \frac{1}{r} \sum_{i=1}^{r} m_i$$

where $m_i$ is the $i$-th column of $M$ and $r$ is the number of attention heads. This way, the decoder uses only information from the attention bridge.

(ii) Language-specific encoders and decoders: To deal with additional language pairs, the model we propose incorporates a neural network encoder for each input language and an attentive decoder for each output language. This adjusts the parameters of the attention bridge with multilingual information.

The diagram of the multilingual architecture illustrates the use of several encoders and decoders that are plugged in and out at every change of batch. During training it is important to shuffle the batches to avoid overfitting the attention bridge to one specific language pair; we have used a uniform distribution in this paper for this process.
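The plug-in-and-out behaviour can be pictured with a short Python sketch. It is only one possible reading of the scheduling: it assumes that the batches of all language pairs are pooled and shuffled uniformly, and it uses strings as placeholders for the real networks. The names `shuffled_schedule`, `encoders`, `decoders` and `bridge` are hypothetical and do not come from the authors' implementation.

```python
import random

def shuffled_schedule(batches_by_pair, seed=0):
    """Pool the batches of all language pairs and shuffle them uniformly."""
    tagged = [(src, tgt, batch)
              for (src, tgt), batches in sorted(batches_by_pair.items())
              for batch in batches]
    random.Random(seed).shuffle(tagged)
    return tagged

# Placeholder modules: strings stand in for real encoder/decoder networks.
languages = ("en", "de", "fr")
encoders = {lang: f"encoder[{lang}]" for lang in languages}
decoders = {lang: f"decoder[{lang}]" for lang in languages}
bridge = "shared attention bridge"

# Toy batches, keyed by (source language, target language).
toy_batches = {
    ("en", "de"): ["b0", "b1"],
    ("de", "en"): ["b2"],
    ("fr", "en"): ["b3", "b4"],
}

steps = []
for src, tgt, batch in shuffled_schedule(toy_batches):
    # Select the language-specific modules for this batch;
    # the bridge is the same object at every step.
    steps.append((encoders[src], bridge, decoders[tgt], batch))
```

Interleaving the pairs in this way means that consecutive updates of the shared layer usually come from different languages, which is the property the shuffling is meant to provide.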

(iii) Penalization term: The proposed attention bridge matrix $M$ from equation (10) could easily learn repetitive information for different attention heads. Since we want this representation to capture various components of a sentence, we introduce a penalty term in the model. Taking inspiration from recent work Lin et al. (2017); Tao et al. (2018), we use the squared Frobenius norm of the symmetric matrix $AA^\top - I$, with $A$ as defined in Eq. (9), as a redundancy measure, which is added to the loss function derived from Eq. (3). Hence, our loss function becomes

$$\mathcal{L} = -\log p(Y \mid X) + \left\| AA^\top - I \right\|_F^2$$

By incorporating this term into the loss function we force matrix $AA^\top$ to be similar to the identity matrix, that is, $AA^\top \approx I$. Additionally, considering the fact that the rows of $A$ sum to 1, with entries in $[0, 1]$ because each row approximates a discrete probability distribution, it follows that the columns of $A^\top$ (the attention distributions of the individual heads) will be forced to be approximately orthogonal, and hence we are penalizing redundancy.
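A minimal NumPy sketch of such a redundancy measure is given below. It assumes the penalty of Lin et al. (2017), the squared Frobenius norm of the head-by-head product of the attention matrix minus the identity; the function name `redundancy_penalty` and the two toy attention matrices are hypothetical.

```python
import numpy as np

def redundancy_penalty(A):
    """Squared Frobenius norm of (A A^T - I) for an (r, n) attention matrix A."""
    r = A.shape[0]
    G = A @ A.T - np.eye(r)
    return float(np.sum(G ** 2))

# Three heads over five positions, each head attending to a different token.
A_distinct = np.eye(3, 5)

# Three heads that all attend to the first token only.
A_redundant = np.tile(np.array([[1.0, 0.0, 0.0, 0.0, 0.0]]), (3, 1))

penalty_distinct = redundancy_penalty(A_distinct)
penalty_redundant = redundancy_penalty(A_redundant)
```

In this toy case the heads that look at disjoint positions receive a penalty of 0, while the three identical heads receive a penalty of 6 (six off-diagonal entries equal to 1), so minimizing the term pushes the heads towards different parts of the sentence.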

4 Experimental Setup

To test the proposed architecture, we conducted four translation experiments. We used the Multi30k dataset Elliott et al. (2016), a multi-parallel dataset containing 29k image captions for training and 1k sentences for validation in four European languages: Czech (cs), German (de), French (fr) and English (en). We tested the trained model with the Flickr 2016 test data of the same dataset and obtained BLEU scores using the sacreBLEU script Post (2018).

For each language, we used an encoder consisting of 2 stacked BiLSTMs of size $d_h = 512$, i.e., the hidden states for each direction are of size 256. The neural interlingua layer we used has $r = 10$ attention heads, with a hidden dimension of 512 and linear transformations of dimension $d_w$, which are needed to compute matrix $A$ from Eq. (9). The decoder consists of 2 stacked unidirectional LSTMs with hidden states of size 512. Further, the word embeddings we used share a single dimension $d_x$ for both the encoder input and the decoder output.

We used an SGD optimizer with a learning rate of 1.0 and batch size 64 and for each experiment we selected the best model on the development set.

We implemented our model on top of an OpenNMT-py Klein et al. (2017) fork, which we make available for reproducibility purposes.

5 Results

In the following, we present a number of experiments that demonstrate the capabilities of the attention bridge as an effective approach to multilingual MT. We first present our baselines, then we discuss many-to-one and one-to-many settings, and finally we summarize our experiments on many-to-many translation models.

5.1 Baselines

The first experiment we conducted was to corroborate that the proposed architecture works properly by examining its performance in a bilingual setting. We expect the models to drop slightly in performance due to the fixed-size attention bridge that we introduce instead of directly mapping from source to target language via cross-lingual attention links. However, we want to see whether the architecture is robust enough to carry over the essential information needed for translation with the inclusion of the additional intermediate abstraction layer.

In Table 1 we present a comparison of our architecture in contrast with a strong bilingual baseline consisting of an architecture with the specifications described in section 4, without the components of our model. The table presents the scores obtained for each of the 12 bilingual models trained on each language pair.

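Bilingual (rows: source language, columns: target language)
src   en     de     cs     fr
en    -      36.78  28.00  55.96
de    39.00  -      23.44  38.22
cs    35.89  28.98  -      36.44
fr    49.54  32.92  25.98  -

Bilingual + attention bridge (rows: source language, columns: target language)
src   en     de     cs     fr
en    -      35.85  27.10  53.03
de    38.19  -      23.97  37.40
cs    36.41  27.28  -      36.41
fr    48.93  31.70  25.96  -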

Table 1: Baseline models - comparison of BLEU scores obtained with bilingual models. All models share specifications, apart from the proposed changes to include the attention bridge layer for the second part of the table.

In this case we observe that the basic bilingual models without any attention bridge have a slightly better performance in almost every case. The biggest drop can be observed for English-French with a difference of almost 3 BLEU points (55.96 vs. 53.03), but this case is exceptional. For all other language pairs the difference stays below 2 BLEU points, and for most of them below 1 BLEU point.

This behaviour was expected from the fact that the information from the encoder has to be summarized in the 10 heads of the self-attention layer without (multilingual) information from other encoders to boost the states of this bridge. Nevertheless, these tests justify the validity of the architecture; namely, that the attention bridge does not cause a significant problem for the translation model in the bilingual case. We will use the results of both bilingual models with and without attention bridge as our baselines for the comparison to the multilingual models that we describe below.
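The per-pair differences between the two parts of Table 1 can be recomputed with a few lines of Python. The scores below are copied from the table; the helper name `differences` and the dictionary layout are hypothetical conveniences for this sketch.

```python
LANGS = ["en", "de", "cs", "fr"]

# BLEU scores from Table 1; rows are source languages, columns follow LANGS.
bilingual = {
    "en": [None, 36.78, 28.00, 55.96],
    "de": [39.00, None, 23.44, 38.22],
    "cs": [35.89, 28.98, None, 36.44],
    "fr": [49.54, 32.92, 25.98, None],
}
with_bridge = {
    "en": [None, 35.85, 27.10, 53.03],
    "de": [38.19, None, 23.97, 37.40],
    "cs": [36.41, 27.28, None, 36.41],
    "fr": [48.93, 31.70, 25.96, None],
}

def differences(base, other):
    """BLEU(base) - BLEU(other) for every ordered language pair."""
    diffs = {}
    for src in LANGS:
        for j, tgt in enumerate(LANGS):
            if src != tgt:
                diffs[(src, tgt)] = round(base[src][j] - other[src][j], 2)
    return diffs

diffs = differences(bilingual, with_bridge)

# Pair with the largest drop when the attention bridge is added.
largest_drop = max(diffs, key=diffs.get)

# Pairs where the attention bridge model is actually better.
bridge_wins = sorted(pair for pair, d in diffs.items() if d < 0)

# Pairs whose drop exceeds one BLEU point.
above_one = sorted(pair for pair, d in diffs.items() if d > 1.0)
```

On these numbers the largest drop is English to French, the bridge model is ahead for Czech to English and German to Czech, and three pairs lose more than one BLEU point (English to French, Czech to German and French to German).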

5.2 Many-To-One and One-To-Many Models

The power of the attention bridge comes from its ability to share information across various language pairs. We now look at the effect of including additional languages during training on the translation performance of individual language pairs. We start by training models that include many-to-one and one-to-many settings with English as target and source, respectively. This setup makes it possible to study the ability of zero-shot translation, i.e. the translation between languages that have not been seen together in the training data. Consequently, looking at zero-shot translation, we can test the abstraction capabilities of the attention bridge.

For the first experiment, we trained a {De,Fr,Cs}↔En model using the many-to-one and one-to-many strategy as discussed above. As depicted in Table 2, this attempt already resulted in substantial improvements for the language pairs seen during training. The model exceeds both bilingual baselines from the previous section in all of these but French to English. However, the model is completely incapable of performing zero-shot translations. We believe that this inability of the model to generalize to unseen language pairs arises from the fact that every non-English encoder (or decoder) only learned to process information that was to be decoded into English (or encoded from English input). This finding is consistent with the results of Lu et al. (2018).

{de,fr,cs}↔en (rows: source language, columns: target language)
src   en     de     cs     fr
en    -      37.85  29.51  57.87
de    39.39  -      0.35   0.83
cs    37.20  0.65   -      1.02
fr    48.49  0.60   0.30   -

{de,fr,cs}↔en + monolingual (rows: source language, columns: target language)
src   en     de     cs     fr
en    -      38.92  30.27  57.87
de    40.17  -      19.50  26.46
cs    37.30  22.13  -      22.80
fr    50.41  25.96  20.09  -

Table 2: BLEU scores obtained for models trained on {De,Fr,Cs}↔En. Zero-shot translation (shaded cells, i.e. the pairs that do not involve English) achieves noteworthy translation quality only when incorporating monolingual data during training.

In order to address this problem, we incorporate monolingual data in training, that is, training for each available language with identical copies of the input sentence as the target. In other words, no additional data was included during training, but we reincorporate examples from the same parallel training corpus used in all other experiments. As a consequence, we see a remarkable increase in the BLEU scores, including a substantial boost for the language pairs not seen during training. In short, the monolingual data informs the model that other languages can be produced besides English, and that English is not the unique source language.
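The construction of these monolingual examples can be sketched in Python as follows. The sketch assumes that every distinct sentence of the parallel corpus is reused once as a source with an identical target; the tuple layout, the function name `add_monolingual_examples` and the toy sentences are hypothetical.

```python
def add_monolingual_examples(parallel):
    """Extend a parallel corpus with identity (copy) examples.

    parallel : list of (src_lang, tgt_lang, src_sentence, tgt_sentence)
    Returns the original examples followed by one (lang, lang, s, s)
    example for every distinct sentence s seen in language lang.
    """
    augmented = list(parallel)
    seen = set()
    for src_lang, tgt_lang, src, tgt in parallel:
        for lang, sentence in ((src_lang, src), (tgt_lang, tgt)):
            if (lang, sentence) not in seen:
                seen.add((lang, sentence))
                augmented.append((lang, lang, sentence, sentence))
    return augmented

# Toy corpus in the many-to-one / one-to-many setting with English as pivot.
toy_corpus = [
    ("de", "en", "ein Hund", "a dog"),
    ("en", "de", "a dog", "ein Hund"),
    ("fr", "en", "un chien", "a dog"),
]

augmented = add_monolingual_examples(toy_corpus)
directions = {(src, tgt) for src, tgt, _, _ in augmented}
```

No new sentences enter the training data; the only change is that each decoder now also sees inputs produced by the encoder of its own language, and vice versa.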

Additionally, there is a positive effect on the seen language pairs, the cause of which is not immediately evident. One possibility may be that the shared layer acquires additional information that can be included in the abstraction process yet is not available to the other models.

5.3 Many-to-Many Models

To further examine the capabilities of the proposed architecture we conducted two experiments under a many-to-many scenario.

First, we trained six different models where we included all but one of the available language pairs. We then tested our models while also performing bidirectional zero-shot translations for the unseen language pairs. The accompanying BLEU plots summarize these results, where we report the scores of the zero-shot translation for each source and target language. The figure also shows the mean and standard deviation of the BLEU scores obtained by the remaining 5 models that did see the respective source and target languages during training. We observe that the zero-shot translation scores are generally better than the ones from the previous {De,Fr,Cs}↔En model with monolingual data, even though in this set of experiments we did not include monolingual data.
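The six training configurations of this leave-one-out experiment can be enumerated with a short Python sketch. It assumes that a held-out language pair is removed in both translation directions, which matches the bidirectional zero-shot evaluation described above; the function name `leave_one_pair_out` and the dictionary layout are hypothetical.

```python
from itertools import combinations, permutations

LANGS = ["en", "de", "cs", "fr"]

def leave_one_pair_out(langs):
    """One configuration per unordered language pair held out of training."""
    all_directions = set(permutations(langs, 2))   # 12 ordered pairs
    configs = {}
    for a, b in combinations(langs, 2):            # 6 unordered pairs
        held_out = {(a, b), (b, a)}
        configs[(a, b)] = {
            "train": sorted(all_directions - held_out),
            "zero_shot": sorted(held_out),
        }
    return configs

configs = leave_one_pair_out(LANGS)
```

With four languages there are six unordered pairs, so each of the six models is trained on ten translation directions and evaluated zero-shot on the remaining two.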

Finally, we also tested the architecture in a many-to-many setting with all language pairs included.

Table 3 summarizes the results of our experiments. As in the previous case, we compare settings that include monolingual data with their counterparts that do not include it.

First, the inclusion of additional language pairs results in improved performance when compared to the bilingual baselines, as well as the many-to-one and one-to-many cases, the only exception being the En→Fr task. Moreover, the addition of monolingual data during training leads to even higher scores, producing the overall best model. The absolute improvements in BLEU range from 1.40 to 4.43 compared to the standard bilingual model.

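m-2-m (rows: source language, columns: target language)
src   en     de     cs     fr
en    -      37.70  29.67  55.78
de    40.68  -      26.78  41.07
cs    38.42  31.07  -      40.27
fr    49.92  34.63  26.92  -

m-2-m + monolingual (rows: source language, columns: target language)
src   en     de     cs     fr
en    -      38.48  30.47  57.35
de    41.82  -      26.90  41.49
cs    39.58  31.51  -      40.87
fr    50.94  35.25  28.80  -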

Table 3: The multilingual model also gets a boost when incorporating monolingual data during training.

We performed additional tests using Transformer-based encoders. These results were, however, unsatisfactory, with BLEU scores well below their RNN counterparts for all language pairs. The inability of these models to obtain good translation quality might arise from the fact that the Multi30k dataset is rather small, making the Transformer prone to overfitting due to its large number of parameters.

6 Conclusion

In this work we propose a multilingual NMT architecture with three modifications to the common attentive encoder-decoder architecture. By introducing language-specific encoders and decoders, a shared language-independent attention bridge and a penalization term that forces this layer to attend to different semantic structures of the input sentence, we develop a strong multilingual translation system that efficiently incorporates transfer learning and can also tackle the task of learning multilingual sentence representations. The attention bridge consists of a self-attention layer shared among all languages that can be regarded as a neural interlingua due to its capacity for abstracting and handling encoded multilingual information.

Furthermore, we performed four different experiments to demonstrate the capabilities of the attention bridge architecture as an effective approach to multilingual MT. The results obtained consistently outperform a strong bilingual model, which suggests that the attention bridge layer has the ability to efficiently share parameters in a multilingual setting. The inclusion of monolingual data during training resulted in boosted scores for all cases.

Future work stemming from these results includes downstream-testing the sentence meaning representations produced by the shared attention bridge to verify its generalization capabilities. In addition, training models on larger datasets as well as reporting the effects of using non-multiparallel datasets would expand the scope of this work.


This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 771113).

We thank the participants who contributed to the project we led during the 13th MT Marathon in Prague. We are particularly grateful to Chris Hokamp, whose help was crucial during that time. Finally, we would also like to acknowledge NVIDIA and their GPU grant.