
Parameter Differentiation based Multilingual Neural Machine Translation

12/27/2021
by   Qian Wang, et al.

Multilingual neural machine translation (MNMT) aims to translate multiple languages with a single model and has proven successful thanks to effective knowledge transfer among different languages with shared parameters. However, it is still an open question which parameters should be shared and which ones need to be task-specific. Currently, the common practice is to heuristically design or search for language-specific modules, which makes it difficult to find the optimal configuration. In this paper, we propose a novel parameter differentiation based method that allows the model to determine which parameters should be language-specific during training. Inspired by cellular differentiation, each shared parameter in our method can dynamically differentiate into more specialized types. We further define the differentiation criterion as inter-task gradient similarity, so that parameters with conflicting inter-task gradients are more likely to be language-specific. Extensive experiments on multilingual datasets demonstrate that our method significantly outperforms various strong baselines with different parameter sharing configurations. Further analyses reveal that the parameter sharing configuration obtained by our method correlates well with linguistic proximity.


1 Introduction

Neural machine translation (NMT) has achieved great success and drawn much attention in recent years Sutskever et al. (2014); Bahdanau et al. (2015); Vaswani et al. (2017). While conventional NMT handles the translation of a single language pair well, training an individual model for each language pair is resource-consuming, considering there are thousands of languages in the world. Therefore, multilingual NMT has been developed to handle multiple language pairs with one model, greatly reducing the cost of offline training and online deployment Ha et al. (2016); Johnson et al. (2017). Moreover, the parameter sharing in multilingual neural machine translation encourages positive knowledge transfer among different languages and benefits low-resource translation Zhang et al. (2020); Siddhant et al. (2020).

Figure 1: The illustration of parameter differentiation. Each task represents a translation direction, e.g., EN→DE. (a) The model is initialized as completely shared, (b) the model detects parameters that should be more specialized during training, and (c) the shared parameters differentiate into more specialized types.

Despite the benefits of joint training with a completely shared model, the MNMT model also suffers from insufficient model capacity Arivazhagan et al. (2019); Lyu et al. (2020). The shared parameters tend to preserve general knowledge but ignore language-specific knowledge. Therefore, researchers resort to heuristically designing additional language-specific components and building MNMT models with a mix of shared and language-specific parameters to increase the model capacity Sachan and Neubig (2018); Wang et al. (2019b), such as language-specific attention Blackwood et al. (2018), lightweight language adapters Bapna and Firat (2019) or language-specific routing layers Zhang et al. (2021). These methods model both the general knowledge and the language-specific knowledge but require specialized manual design. Another line of work for language-specific modeling aims to automatically search for language-specific sub-networks Xie et al. (2021); Lin et al. (2021) by pretraining an initial large model that covers all translation directions, followed by sub-network pruning and fine-tuning. These methods involve multi-stage training, and it is non-trivial to determine the initial model size and structure.

In this study, we propose a novel parameter differentiation based method that enables the model to automatically determine which parameters should be shared and which ones should be language-specific during training. Inspired by cellular differentiation, a process in which a cell changes from a general cell type to a more specialized type, our method allows each parameter shared by multiple tasks to dynamically differentiate into more specialized types. As shown in Figure 1, the model is initialized as completely shared and continuously detects shared parameters that should be language-specific. These parameters are then duplicated and reallocated to different tasks to increase the language-specific modeling capacity. The differentiation criterion is defined as inter-task gradient similarity, which represents the consistency of optimization directions across tasks on a shared parameter. Therefore, parameters facing conflicting inter-task gradients are selected for differentiation, while parameters with more similar inter-task gradients remain shared. Overall, the MNMT model in our method gradually improves its parameter sharing configuration without multi-stage training or manually designed language-specific modules.

We conduct extensive experiments on three widely used multilingual datasets including OPUS, WMT and IWSLT in multiple MNMT scenarios: one-to-many, many-to-one and many-to-many translation. The experimental results prove the effectiveness of the proposed method over various strong baselines. Our main contributions can be summarized as follows:

  • We propose a method that can automatically determine which parameters in an MNMT model should be language-specific without manual design, and can dynamically change shared parameters into more specialized types.

  • We define the differentiation criterion as the inter-task gradient similarity, which helps to minimize the inter-task interference on shared parameters.

  • We show that the parameter sharing configuration obtained by our method is highly correlated with linguistic features like language families.

2 Background

The Transformer Model

A typical Transformer model Vaswani et al. (2017) consists of an encoder and a decoder, each built as a stack of identical layers. Each encoder layer contains two modules: multi-head self-attention and a feed-forward network. Each decoder layer contains three modules, inserting an additional multi-head cross-attention between the self-attention and feed-forward modules.
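To make the structure concrete, the following is a minimal PyTorch-style sketch of a single encoder layer as described above (self-attention followed by a feed-forward network, each wrapped with a residual connection and layer normalization). The dimensions are illustrative transformer_base-style values, not a faithful reimplementation of the model used in this paper.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # multi-head self-attention module
        x = self.norm1(x + attn_out)            # residual connection + layer norm
        return self.norm2(x + self.ffn(x))      # feed-forward module

layer = EncoderLayer()
out = layer(torch.randn(2, 10, 512))            # (batch, sequence length, d_model)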

Multilingual Neural Machine Translation

The standard paradigm of MNMT uses a completely shared model, borrowed from bilingual translation, for all language pairs. A special language token indicating the target language is attached to the source text Johnson et al. (2017). MNMT training is often framed as multi-task optimization, in which a task corresponds to a translation direction, e.g., EN→DE.
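As a brief illustration, the sketch below shows this target-language tagging; the token format "<2xx>" is an assumption for illustration and may differ from the exact convention used in practice.

def tag_source(src_sentence: str, tgt_lang: str) -> str:
    # Prepend a target-language token so one shared model can serve all directions.
    return f"<2{tgt_lang}> {src_sentence}"

# Example: the EN->DE task
print(tag_source("How are you?", "de"))  # "<2de> How are you?"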

3 Parameter Differentiation based MNMT

Our main idea is to identify the shared parameters in an MNMT model that should be language-specific and to dynamically change them into more specialized types during training. To achieve this, we propose a novel parameter differentiation based MNMT approach and define the differentiation criterion as inter-task gradient similarity.

Input: training data D, tasks T = {T_1, ..., T_N}, per-task models M_1, ..., M_N
// Initialize as a completely shared model
M_1 = M_2 = ... = M_N
while not converged do
    Train the model with the data D
    // Detect parameters to differentiate
    flagged = []
    for each unit θ in the shared parameters of the model do
        Evaluate θ with the differentiation criterion
        if θ should be language-specific then
            Add θ to flagged
        end if
    end for
    // Reallocate parameters
    for each θ in flagged, shared by the tasks T_θ, do
        Split T_θ into two subsets T_a and T_b
        Duplicate θ into replicas θ_a and θ_b
        Replace θ with θ_a in the models of the tasks in T_a and with θ_b in the models of the tasks in T_b
    end for
end while
Algorithm 1: Parameter Differentiation

3.1 Parameter Differentiation

Cellular differentiation is the process in which a cell changes from one cell type to another, typically from a less specialized type (a stem cell) to a more specialized type (an organ- or tissue-specific cell) Slack (2007). Inspired by this process, we propose parameter differentiation, which dynamically changes the task-agnostic parameters of an MNMT model into more task-specific ones during training.

Algorithm 1 lists the overall process of our method. We first initialize a completely shared MNMT model following the paradigm of Johnson et al. (2017). After training for several steps, the model evaluates each shared parameter and flags the parameters that should become more specialized under a certain differentiation criterion (the detection loop in Algorithm 1). For the flagged parameters, the model then duplicates them and reallocates the replicas to different tasks. After the duplication and reallocation, the model builds new connections for the replicas to construct a different computation graph for each task (the reallocation loop in Algorithm 1). In the following training steps, the parameters allocated to a task are only updated with the training data of that task. Differentiation is performed every several training steps, so the model dynamically becomes more specialized.
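The following PyTorch-style sketch illustrates the duplication-and-reallocation step on a single linear projection. It assumes a simplified setting in which each task looks up its modules through a per-task mapping; the class and method names (SharedLinear, differentiate) are illustrative, not the authors' implementation.

import copy
import torch
import torch.nn as nn

class SharedLinear(nn.Module):
    # A linear projection whose weights can differentiate into task-specific replicas.
    def __init__(self, d_in, d_out, tasks):
        super().__init__()
        shared = nn.Linear(d_in, d_out)
        # Initially every task points to the same (shared) module.
        self.replicas = nn.ModuleDict({t: shared for t in tasks})

    def differentiate(self, tasks_a, tasks_b):
        # Duplicate the current shared module into two replicas and reallocate them,
        # so the two task groups are optimized independently afterwards.
        replica_a = copy.deepcopy(self.replicas[tasks_a[0]])
        replica_b = copy.deepcopy(self.replicas[tasks_b[0]])
        for t in tasks_a:
            self.replicas[t] = replica_a
        for t in tasks_b:
            self.replicas[t] = replica_b

    def forward(self, x, task):
        return self.replicas[task](x)

layer = SharedLinear(512, 512, tasks=["en-de", "en-fr", "en-ru"])
layer.differentiate(["en-de", "en-fr"], ["en-ru"])   # en-ru now has its own replica
out = layer(torch.randn(8, 512), task="en-ru")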

Figure 2: The illustration of parameter differentiation with gradient cosine similarity. The shared parameter differentiates into two replicas: the tasks whose gradients are more similar share one replica, while the conflicting task receives the other. The figure also marks the global optimum of the parameter on each task.

3.2 The Differentiation Criterion

The key issue in parameter differentiation is the definition of the differentiation criterion, which detects the shared parameters that should differentiate into more specialized types. We define the differentiation criterion based on inter-task gradient cosine similarity, so that parameters facing conflicting gradients are more likely to become language-specific.

As shown in Figure 2, the parameter is shared by three tasks at the beginning. To determine whether the shared parameter should be more specialized, we first define the interference degree of the parameter shared by these tasks with the inter-task gradient cosine similarity. More formally, suppose the i-th parameter \theta_i in an MNMT model is shared by a set of tasks \mathcal{T}_i; the interference degree of the parameter is defined by:

D(\theta_i) = \max_{s, t \in \mathcal{T}_i} \big[ -\cos\big(g^i_s, g^i_t\big) \big]    (1)

where g^i_s and g^i_t are the gradients of tasks s and t, respectively, on the parameter \theta_i.

Intuitively, the gradients determine the optimization directions. For example, in Figure 2 each task's gradient indicates the direction of the global optimum for that task. Gradients with maximum negative cosine similarity point in opposite directions, which hinders the optimization and has been shown to be detrimental for multi-task learning Yu et al. (2020); Wang et al. (2021).
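A hedged sketch of computing this interference degree is given below, assuming the criterion is the maximum negative pairwise cosine similarity among the per-task gradients of one shared parameter (or differentiation unit); gradients are represented as flattened tensors.

from itertools import combinations
from typing import Dict
import torch
import torch.nn.functional as F

def interference_degree(task_grads: Dict[str, torch.Tensor]) -> float:
    # task_grads maps a task name to the flattened gradient of the shared parameter.
    worst = float("-inf")
    for s, t in combinations(task_grads, 2):
        cos = F.cosine_similarity(task_grads[s].flatten(), task_grads[t].flatten(), dim=0)
        worst = max(worst, -cos.item())   # conflicting gradient pairs yield large values
    return worst

grads = {"en-de": torch.randn(64), "en-fr": torch.randn(64), "en-ru": torch.randn(64)}
print(interference_degree(grads))        # higher means more inter-task conflict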

The gradients of each task on each shared parameter are evaluated on held-out validation data. To minimize the gradient variance caused by inconsistent sentence semantics across languages, the validation data is created as multi-way aligned, i.e., each sentence has translations in all languages. With the held-out validation data, we evaluate the gradient of each task on each shared parameter and calculate the inter-task gradient similarities as well as the interference degree of each parameter.

The interference degree helps the model find the parameters that face severe interference, and the parameters with high interference degrees are flagged for differentiation. Suppose the parameter \theta_i shared by the tasks \mathcal{T}_i is flagged; we cluster the tasks in \mathcal{T}_i into two subsets \mathcal{T}_a and \mathcal{T}_b that minimize the overall interference. The partition is obtained by:

(\mathcal{T}_a, \mathcal{T}_b) = \operatorname{arg\,min}_{\mathcal{T}_a \cup \mathcal{T}_b = \mathcal{T}_i,\; \mathcal{T}_a \cap \mathcal{T}_b = \varnothing} \big[ D(\theta_i \mid \mathcal{T}_a) + D(\theta_i \mid \mathcal{T}_b) \big]    (2)

where D(\theta_i \mid \mathcal{T}) denotes the interference degree of Eq. (1) computed only over the tasks in \mathcal{T}.

As shown in Figure 2, the gradients of two of the tasks are similar, while the gradient of the third task conflicts with them. By minimizing the overall interference degree, the first two tasks are clustered into one subset and the third task into the other. The parameter is then duplicated into two replicas, which are allocated to the two subsets respectively.
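A brute-force sketch of this partition step is shown below; it enumerates all bipartitions of the tasks sharing a parameter, which is feasible because each parameter is shared by only a handful of tasks. The objective (the summed within-group interference degree) is an assumption consistent with Eq. (2) as reconstructed above.

from itertools import combinations
from typing import Dict, List, Tuple
import torch
import torch.nn.functional as F

def group_interference(grads: Dict[str, torch.Tensor], tasks: List[str]) -> float:
    # Interference degree restricted to a subset of tasks; zero for singleton groups.
    if len(tasks) < 2:
        return 0.0
    return max(-F.cosine_similarity(grads[s].flatten(), grads[t].flatten(), dim=0).item()
               for s, t in combinations(tasks, 2))

def best_partition(grads: Dict[str, torch.Tensor]) -> Tuple[List[str], List[str]]:
    tasks = list(grads)
    others = tasks[1:]
    best, best_score = None, float("inf")
    # Fixing tasks[0] on side A enumerates every bipartition exactly once.
    for r in range(len(others)):
        for subset in combinations(others, r):
            side_a = [tasks[0], *subset]
            side_b = [t for t in others if t not in subset]
            score = group_interference(grads, side_a) + group_interference(grads, side_b)
            if score < best_score:
                best, best_score = (side_a, side_b), score
    return best

grads = {"en-de": torch.randn(64), "en-fr": torch.randn(64), "en-ru": torch.randn(64)}
print(best_partition(grads))   # the two task groups that minimize the overall interference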

3.3 The Differentiation Granularity

Granularity   Examples of Differentiation Units
Layer         encoder layer, decoder layer
Module        self-attention, feed-forward, cross-attention
Operation     linear projection, layer normalization
Table 1: Examples of differentiation units under different granularities.

In theory, each shared parameter can differentiate into more specialized types individually. But in practice, performing differentiation on every single parameter is resource- and time-consuming, considering there are millions to billions of parameters in an MNMT model.

Therefore, we resort to different levels of differentiation granularity, such as Layer, Module, or Operation. As shown in Table 1, the Layer granularity indicates the different layers in the model, while the Module granularity specifies the individual modules within a layer. The Operation granularity includes the basic transformations in the model that contain trainable parameters. With a certain granularity, the parameters are grouped into different differentiation units. For example, with the Layer-level granularity, the parameters within a layer are concatenated into a vector and differentiate together, where the vector is referred to as a differentiation unit. We list the differentiation units in the supplementary materials.
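As a rough illustration, differentiation units can be derived by grouping the model's parameters by name prefix; the depth used for each granularity below is an assumption for illustration and need not match the exact grouping listed in the supplementary materials.

from collections import defaultdict
import torch.nn as nn

def group_units(model: nn.Module, granularity: str) -> dict:
    # Approximate the three granularities by truncating parameter names
    # (e.g. "encoder.layers.0.self_attn.in_proj_weight") at different depths.
    depth = {"layer": 3, "module": 4, "operation": 5}[granularity]
    units = defaultdict(list)
    for name, param in model.named_parameters():
        unit = ".".join(name.split(".")[:depth])
        units[unit].append(param)   # all parameters in one unit differentiate together
    return units

model = nn.Transformer(d_model=512, nhead=8, num_encoder_layers=6, num_decoder_layers=6)
print(len(group_units(model, "module")), "differentiation units at Module granularity")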

3.4 Training

In our method, since the model architecture dynamically changes and results in a different computation graph for each task, we create batches from the multilingual dataset and ensure that each batch contains only samples from one task. This differs from the training of the vanilla completely shared MNMT model, where each batch may contain sentence pairs from different languages Johnson et al. (2017). Specifically, we first sample a task and then sample a batch from the training data of that task. The model, which includes a mix of shared and language-specific parameters, is then trained with this batch.
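A minimal sketch of this per-task batching is given below; uniform task sampling is assumed here for simplicity, whereas the WMT experiments in this paper use temperature-based sampling.

import random

def sample_batch(train_data: dict, batch_size: int):
    # train_data maps a task name to a list of (source, target) sentence pairs.
    task = random.choice(list(train_data))              # pick one translation direction
    batch = random.sample(train_data[task], batch_size)
    return task, batch                                  # this batch uses a single computation graph

data = {"en-de": [("hello", "hallo")] * 100, "en-fr": [("hello", "bonjour")] * 100}
task, batch = sample_batch(data, batch_size=8)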

We train the model with the Adam optimizer Kingma and Ba (2015), which computes adaptive learning rates based on the optimization trajectory of past steps. However, the optimization history becomes inaccurate for the differentiated parameters. For the example in Figure 2, a differentiated replica is updated only by its own subset of tasks, while the optimization history it inherits reflects the trajectory of all the tasks. To stabilize the training of a replica \theta_a on its tasks \mathcal{T}_a, we reinitialize the optimizer states by performing a warm-up update for the differentiated parameters:

m_{\theta_a} \leftarrow \beta_1 m_{\theta} + (1-\beta_1)\, g_a, \qquad v_{\theta_a} \leftarrow \beta_2 v_{\theta} + (1-\beta_2)\, g_a^2    (3)

where m_{\theta} and v_{\theta} are the Adam states of the original shared parameter \theta, \beta_1 and \beta_2 are the Adam decay rates, and g_a is the gradient of the tasks \mathcal{T}_a on the held-out validation data. Note that we only update the states in the Adam optimizer; the parameters themselves remain unchanged in the warm-up update step.
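The sketch below shows one way such a warm-up could be applied to PyTorch's Adam state for a newly created replica; the moment update mirrors the reconstruction in Eq. (3) above, and the internal state keys ("exp_avg", "exp_avg_sq", "step") follow PyTorch's Adam implementation, which may change across versions.

import torch

def warmup_adam_state(optimizer: torch.optim.Adam, parent: torch.nn.Parameter,
                      replica: torch.nn.Parameter, val_grad: torch.Tensor,
                      beta1: float = 0.9, beta2: float = 0.999):
    # Register the replica with the optimizer, then initialize its Adam moments from
    # the parent's state plus one moment update with the validation gradient of the
    # replica's tasks. Only optimizer state is touched; parameter values stay unchanged.
    optimizer.add_param_group({"params": [replica]})
    parent_state = optimizer.state[parent]          # assumes the parent has already been trained
    step = parent_state["step"]
    state = optimizer.state[replica]
    state["step"] = step.clone() if torch.is_tensor(step) else step
    state["exp_avg"] = beta1 * parent_state["exp_avg"] + (1 - beta1) * val_grad
    state["exp_avg_sq"] = beta2 * parent_state["exp_avg_sq"] + (1 - beta2) * val_grad.pow(2)

# Usage (hypothetical names): after duplicating `parent` into `replica` and computing
# `val_grad` on the held-out validation data of the replica's tasks:
# warmup_adam_state(optimizer, parent, replica, val_grad)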

4 Experiments

4.1 Dataset

We use the public OPUS and WMT multilingual datasets to evaluate our method on many-to-one (M2O) and one-to-many (O2M) translation scenarios, and the IWSLT datasets for the many-to-many (M2M) translation scenario.

The OPUS dataset consists of English paired with 12 languages selected from the original OPUS-100 dataset Zhang et al. (2020). These languages, each with the same amount of training data, come from six distinct language groups: Romance (French, Italian), Baltic (Latvian, Lithuanian), Uralic (Estonian, Finnish), Austronesian (Indonesian, Malay), West-Slavic (Polish, Czech) and East-Slavic (Ukrainian, Russian).

The WMT dataset, with an unbalanced data distribution, is collected from the WMT'14, WMT'16 and WMT'18 benchmarks. We select five languages covering a wide range of data sizes. The training data sizes and sources are shown in Table 2. We report the results on the WMT dataset with temperature-based sampling Arivazhagan et al. (2019).

We evaluate our method in the many-to-many scenario with the IWSLT'17 dataset, which includes German, English, Italian, Romanian, and Dutch, resulting in 20 translation directions among the five languages, each with a comparable amount of training data.

The held-out multi-way aligned validation data for measuring gradient similarities contains a small set of sentences for each language, randomly selected and excluded from the training set. We apply the byte-pair encoding (BPE) algorithm Sennrich et al. (2016) to build the subword vocabularies, with separate vocabulary sizes for the OPUS/WMT datasets and the IWSLT dataset.

Language Pair Data Source #Samples
English-French (EN-FR) WMT’14 M
English-Czech (EN-CS) WMT’14 M
English-German (EN-DE) WMT’14 M
English-Estonian (EN-ET) WMT’18 M
English-Romanian (EN-RO) WMT’16 M
Table 2: Training data sizes and sources for the unbalanced WMT dataset.
Languages FR↔EN IT↔EN LV↔EN LT↔EN ET↔EN
Direction
Baselines Bilingual Vaswani et al. (2017) 28.90 28.27 22.55 25.55 31.60 39.75 28.88 36.43 18.65 25.48
Multilingual Johnson et al. (2017) 27.33 28.31 21.20 27.09 30.00 40.10 27.69 37.15 20.08 30.09
Random Sharing 27.48 28.91 21.42 27.18 31.57 41.18 28.94 37.57 20.43 30.15
Tan et al. (2019) 27.39 29.21 21.97 26.77 31.85 42.71 29.27 39.34 21.40 29.79
Sachan and Neubig (2018) 28.04 29.31 22.86 27.86 32.04 41.43 28.47 38.14 21.41 30.30
Ours PD w. Layer 29.35 30.09 22.37 28.7 32.31 42.11 29.5 39.04 20.56 30.91
PD w. Module 29.09 30.09 22.49 28.64 31.86 41.60 29.53 39.04 21.25 31.11
PD w. Operation 29.26 30.11 23.01 28.6 33.06 42.38 29.94 39.54 20.89 31.14
Languages FI↔EN ID↔EN MS↔EN PL↔EN CS↔EN
Direction
Baselines Bilingual Vaswani et al. (2017) 13.92 18.34 21.29 25.61 16.75 21.24 13.46 19.05 16.82 25.27
Multilingual Johnson et al. (2017) 15.58 21.43 22.85 28.27 18.12 23.66 14.87 22.24 18.57 28.14
Random Sharing 16.01 21.30 21.69 27.78 17.13 23.73 15.23 21.97 18.40 28.21
Tan et al. (2019) 16.15 21.46 22.74 28.00 18.12 23.14 14.86 21.72 18.02 28.08
Sachan and Neubig (2018) 16.37 21.36 22.39 29.60 17.33 23.77 15.75 22.45 19.70 28.59
Ours PD w. Layer 16.42 22.37 22.89 29.28 18.35 24.88 16.07 23.11 19.29 29.31
PD w. Module 16.44 22.85 22.94 28.86 17.62 24.27 16.18 23.12 19.33 29.08
PD w. Operation 16.59 22.85 23.09 29.03 18.61 25.27 16.45 23.34 19.46 29.66
Languages UK↔EN RU↔EN Average Δ vs. Multilingual Model Size
Direction
Baselines Bilingual Vaswani et al. (2017) 10.06 18.68 21.63 26.61 20.38 25.86 -0.36 -2.22 12x 12x
Multilingual Johnson et al. (2017) 11.59 21.76 20.96 28.76 20.74 28.08 0 0 1x 1x
Random Sharing 11.57 21.83 21.36 28.91 20.93 28.23 +0.19 +0.15 1.98x 2.00x
Tan et al. (2019) 11.32 21.74 21.32 28.73 21.20 28.39 +0.46 +0.31 2x 2x
Sachan and Neubig (2018) 10.96 21.88 22.28 28.80 21.47 28.62 +0.73 +0.54 3.71x 3.25x
Ours PD w. Layer 12.32 22.68 22.82 30.37 21.85 29.40 +1.11 +1.32 2.14x 1.84x
PD w. Module 12.55 22.44 22.31 30.39 21.80 29.29 +1.06 +1.21 1.82x 1.94x
PD w. Operation 12.37 23.05 22.98 30.60 22.14 29.63 +1.40 +1.55 1.96x 1.90x
Table 3: BLEU scores on the OPUS dataset. We compare our method with different levels of parameter sharing in both the one-to-many (EN→X) and many-to-one (X→EN) directions. We report our parameter differentiation (PD) method with different granularities: Layer, Module and Operation. Bold indicates the best result of all methods.

4.2 Model Settings

We conduct our experiments with the Transformer architecture and adopt the transformer_base setting, which includes 6 encoder and 6 decoder layers, 512/2048 model/feed-forward hidden dimensions, and 8 attention heads. Dropout and label smoothing are applied during training but disabled during validation and inference. Each mini-batch contains a roughly constant number of tokens, and we accumulate gradients over several steps (with different accumulation for OPUS and WMT) to simulate multi-GPU training. In inference, we use beam search with a fixed beam size and length penalty. We measure translation quality by BLEU score Papineni et al. (2002) with SacreBLEU (https://github.com/mjpost/sacrebleu). All the models are trained and tested on a single Nvidia V100 GPU.
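For reference, corpus-level BLEU with SacreBLEU can be computed as in the short sketch below; the evaluation options (tokenization, casing, etc.) are left at their defaults here and may differ from the exact configuration used for the reported scores.

import sacrebleu

hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]   # one reference stream, aligned with the hypotheses
print(sacrebleu.corpus_bleu(hypotheses, references).score)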

Our method allows the parameters to differentiate into specialized types by duplication and reallocation, which, with unlimited differentiation, could eventually result in separate bilingual models, i.e., each parameter shared by only one task in the final model. To prevent over-specialization and to make a fair comparison, we set a differentiation upper bound defined by the expected final model size S_E and let the model control the number of parameters (denoted as N_d) to differentiate at each round (since the parameters are grouped into differentiation units under a certain granularity, these values may fluctuate to comply with the granularity):

N_d = \frac{S_E - S_0}{K}    (4)

where S_0 is the size of the original completely shared model and K is the number of differentiation rounds during training. The total number of training steps is fixed across all experiments, and differentiation is performed after every fixed interval of training steps. We set the expected model size S_E to twice the size of the original model, and we also analyze the relationship between model size and translation quality by varying S_E over a range of values.
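Under the even-spreading assumption in Eq. (4) above (which may not be the paper's exact schedule), the per-round budget can be computed as in this small sketch:

def units_per_round(num_units: int, size_multiplier: float, num_rounds: int) -> int:
    # Number of differentiation units to duplicate at each round so that the final
    # model size reaches roughly size_multiplier times the original.
    extra_units = int(num_units * (size_multiplier - 1.0))   # total extra replicas to create
    return max(extra_units // num_rounds, 0)

# e.g. 400 units, a 2x target size, and 10 differentiation rounds -> 40 units per round
print(units_per_round(400, 2.0, 10))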

4.3 Baseline Systems

We compare our method with several baseline methods with different paradigms of parameter sharing.

Bilingual trains an individual Transformer model Vaswani et al. (2017) for each translation direction, resulting in one model per direction.

Multilingual adopts the standard paradigm of MNMT that all parameters are shared across tasks Johnson et al. (2017).

Random Sharing selects parameters for differentiation randomly (with Operation granularity) instead of using inter-task gradient similarity.

Sachan and Neubig (2018) uses a partially shared model that has proved effective empirically. They share the attention keys and queries of the decoder, the embeddings, and the encoder in the one-to-many model. We extend this setting to the many-to-one model by sharing the attention keys and queries of the encoder, the embeddings, and the decoder.

Tan et al. (2019) first clusters the languages using the language embedding vectors of the Multilingual method and then trains one model for each cluster. To make the model size comparable with our method, we set the number of clusters to two and train two distinct models. In our experiment on the OPUS dataset, this method results in two clusters: {FR, IT, ID, MS, PL, CS, UK, RU} and {LV, LT, ET, FI}.

Languages FR↔EN CS↔EN DE↔EN ET↔EN RO↔EN Average Sizes
Direction
Bilingual 39.87 37.74 27.23 31.43 26.71 31.98 17.55 23.26 23.13 29.23 26.90 30.73 5x 5x
Multilingual 38.07 36.23 25.39 30.77 24.67 31.54 18.90 26.09 26.42 34.85 26.69 31.90 1x 1x
Ours 40.28 37.36 26.75 32.92 27.29 32.80 19.66 27.64 27.34 35.90 28.26 33.32 1.83x 1.87x
Table 4: Results on the WMT dataset. Our method is parameter differentiation with Operation granularity. Bold indicates the best result among the multilingual models, while the overall best results are underlined.
DE EN IT NL RO Average
DE - 33.09 / 34.07 20.10 / 21.05 22.18 / 22.35 18.20 / 19.00 23.39 / 24.12
EN 26.55 / 27.86 - 27.74 / 28.69 28.15 / 28.61 25.34 / 26.58 26.95 / 27.94
IT 19.32 / 20.37 32.14 / 32.99 - 20.05 / 20.47 19.28 / 20.10 22.70 / 23.48
NL 21.06 / 22.09 32.54 / 33.45 19.81 / 20.64 - 17.93 / 18.81 22.84 / 23.75
RO 20.74 / 21.39 34.78 / 35.76 22.96 / 23.53 20.87 / 21.04 - 24.84 / 25.43
Average 21.92 / 22.93 33.14 / 34.07 22.65 / 23.48 22.81 / 23.12 20.19 / 21.12 30.08 / 31.18
Table 5: The many-to-many translation results on the IWSLT dataset. Our parameter differentiation method uses the Operation granularity. We compare our method with the Multilingual method and report the results in the format Multilingual / Ours. Bold indicates the better result.

4.4 Results

OPUS Table 3 shows the results of our method and the baseline methods on the OPUS dataset. In both one-to-many and many-to-one directions, our methods consistently outperform the Bilingual and Multilingual baselines, improving over the Multilingual baseline by up to +1.40 and +1.55 average BLEU in the two directions. Compared to other parameter sharing methods, our method achieves the best results in most translation directions and improves the average BLEU by a large margin. As for the different granularities in our method, we find that the Operation level achieves the best results on average, owing to its finer-grained control of parameter differentiation compared to the Layer and Module levels.

For the model sizes, the method of Sachan and Neubig (2018), which pre-defines the shared modules, grows linearly with the number of languages involved and results in a larger model (3.71x / 3.25x). In our method, the model size is unrelated to the number of languages, which provides more scalability and flexibility. Since we use different granularities instead of performing differentiation on every single parameter, the actual sizes of our models range from 1.82x to 2.14x, close but not equal to the predefined 2x.

WMT We further investigate the generalization performance with experiments on the unbalanced WMT dataset. As shown in Table 4, the Multilingual model benefits the lower-resource languages (ET, RO) but hurts the performance of the higher-resource languages (FR, CS, DE). In contrast, our method gains more improvement on the higher-resource languages (e.g., +2.21 for French) than on the lower-resource languages (e.g., +1.05 for Romanian). Our method also outperforms the Bilingual method in 8 of the 10 translation directions.

IWSLT The results in the many-to-many translation scenario on the IWSLT dataset are shown in Table 5. Our method with Operation-level granularity outperforms the Multilingual baseline in all translation directions, but the average improvement is less significant than on the other two datasets. The reason is that the languages in the IWSLT dataset all belong to the Indo-European language family, so the shared parameters may be sufficient for modeling all translation directions.

4.5 Analyses

Parameter Differentiation Across Layers

Using a shared encoder for one-to-many translation and a shared decoder for many-to-one translation has proved effective and is widely used Zoph and Knight (2016); Dong et al. (2015); Sachan and Neubig (2018). However, there is a lack of analysis of different sharing strategies across layers. The parameter differentiation method provides more fine-grained control of parameter sharing, making such an analysis possible. To investigate parameter sharing across layers, we count the number of differentiation units within each layer of the final model trained with Operation-level granularity. For comparison, the completely shared model has a fixed number of differentiation units in each encoder layer and each decoder layer (see details in the supplementary materials).

The results are shown in Figure 3. For many-to-one translation, the task-specific parameters are mainly distributed in the shallower layers of the encoder, while the parameters in the decoder tend to stay shared. On the contrary, for one-to-many translation, the decoder has more task-specific parameters than the encoder. Unlike the encoder, where the shallower layers are slightly more task-specific, in the decoder both the shallower and the deeper layers are more task-specific than the middle layers. The reason is that the shallower layers of the decoder take tokens from multiple languages as input, while the deeper layers are responsible for generating tokens in multiple languages.

Figure 3: The number of differentiation units within each layer of the final model.

Parameter Differentiation and Language Family

We investigate the correlation between the parameter sharing obtained by differentiation and the language families. Intuitively, linguistically similar languages are more likely to share parameters. To verify this, we first select encoder.layer-0.self-attention.value-projection, which differentiates the most times and is thus the most specialized unit, and then analyze its differentiation process during training.

Figure 4 shows the differentiation process of this most specialized parameter. Across training steps, we find that differentiation happens aperiodically for this parameter. As for the differentiation results, the parameter sharing strategy is clearly highly correlated with linguistic proximity, such as language family or language branch. For example, ID and MS both belong to the Austronesian group and share the parameter, while ID and FR, belonging to the Austronesian and Romance groups respectively, have task-specific parameters. Another interesting observation is that the Baltic languages (LV and LT) become specialized at an early stage of training. We examine the OPUS dataset and find that the training data of LV and LT are mainly from the political domain, while the other languages are mainly from the spoken domain.

Figure 4: The differentiation process of the parameter group encoder.layer-0.self-attention.value-projection. Parameters are shared across the languages within the same square, and the colors represent linguistic proximity.
Figure 5: The correlation between model size and the average BLEU over all language pairs on the OPUS dataset in both one-to-many and many-to-one directions.

The Effect of Model Size

We notice that the model size is not completely correlated with performance according to the results in Table 3. Our method initializes the model as completely shared (with a model size of 1x) and may, in extreme cases, differentiate into separate bilingual models. The completely shared model tends to preserve general knowledge, while the bilingual models only capture language-specific knowledge. To investigate the effect of the differentiation level, we evaluate the relationship between model size and translation quality.

As shown in Figure 5, the performance first increases with a higher differentiation level (a larger model size) and then decreases once the model grows beyond a certain threshold. The best results are obtained at different model sizes for the one-to-many and many-to-one directions, with the one-to-many direction favoring a larger model, which indicates that the model needs more parameters to handle multiple target languages (one-to-many) than multiple source languages (many-to-one).

5 Related Work

Multilingual neural machine translation (MNMT) aims at handling translation between multiple languages with a single model Dabre et al. (2020). In the early stage, researchers shared individual modules such as the encoder Dong et al. (2015), the decoder Zoph and Knight (2016), or the attention mechanism Firat et al. (2016) to reduce the parameter scale relative to separate bilingual models. The success of module sharing motivated more aggressive parameter sharing that handles all languages with a completely shared model Johnson et al. (2017); Ha et al. (2016).

Despite its simplicity, the completely shared model faces capacity bottlenecks for retaining the specific knowledge of each language Aharoni et al. (2019). Researchers resort to language-specific modeling with various parameter sharing strategies Sachan and Neubig (2018); Wang et al. (2019b, 2018), such as language-specific attention modules Wang et al. (2019a); Blackwood et al. (2018); He et al. (2021), decoupled encoders or decoders Escolano et al. (2021), additional adapters Bapna and Firat (2019), and language clustering Tan et al. (2019).

Instead of augmenting the model with manually designed language-specific modules, researchers have attempted to search for a language-specific sub-space of the model, such as generating the language-specific parameters from global ones Platanios et al. (2018), language-aware model depth Li et al. (2020), language-specific routing paths Zhang et al. (2021) and language-specific sub-networks Xie et al. (2021); Lin et al. (2021). These methods start from a large model that covers all translation directions, where the size and structure of the initial model are non-trivial to determine. In contrast, our method initializes a simple shared model and lets it automatically grow into a more complicated one, which provides more scalability and flexibility.

6 Conclusion and Future Work

In this paper, we propose a novel parameter differentiation based method that can automatically determine which parameters should be shared and which ones should be language-specific. The shared parameters can dynamically differentiate into more specialized types during training. Extensive experiments on three multilingual machine translation datasets verify the effectiveness of our method. The analyses reveal that the parameter sharing configurations obtained by our method are highly correlated with linguistic proximity. In the future, we plan to let the model learn when to stop differentiation and to explore other differentiation criteria for more multilingual scenarios, such as zero-shot translation and incremental multilingual translation.

7 Acknowledgement

The research work described in this paper has been supported by the Natural Science Foundation of China under Grant No. U1836221 and 62122088, and also by the Beijing Academy of Artificial Intelligence (BAAI).

References

  • R. Aharoni, M. Johnson, and O. Firat (2019) Massively multilingual neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 3874–3884. External Links: Link, Document Cited by: §5.
  • N. Arivazhagan, A. Bapna, O. Firat, D. Lepikhin, M. Johnson, M. Krikun, M. X. Chen, Y. Cao, G. F. Foster, C. Cherry, W. Macherey, Z. Chen, and Y. Wu (2019) Massively multilingual neural machine translation in the wild: findings and challenges. CoRR abs/1907.05019. External Links: Link, 1907.05019 Cited by: §1, §4.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §1.
  • A. Bapna and O. Firat (2019) Simple, scalable adaptation for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 1538–1548. External Links: Link, Document Cited by: §1, §5.
  • G. W. Blackwood, M. Ballesteros, and T. Ward (2018) Multilingual neural machine translation with task-specific attention. In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018, E. M. Bender, L. Derczynski, and P. Isabelle (Eds.), pp. 3112–3122. External Links: Link Cited by: §1, §5.
  • R. Dabre, C. Chu, and A. Kunchukuttan (2020) A survey of multilingual neural machine translation. ACM Comput. Surv. 53 (5), pp. 99:1–99:38. External Links: Link, Document Cited by: §5.
  • D. Dong, H. Wu, W. He, D. Yu, and H. Wang (2015) Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pp. 1723–1732. External Links: Link, Document Cited by: §4.5, §5.
  • C. Escolano, M. R. Costa-jussà, J. A. R. Fonollosa, and M. Artetxe (2021) Multilingual machine translation: closing the gap between shared and language-specific encoder-decoders. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), pp. 944–948. External Links: Link Cited by: §5.
  • O. Firat, K. Cho, and Y. Bengio (2016) Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 866–875. External Links: Link, Document Cited by: §5.
  • T. Ha, J. Niehues, and A. H. Waibel (2016) Toward multilingual neural machine translation with universal encoder and decoder. In 13th International Workshop on Spoken Language Translation 2016, pp. 5. Cited by: §1, §5.
  • H. He, Q. Wang, Z. Yu, Y. Zhao, J. Zhang, and C. Zong (2021) Synchronous interactive decoding for multilingual neural machine translation. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, pp. 12981–12988. External Links: Link Cited by: §5.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. B. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Trans. Assoc. Comput. Linguistics 5, pp. 339–351. External Links: Link Cited by: §1, §2, §3.1, §3.4, §4.3, Table 3, §5.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §3.4.
  • X. Li, A. C. Stickland, Y. Tang, and X. Kong (2020) Deep transformers with latent depth. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §5.
  • Z. Lin, L. Wu, M. Wang, and L. Li (2021) Learning language specific sub-network for multilingual machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), pp. 293–305. External Links: Link, Document Cited by: §1, §5.
  • S. Lyu, B. Son, K. Yang, and J. Bae (2020) Revisiting modularized multilingual NMT to meet industrial demands. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), pp. 5905–5918. External Links: Link, Document Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.2.
  • E. A. Platanios, M. Sachan, G. Neubig, and T. M. Mitchell (2018) Contextual parameter generation for universal neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 425–435. External Links: Link, Document Cited by: §5.
  • D. S. Sachan and G. Neubig (2018) Parameter sharing methods for multilingual self-attentional translation models. In Proceedings of the Third Conference on Machine Translation: Research Papers, WMT 2018, Belgium, Brussels, October 31 - November 1, 2018, O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Post, L. Specia, M. Turchi, and K. Verspoor (Eds.), pp. 261–271. External Links: Link, Document Cited by: §1, §4.3, §4.4, §4.5, Table 3, §5.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, External Links: Link, Document Cited by: §4.1.
  • A. Siddhant, M. Johnson, H. Tsai, N. Ari, J. Riesa, A. Bapna, O. Firat, and K. Raman (2020) Evaluating the cross-lingual effectiveness of massively multilingual neural machine translation. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pp. 8854–8861. External Links: Link Cited by: §1.
  • J. M. Slack (2007) Metaplasia and transdifferentiation: from pure biology to the clinic. Nature reviews Molecular cell biology 8 (5), pp. 369–378. Cited by: §3.1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pp. 3104–3112. External Links: Link Cited by: §1.
  • X. Tan, J. Chen, D. He, Y. Xia, T. Qin, and T. Liu (2019) Multilingual neural machine translation with language clustering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 963–973. External Links: Link, Document Cited by: §4.3, Table 3, §5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §2, §4.3, Table 3.
  • Y. Wang, J. Zhang, F. Zhai, J. Xu, and C. Zong (2018) Three strategies to improve one-to-many multilingual translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 2955–2960. External Links: Link, Document Cited by: §5.
  • Y. Wang, J. Zhang, L. Zhou, Y. Liu, and C. Zong (2019a) Synchronously generating two languages with interactive decoding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 3348–3353. External Links: Document, Link Cited by: §5.
  • Y. Wang, L. Zhou, J. Zhang, F. Zhai, J. Xu, and C. Zong (2019b) A compact and language-sensitive multilingual translation method. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 1213–1223. External Links: Link, Document Cited by: §1, §5.
  • Z. Wang, Y. Tsvetkov, O. Firat, and Y. Cao (2021) Gradient vaccine: investigating and improving multi-task optimization in massively multilingual models. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §3.2.
  • W. Xie, Y. Feng, S. Gu, and D. Yu (2021) Importance-based neuron allocation for multilingual neural machine translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), pp. 5725–5737. External Links: Link, Document Cited by: §1, §5.
  • T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn (2020) Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §3.2.
  • B. Zhang, A. Bapna, R. Sennrich, and O. Firat (2021) Share or not? learning to schedule language-specific capacity for multilingual translation. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §1, §5.
  • B. Zhang, P. Williams, I. Titov, and R. Sennrich (2020) Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 1628–1639. External Links: Link, Document Cited by: §1, §4.1.
  • B. Zoph and K. Knight (2016) Multi-source neural translation. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pp. 30–34. External Links: Link, Document Cited by: §4.5, §5.