Transfer learning has been shown to work well in Computer Vision, where components pre-trained on ImageNet [Krizhevsky, Sutskever, and Hinton2012] are used to initialize models for other tasks [Yosinski et al.2014]. In most cases, the other tasks are related to and share architectural components with the ImageNet task, enabling the use of such pre-trained models for feature extraction. With this transfer capability, improvements have been obtained on other image classification datasets, and on other tasks such as object detection, action recognition, and image segmentation [Huh, Agrawal, and Efros2016]. Analogously, we propose a method to transfer a pre-trained component, the multilingual encoder from an NMT system, to other NLP tasks.
In NLP, initializing word embeddings with pre-trained word representations obtained from Word2Vec [Mikolov et al.2013] or GloVe [Pennington, Socher, and Manning2014] has become a common way of transferring information from large unlabeled data to downstream tasks. Recent work has further shown that we can improve significantly over this approach by considering representations in context, i.e. modeled depending on the sentences that contain them, either by taking the outputs of an encoder in MT [McCann et al.2017] or by obtaining representations from the internal states of a bi-directional Language Model (LM) [Peters et al.2018]. There has also been successful recent work in transferring sentence representations from resource-rich tasks to improve resource-poor tasks [Conneau et al.2017]; however, most of the above transfer learning examples have focused on transferring knowledge across tasks for a single language, English.
Cross-lingual or multilingual NLP, the task of transferring knowledge from one language to another, serves as a good test bed for evaluating various transfer learning approaches. For cross-lingual NLP, the most widely studied approach is to use multilingual embeddings as features in neural network models. However, research has shown that representations learned in context are more effective [McCann et al.2017, Peters et al.2018]; therefore, we aim to do better than using multilingual embeddings alone for cross-lingual tasks. Recent progress in multilingual NMT provides a compelling opportunity for obtaining contextualized multilingual representations, as multilingual NMT systems are capable of generalizing to an unseen language direction, i.e. zero-shot translation. There is also evidence that the encoder of a multilingual NMT system learns language-agnostic, universal interlingua representations, which can be further exploited [Johnson et al.2017].
In this paper, we focus on using the representations obtained from a multilingual NMT system to enable cross-lingual transfer learning on downstream NLP tasks. Our contributions are three-fold:
We show that by simply reusing the encoder of a multilingual NMT system trained to translate from English to French (EnFr) and from French to English (FrEn), we can significantly improve the performance over the baseline in three downstream tasks - Amazon Reviews, Stanford Sentiment Treebank (SST), and Stanford Natural Language Inference (SNLI).
We demonstrate that our approach is able to perform zero-shot classification, i.e. performing classification in a new language, e.g. French, without ever seeing any French classification data during training. Our approach obtains surprisingly high zero-shot classification accuracy in French on all three tasks.
We carefully analyze how and why cross-lingual knowledge transfer works, and study the effect of various factors on zero-shot classification performance.
2 Related Work
Word and Sentence Representations.
Pre-trained word and sentence representations have been shown to be a key ingredient in many standard NLP tasks. The tasks include sentiment analysis [Socher et al.2013], entailment [Bowman et al.2015], summarization [Nallapati, Xiang, and Zhou2016], question answering [Liu et al.2018], and semantic role labeling [He et al.2017]. However, these representations are usually learned from unsupervised data sources which are often unrelated to the downstream task.
Several studies have addressed the fact that these representations are context-independent by proposing contextualized word embeddings. Representations obtained from an LM have been shown to yield effective contextualized word representations [Peters et al.2017, Peters et al.2018]. There has also been work on enriching these word representations using sub-word information [Wieting et al.2016, Bojanowski et al.2017]. MT naturally lends itself as a suitable task for obtaining contextualized embeddings, since the encoder has to encode units in context so as to decode them into another language. Hill et al. (2017) show the effectiveness of representations obtained from an NMT model in semantic similarity tasks, and further report that the representations obtained from the NMT model are better than those obtained from LMs. McCann et al. (2017) showed that using the representations obtained from the encoder of an NMT system as context vectors in downstream NLP tasks significantly improves performance over using only unsupervised word or character-gram vectors.
Finally, there has been a large body of work on obtaining transferable sentence representations. Conneau et al. (2017) obtain representations from the supervised SNLI task and show that these are effective for transferring to other tasks. Their method outperforms other similar approaches to obtaining representations, such as FastSent [Hill, Cho, and Korhonen2016] and SkipThought [Kiros et al.2015]. Arora, Liang, and Ma (2017) show that a simple average of word embeddings is competitive with more complex methods like SkipThought representations.
Cross-lingual or Multilingual Representations.
Previous approaches to cross-lingual or multilingual representations have fallen into three categories. Obtaining representations from word-level alignments, such as bilingual dictionaries or automatically generated word alignments, is the most popular approach [Mikolov et al.2013, Faruqui and Dyer2014, Zou et al.2013]. The second category of methods leverages document-level alignment, such as parallel Wikipedia articles, to generate cross-lingual representations [Søgaard et al.2015, Vulić and Moens2016].
The final category of methods uses sentence-level alignments, in the form of parallel translation data, to obtain cross-lingual representations. Hermann and Blunsom (2014) propose a deep neural model named BiCVM, which compares two sentence representations at the final layer and forces them into the same intermediate sentence representation. BilBOWA [Gouws, Bengio, and Corrado2015] is a simpler model which extends skip-gram with negative sampling [Mikolov et al.2013] to optimize each word’s similarity with its context in both the current language and the other, parallel language. Luong, Pham, and Manning (2015) also propose obtaining cross-lingual representations using a similar approach. Ammar et al. (2016) propose two algorithms, multiCluster and multiCCA, for learning multilingual representations from a set of bilingual lexical data.
Our paper aims to combine the best of both worlds by learning contextualized representations which are multilingual in nature. We demonstrate that using the encoder from a multilingual NMT system as a pre-trained component in other downstream NLP tasks improves performance in English and also enables cross-lingual transfer learning in French.
3 Proposed Method
We propose an Encoder-Classifier model, where the Encoder, leveraging the representations learned by a multilingual NMT model, converts an input sequence into a set of vectors C, and the Classifier predicts a class label given the encoding of the input sequence, C.
3.1 Multilingual Representations Using NMT
Although there has been a large body of work in building multilingual NMT models which can translate between multiple languages at the same time [Luong et al.2016, Dong et al.2015, Firat, Cho, and Bengio2016, Johnson et al.2017], zero-shot capabilities of such multilingual representations have only been tested for MT [Johnson et al.2017]. We propose a simple yet effective solution - reuse the encoder of a multilingual NMT model to initialize the encoder for other NLP tasks. To be able to achieve promising zero-shot classification performance, we consider two factors: (1) The ability to encode multiple source languages with the same encoder and (2) The ability to learn language agnostic representations of the source sequence. Based on the literature, both requirements can be satisfied by training a multilingual NMT model having a shared encoder [Lee, Cho, and Hofmann2017, Johnson et al.2017], and a separate decoder and attention mechanism for each target language [Dong et al.2015]. After training such a multilingual NMT model, the decoder and the corresponding attention mechanisms (which are target-language specific) are discarded, while the multilingual encoder is used to initialize the encoder of our proposed Encoder-Classifier model.
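The setup above can be sketched with a toy example. This is a minimal numpy stand-in, not the paper's actual LSTM-based system: the class names, dimensions, and the single-layer recurrence are illustrative assumptions only, but the structure mirrors the text (one shared encoder, one decoder per target language, decoders discarded after NMT training).

```python
import numpy as np

rng = np.random.default_rng(0)

class Encoder:
    """Shared multilingual encoder: one embedding table and one RNN stack
    serve all source languages (toy stand-in for the bi-/uni-LSTM stack)."""
    def __init__(self, vocab_size=100, dim=8):
        self.emb = rng.normal(size=(vocab_size, dim))
        self.W = rng.normal(size=(dim, dim)) * 0.1

    def encode(self, token_ids):
        h = np.zeros(self.W.shape[0])
        states = []
        for t in token_ids:
            h = np.tanh(self.emb[t] + self.W @ h)  # simple recurrence
            states.append(h)
        return np.stack(states)  # context set C: one vector per token

class Decoder:
    """Target-language-specific decoder (with its own attention in the
    real model); one instance per target language."""
    def __init__(self, dim=8, vocab_size=100):
        self.W_out = rng.normal(size=(dim, vocab_size)) * 0.1

# Multilingual NMT: shared encoder, separate decoders for En and Fr targets.
shared_encoder = Encoder()
decoders = {"en": Decoder(), "fr": Decoder()}

# After NMT training, the target-specific decoders/attention are discarded ...
del decoders
# ... and the encoder initializes the downstream Encoder-Classifier.
C = shared_encoder.encode([3, 14, 15])
```

The key design point is that only the encoder's parameters are transferred; everything target-language-specific is thrown away.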
3.2 Multilingual Encoder-Classifier
In order to leverage the pre-trained multilingual representations introduced in Section 3.1, our encoder strictly follows the structure of a regular Recurrent Neural Network (RNN) based NMT encoder [Bahdanau, Cho, and Bengio2016] with a stacked layout [Wu et al.2016]. Given an input sequence of length T, our encoder contextualizes or encodes the input sequence into a set of vectors C, by first applying a bi-directional RNN [Schuster and Paliwal1997], followed by a stack of uni-directional RNNs. The hidden states of the final-layer RNN, h_1, ..., h_T, form the set C of context vectors which will be used by the classifier, where L denotes the number of RNN layers in the stacked encoder.
The task of the classifier is to predict a class label given the context set C. To ease this classification task given a variable-length input set C, a common approach in the literature is to extract a single sentence vector by making use of pooling over time [Collobert et al.2011]. Further, to increase the modeling capacity, the pooling operation can be parameterized using pre- and post-pooling networks. Formally, given the context set C, we extract a sentence vector q in three steps, using three networks: (1) a pre-pooling feed-forward network f_pre, (2) a pooling network f_pool, and (3) a post-pooling feed-forward network f_post, so that q = f_post(f_pool(f_pre(C))).
Finally, given the sentence vector q, a class label is predicted by employing a softmax function.
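The three-network pipeline can be sketched as follows. This is a minimal numpy illustration, assuming one-layer tanh feed-forward networks and the 512→128→32 dimensions given later in Section 4.2; the weight names and the random context set are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_pre, d_post, n_classes = 512, 128, 32, 2

W_pre = rng.normal(size=(d_in, d_pre)) * 0.05
W_post = rng.normal(size=(d_pre, d_post)) * 0.05
W_cls = rng.normal(size=(d_post, n_classes)) * 0.05

def f_pre(C):                    # pre-pooling feed-forward net, 512 -> 128
    return np.tanh(C @ W_pre)

def f_pool(H):                   # max-pooling over the time axis
    return H.max(axis=0)

def f_post(v):                   # post-pooling feed-forward net, 128 -> 32
    return np.tanh(v @ W_post)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

C = rng.normal(size=(20, d_in))  # 20 context vectors from the encoder
q = f_post(f_pool(f_pre(C)))     # fixed-size sentence vector
probs = softmax(q @ W_cls)       # predicted class distribution
```

Whatever the input length, pooling over time yields a fixed-size sentence vector, which is what lets one classifier head serve variable-length inputs.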
4 Experimental Design
We evaluate the proposed method on three common NLP tasks: Amazon Reviews, SST and SNLI. We utilize parallel data to train our multilingual NMT system, as detailed below.
4.1 Datasets
For the MT task, we use the WMT 2014 EnFr parallel corpus. The dataset contains 36 million EnFr sentence pairs. We swapped the source and target sentences to obtain parallel data for the FrEn translation task. We use these two datasets (72 million sentence pairs) to train a single multilingual NMT model to learn both these translation directions simultaneously. We generated a shared sub-word vocabulary [Sennrich, Haddow, and Birch2016, Schuster and Nakajima2012] of 32K units from all source and target training data. We use this sub-word vocabulary for all of our experiments below.
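A shared sub-word vocabulary of this kind is typically learned with byte-pair encoding over the concatenated source and target data, so that English and French text is segmented with one inventory. Below is a minimal BPE merge-learning sketch in the style of Sennrich et al. (2016) on a toy mixed-language corpus; the corpus and the number of merges are illustrative, not the paper's 32K-unit setup.

```python
import re
import collections

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the chosen pair with one merged symbol."""
    out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in vocab.items():
        out[pattern.sub(''.join(pair), word)] = freq
    return out

# Toy corpus mixing English and French, so both languages share one vocabulary.
corpus = "the cat sat , le chat le chat the the".split()
vocab = collections.Counter(' '.join(w) + ' </w>' for w in corpus)

merges = []
for _ in range(10):                       # learn 10 merge operations
    stats = get_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)      # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)
```

Because frequent fragments from both languages compete for the same merge operations, the resulting units cover English and French text alike, which is what makes a single shared encoder input space possible.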
The Amazon reviews dataset [Prettenhofer and Stein2010] is a multilingual sentiment classification dataset, providing data for four languages - English (En), French (Fr), German (De), and Japanese. We use the English and French datasets in our experiments. The dataset contains 6,000 documents in the train and test portions for each language. Each review consists of a category label, a title, a review, and a star rating (5-point scale). We only use the review text in our experiments. Following Prettenhofer and Stein (2010), we map reviews with lower scores (1 and 2) to negative examples and reviews with higher scores (4 and 5) to positive examples, thereby turning the task into a binary classification problem; reviews with score 3 are dropped. We split off 10% of the training dataset for development and use the rest for training, and we truncate each example, keeping the first 200 words of the review. Note that, since the data for each language was obtained by crawling different product pages, the data is not aligned across languages.
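The label mapping and truncation described above are straightforward to state in code. A small sketch (the function name and the toy inputs are ours, not from the dataset's tooling):

```python
def preprocess_reviews(reviews, max_words=200):
    """Map 5-point star ratings to binary sentiment labels:
    1-2 -> negative (0), 4-5 -> positive (1); score-3 reviews are dropped.
    Review text is truncated to the first `max_words` words."""
    examples = []
    for text, stars in reviews:
        if stars == 3:
            continue                      # neutral reviews are discarded
        label = 1 if stars >= 4 else 0
        tokens = text.split()[:max_words]
        examples.append((' '.join(tokens), label))
    return examples

data = [("great product , works perfectly", 5),
        ("ok I guess", 3),
        ("broke after one day", 1)]
processed = preprocess_reviews(data)
```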
The sentiment classification task proposed in Socher et al. (2013) is also a binary classification problem, where each sentence and phrase is associated with either a positive or a negative sentiment. We ignore phrase-level annotations and sentence-level neutral examples in our experiments. The dataset contains 6920, 872, and 1821 examples for training, development and testing, respectively. Since SST does not provide a multilingual test set, we used the public translation engine Google Translate (https://translate.google.com, as of October 2017) to translate the SST test set to French. Previous work by Agić and Schluter (2018) has shown that replacing a human-translated test set with a synthetic set (obtained using Google Translate) produces only a small difference, around 1% absolute accuracy, on their human-translated French SNLI test set. Therefore, the performance measured on our ‘pseudo’ French SST test set is expected to be a good indicator of zero-shot performance.
Natural language inference is a task that aims to determine whether a natural language hypothesis can justifiably be inferred from a natural language premise. SNLI [Bowman et al.2015] is one of the largest datasets for natural language inference in English and contains multiple sentence pairs with a sentence-level entailment label. Each pair of sentences can have one of three labels - entailment, contradiction, and neutral - annotated by multiple humans. The dataset contains 550K training, 10K validation, and 10K testing examples. To enable research on multilingual SNLI, Agić and Schluter (2018) chose a subset of the SNLI test set (1332 sentences) and had it professionally translated into four major languages - Arabic, French, Russian, and Spanish. We use the French test set for evaluation in Sections 6 and 7.
Table 1: Accuracy (%) of the three model variants on all four classification test sets.

| Model | Amazon (En) | Amazon (Fr) | SST (En) | SNLI (En) |
|---|---|---|---|---|
| Proposed model: Encoder-Classifier | 76.60 | 82.50 | 79.63 | 76.70 |
| + Pre-trained Encoder | 80.70 | 83.18 | 84.18 | 84.42 |
| + Freeze Encoder | 84.13 | 85.65 | 84.51 | 84.41 |
4.2 Model and Training Details
Here, we first describe the model and training details of the base multilingual NMT model whose encoder is reused in all other tasks. Then we provide details about the task-specific classifiers. For each task, we provide the specifics of the f_pre, f_pool, and f_post networks that build the task-specific classifier.
All the models in our experiments are trained using the Adam optimizer [Kingma and Ba2014] with label smoothing [Szegedy et al.2016], and unless otherwise stated below, layer normalization [Ba, Kiros, and Hinton2016] is applied to all LSTM gates and feed-forward layer inputs. We apply L2 regularization to the model weights and dropout to layer activations and sub-word embeddings. Hyper-parameters, such as the mixing ratio of L2 regularization, dropout rates, label smoothing uncertainty, batch sizes, optimizer learning rates, and weight initialization ranges, are tuned on the development sets provided for each task separately.
Our multilingual NMT model consists of a shared multilingual encoder and two decoders, one for English and the other for French. The multilingual encoder uses one bi-directional LSTM layer, followed by three stacked layers of uni-directional LSTMs. Each decoder consists of four stacked LSTM layers, with the first LSTM layer intertwined with additive attention networks [Bahdanau, Cho, and Bengio2016] to learn a source-target alignment function. All the uni-directional LSTMs, both in the encoder and the decoders, are equipped with residual connections [He et al.2016] to ease the optimization. LSTM hidden units and the shared source-target embedding dimensions are set to 512.
Similar to Dong et al. (2015), the multilingual NMT model is trained in a multi-task learning setup, where each decoder is augmented with a task-specific loss minimizing the negative conditional log-likelihood of the target sequence given the source sequence. During training, mini-batches of EnFr and FrEn examples are interleaved. We picked the best model based on the best average development set BLEU score over the two language pairs.
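The interleaving of the two direction streams can be sketched as a simple round-robin batch scheduler. This is our own minimal illustration of one way to alternate mini-batches, not the paper's actual input pipeline; the function names and toy pairs are assumptions.

```python
from itertools import islice

def batches(examples, batch_size):
    """Yield consecutive mini-batches from a list of examples."""
    it = iter(examples)
    while True:
        b = list(islice(it, batch_size))
        if not b:
            return
        yield b

def interleave(enfr, fren, batch_size):
    """Alternate mini-batches from the EnFr and FrEn streams, so both
    translation directions are mixed throughout training."""
    a, b = batches(enfr, batch_size), batches(fren, batch_size)
    while True:
        exhausted = True
        for stream, tag in ((a, "enfr"), (b, "fren")):
            batch = next(stream, None)
            if batch is not None:
                exhausted = False
                yield tag, batch
        if exhausted:
            return

enfr = [("hello", "bonjour"), ("cat", "chat"), ("dog", "chien")]
fren = [(f, e) for e, f in enfr]   # swapped copy gives the reverse direction
schedule = list(interleave(enfr, fren, batch_size=2))
```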
Amazon Reviews and SST
The Encoder-Classifier model here uses the encoder defined previously. With regard to the classifier, the pre- and post-pooling networks (f_pre, f_post) are both one-layer feed-forward networks, casting the dimension size from 512 to 128 and from 128 to 32, respectively. We use a max-pooling operator for the f_pool network to pool the activations over time.
SNLI
We extended the proposed Encoder-Classifier model to a multi-source model [Zoph and Knight2016], since SNLI is an inference task over the relation between two input sentences, a “premise” and a “hypothesis”. For the two sources, we use two separate encoders, both initialized with the same pre-trained multilingual NMT encoder, to obtain their representations. Following our notation, the encoder outputs are processed using f_pre, f_pool, and f_post networks, again with two separate network blocks. Specifically, f_pre consists of a co-attention layer [Lu et al.2016] followed by a two-layer feed-forward neural network with residual connections. We use max pooling over time for f_pool, and again a two-layer feed-forward neural network with residual connections as f_post. After processing the two sentence encodings using the two network blocks, we obtain two vectors, q_p and q_h, representing the premise and the hypothesis. Following Tai, Socher, and Manning (2015), we compute two relational vectors, the element-wise difference q_p - q_h and the element-wise product q_p ⊙ q_h, where ⊙ denotes the element-wise multiplication between two vectors. The final relation vector is obtained by concatenating the two. For both the “premise” and “hypothesis” feed-forward networks we use 512 hidden dimensions.
For Amazon Reviews, SST and SNLI tasks, we picked the best model based on the highest development set accuracy.
Table 2: Bridged vs. zero-shot accuracy (%) on the French test sets for the three model variants.

| Model | Amazon (Fr) bridged / zero-shot | SST (Fr) bridged / zero-shot | SNLI (Fr) bridged / zero-shot |
|---|---|---|---|
| Proposed model: Encoder-Classifier | 73.30 / 51.53 | 79.63 / 59.47 | 74.41 / 37.62 |
| + Pre-trained Encoder | 79.23 / 75.78 | 84.18 / 81.05 | 80.65 / 72.35 |
| + Freeze Encoder | 83.10 / 81.32 | 84.51 / 83.14 | 81.26 / 73.88 |
5 Transfer Learning Results
In this section, we report our results for the three tasks - Amazon Reviews (English and French), SST, and SNLI. For each task, we first build a baseline system using the proposed Encoder-Classifier architecture described in Section 3 where the encoder is initialized randomly. Next, we experiment with using the pre-trained multilingual NMT encoder to initialize the system as described in Section 3.1. Finally, we perform an experiment where we freeze the encoder after initialization and only update the classifier component of the system.
Table 1 summarizes the accuracy of our proposed system for these three different approaches on all the tasks. The first row in the table shows the baseline accuracy of our system for all four datasets. The second row shows the result from initializing with a pre-trained multilingual NMT encoder. It can be seen that this provides a significant improvement in accuracy, an average of 4.63%, across all the tasks. This illustrates that the multilingual NMT encoder has successfully learned transferable contextualized representations that are leveraged by the classifier component of our proposed system. These results are in line with those of McCann et al. (2017), where the authors used the representations from the top NMT encoder layer as an additional input to the task-specific system; in our setup, however, we reuse all of the layers of the encoder as a single pre-trained component in the task-specific system. The third row shows the results from freezing the pre-trained encoder after initialization and only training the classifier component. For the Amazon English and French tasks, freezing the encoder after initialization significantly improves the performance further. We hypothesize that since the Amazon dataset is a document-level classification task, the long input sequences are very different from the short sequences consumed by the NMT system, and hence freezing the encoder has a positive effect. This hypothesis is also supported by the SNLI and SST results, which contain sentence-level input sequences, where we did not find any significant difference between freezing and not freezing the encoder.
6 Zero-Shot Classification Results
In this section, we explore the zero-shot classification task in French for our systems. We assume that we do not have any French training data for the three tasks and test how well our proposed method can generalize to the unseen French language without any further training. Specifically, we reuse the three proposed systems from Table 1, trained only on the English classification task, and test them on data from an unseen language (e.g. French). A reasonable upper bound against which zero-shot performance should be compared is bridging: translating a French test text to English and then applying the English classifier to the translated text. If we assume the translation to be perfect, we should expect this approach to perform as well as the English classifier.
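The bridged-vs-zero-shot comparison can be made concrete with toy stand-ins. Everything here is hypothetical: the one-word lexicon "translator" and keyword "classifier" are ours, and in the paper the zero-shot path uses the multilingual Encoder-Classifier rather than a plain English classifier; the sketch only shows the shape of the two evaluation pipelines.

```python
def evaluate(classifier, examples):
    """Accuracy of a text classifier on (text, label) pairs."""
    correct = sum(classifier(text) == label for text, label in examples)
    return correct / len(examples)

# Hypothetical stand-ins for an Fr->En MT system and an English classifier.
def translate_fr_to_en(text):
    toy_lexicon = {"bon": "good", "mauvais": "bad"}
    return ' '.join(toy_lexicon.get(w, w) for w in text.split())

def english_classifier(text):
    return 1 if "good" in text else 0

french_test = [("bon", 1), ("mauvais", 0)]

# Bridged: translate French to English, then apply the English classifier.
bridged_acc = evaluate(lambda t: english_classifier(translate_fr_to_en(t)),
                       french_test)

# Zero-shot: apply the classifier to the French text directly, no translation.
zero_shot_acc = evaluate(english_classifier, french_test)
```

With a monolingual classifier the zero-shot path fails on unseen words, which is exactly the gap a multilingual encoder is meant to close.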
The Amazon Reviews and SNLI tasks have a French test set available, and we evaluate the performance of the bridged and zero-shot systems on each French test set. However, the SST dataset does not have a French test set, so the ‘pseudo French’ test set described in Section 4.1 is used to evaluate the zero-shot performance. We use the English accuracy scores from the SST column in Table 1 as a high-quality proxy for the SST bridged system, since translating the ‘pseudo French’ back to English would involve two distinct translation steps and hence accumulate more errors.
Table 2 summarizes all of our zero-shot results for French classification on the three tasks. It can be seen that just by using the pre-trained NMT encoder, the zero-shot performance increases drastically from almost random to within 10% of the bridged system. Freezing the encoder further pushes this performance closer to the bridged system. On the Amazon Review task, our zero-shot system is within 2% of the best bridged system. On the SST task, our zero-shot system obtains an accuracy of 83.14% which is within 1.5% of the bridged equivalent (in this case the English system).
Finally, on SNLI, we compare our best zero-shot system with bilingual and multilingual embedding-based methods evaluated on the same French test set in Agić and Schluter (2018). As illustrated in Table 3, our best zero-shot system obtains the highest accuracy, 73.88%. INVERT [Søgaard et al.2015] uses inverted indexing over a parallel corpus to obtain cross-lingual word representations. BiCVM [Hermann and Blunsom2014] learns bilingual compositional representations from sentence-aligned parallel corpora. In RANDOM [Vulić and Moens2016], bilingual embeddings are trained on top of parallel sentences with randomly shuffled tokens using skip-gram with negative sampling, and RATIO is similar to RANDOM, with the one difference being that the tokens in the parallel sentences are not randomly shuffled. Our system significantly outperforms all the other methods listed in Table 3, by 10.66% to 15.24%, demonstrating the effectiveness of our proposed approach.
Table 3: Zero-shot SNLI (Fr) accuracy (%) compared with cross-lingual embedding-based methods.

| Model | SNLI (Fr) |
|---|---|
| Our best zero-shot Encoder-Classifier | 73.88 |
| INVERT [Søgaard et al.2015] | 62.60 |
| BiCVM [Hermann and Blunsom2014] | 59.03 |
| RANDOM [Vulić and Moens2016] | 63.21 |
| RATIO [Vulić and Moens2016] | 58.64 |
7 Analysis
In this section, we analyze why our simple Encoder-Classifier system is effective at zero-shot classification. We perform a series of experiments to better understand this phenomenon. In particular, we study (1) the effect of the shared sub-word vocabulary, (2) the amount of multilingual training data, to measure the influence of multilinguality, (3) encoder/classifier capacity, to measure the influence of representation power, and (4) model behavior in different training phases, to assess the relation between generalization performance on English and zero-shot performance on French.
Effect of Shared Sub-Word Vocabulary.
As mentioned in Section 4.2, we use a shared sub-word vocabulary which can encode both English and French text in all of our models. In this subsection, we analyze how much a shared sub-word vocabulary alone helps the model generalize to a new language. To verify the effect of just the sub-word vocabulary on generalization, we picked the German test set from the Amazon Review task. Since German shares many sub-words with English and French, the Out-Of-Vocabulary (OOV) rate for the German test set using our vocabulary is just 0.078%. We design this experiment as a control to understand the effect of having a shared sub-word vocabulary that can encode the language, but for which no translation data was seen while training the multilingual NMT encoder.
Table 4: Zero-shot accuracy (%) on the German Amazon Review test set.

| Model | Amazon (De) |
|---|---|
| + Pre-trained Encoder | 52.98 |
| + Freeze Encoder | 57.72 |
From Table 4, we can see that despite the very low OOV rate, the ability of our system to perform zero-shot classification on German is close to random, i.e. around 50% accuracy. The frozen-encoder result shows a small deviation of about 7% over random, likely obtained from common sub-words having similar meanings across languages. This control experiment suggests that although a shared sub-word vocabulary is necessary, we still need to train the NMT system on parallel data from the language of interest for the system to perform zero-shot classification.
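The OOV-rate statistic used above is just the fraction of test tokens not covered by the vocabulary. A minimal sketch (the toy vocabulary and German tokens are ours, not the paper's 32K sub-word inventory):

```python
def oov_rate(tokens, vocab):
    """Fraction of test tokens not covered by the (sub-word) vocabulary."""
    oov = sum(1 for t in tokens if t not in vocab)
    return oov / len(tokens)

# Toy shared vocabulary and a toy German test text.
vocab = {"der", "hund", "ist", "gut", "le", "chien"}
german_tokens = "der hund ist gut der hund ist sehr gut".split()
rate = oov_rate(german_tokens, vocab)   # only "sehr" is out-of-vocabulary
```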
Effect of Translation Data.
We explore two dimensions of our multilingual NMT training data that could affect zero-shot performance. First, we investigate the effect of using symmetric training data for the two directions of the multilingual NMT system. We conduct an experiment where we take half of the sentences from the EnFr training set and use the swapped version of the other half of the sentences for training the model. Second, to see the effect of training data size, we run an experiment where we use only half of the training set, in a symmetric fashion. From Table 5, we can see that halving the training data size significantly lowers the zero-shot accuracy on the French SNLI test set, by 7.16%. However, the symmetric and asymmetric versions of the data perform comparably on both tasks. This shows that the multilingual NMT system is able to learn an effective interlingua without needing symmetric data across the language pairs involved.
Table 5: Effect of NMT parallel data on SNLI accuracy (%).

| Parallel data type for NMT | SNLI (En) | SNLI (Fr) |
|---|---|---|
| Symmetric data (full) | 84.13 | 73.88 |
| Symmetric data (half) | 80.79 | 66.72 |
| Asymmetric data (half) | 81.15 | 67.63 |
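The symmetric-vs-asymmetric data construction described above can be written out directly. This is our own sketch of one plausible reading of the setup (function name and toy pairs are assumptions): symmetric-half uses the same half of the pairs in both directions, while asymmetric-half uses disjoint halves for EnFr and FrEn.

```python
def make_training_data(en_fr_pairs, mode):
    """Build multilingual NMT training examples from EnFr pairs.
    'symmetric-half':  half of the pairs, used in both directions.
    'asymmetric-half': first half as EnFr, swapped second half as FrEn."""
    half = len(en_fr_pairs) // 2
    if mode == "symmetric-half":
        kept = en_fr_pairs[:half]
        return ([("enfr", p) for p in kept] +
                [("fren", (f, e)) for e, f in kept])
    if mode == "asymmetric-half":
        return ([("enfr", p) for p in en_fr_pairs[:half]] +
                [("fren", (f, e)) for e, f in en_fr_pairs[half:]])
    raise ValueError(mode)

pairs = [("hello", "bonjour"), ("cat", "chat"),
         ("dog", "chien"), ("house", "maison")]
sym = make_training_data(pairs, "symmetric-half")
asym = make_training_data(pairs, "asymmetric-half")
```

Both variants yield the same number of examples; they differ only in whether a sentence pair is ever seen in both directions.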
Table 6: Effect of encoder and classifier capacity on SNLI accuracy (%).

| Encoder components | Simpler classifier, SNLI (En) | Simpler classifier, SNLI (Fr) | Complex classifier, SNLI (En) | Complex classifier, SNLI (Fr) |
|---|---|---|---|---|
| + bi-directional layer 1 | 67.99 | 58.19 | 83.40 | 64.74 |
| + layer 2 | 67.00 | 61.01 | 83.63 | 72.81 |
| + layer 3 | 67.26 | 60.55 | 84.17 | 74.33 |
| + layer 4 | 67.26 | 61.61 | 84.41 | 74.11 |
Effect of Encoder/Classifier Capacity.
We study the effect of the capacity of the two parts of our model on the final accuracies. Specifically, we experimented with two variants of the classifier: a simple linear classifier, where we set the f_pre and f_post networks to identity (we empirically found that for simple classifiers, using mean pooling for f_pool performs considerably better than max-pooling, 67.26 vs. 61.19 test accuracy, on the SNLI task), and a complex classifier (details provided in Section 4.2). Next, we experimented with reusing only parts of the multilingual encoder, in a bottom-up fashion. Table 6 summarizes all of our experiments with respect to model capacity. As expected, going from a simple linear classifier to a complex classifier significantly improves both English and zero-shot French performance on the SNLI task. However, even a simple linear classifier can achieve significant zero-shot performance when provided with rich enough encodings (49.66 to 61.61 accuracy). Changing the encoder capacity tells an interesting story: as we selectively reuse parts of the encoder from the embedding layer to the top, the English performance only increases by about 2%, whereas the zero-shot performance increases by about 18% at most with the complex classifier. This means that the additional layers in the encoder are essential for the proposed system to model a language-agnostic representation (interlingua), which enables better zero-shot classification. Moreover, it should be noted that the best zero-shot performance is obtained by using the complex classifier and only up to layer 3 of the encoder. Although this gap is not big enough to be significant, we hypothesize that the top layer of the encoder could be very specific to the MT task and hence might not be best suited for zero-shot classification.
Effect of Early vs. Late Phases of Training.
Figure 1 shows that, as the number of training steps increases, the test accuracy goes up, whereas the test loss on the SNLI task increases slightly, hinting at over-fitting on the English task. (We observe that test loss correlates better with zero-shot accuracy than test accuracy does.) As expected, choosing checkpoints from before the onset of over-fitting benefits zero-shot performance on the French SNLI test set. This suggests that over-training on the English task might hurt the ability of the model to generalize to a new language, and it motivated the next analysis.
Table 7: Effect of parameter smoothing on SNLI accuracy (%).

| Smoothing Range (steps) | SNLI (En) | SNLI (Fr) |
|---|---|---|
Effect of Parameter Smoothing.
Parameter smoothing, or checkpoint averaging [Junczys-Dowmunt, Dwojak, and Sennrich2016], is a technique which smooths point estimates of the learned parameters by averaging the parameters of checkpoints from a window of training steps and using the average for inference. It aims to improve generalization and make inference less susceptible to the effects of over-fitting. We hypothesize that a system with enhanced generalization might be better suited for zero-shot classification, since zero-shot classification measures the ability of the model to generalize to a new language. Table 7 validates our hypothesis, showing that although averaging over 20k steps only improves the English SNLI score by 0.24%, it improves the corresponding French zero-shot score by 1.82%.
8 Conclusion
In this paper, we have demonstrated a simple yet effective approach to cross-lingual transfer learning using representations from a multilingual NMT model. Our proposed approach of reusing the encoder from a multilingual NMT system as a pre-trained component provides significant improvements on three downstream tasks. Further, our approach enables us to perform surprisingly competitive zero-shot classification on an unseen language and outperforms cross-lingual embedding-based methods. Finally, we end with a series of analyses which shed light on the factors that contribute to the zero-shot phenomenon. We hope that these results showcase the efficacy of multilingual NMT for learning transferable contextualized representations for many downstream tasks.
- [Agić and Schluter2018] Agić, Z̆., and Schluter, N. 2018. Baselines and Test Data for Cross-Lingual Inference. In LREC.
- [Ammar et al.2016] Ammar, W.; Mulcaire, G.; Tsvetkov, Y.; Lample, G.; Dyer, C.; and Smith, N. A. 2016. Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
- [Arora, Liang, and Ma2017] Arora, S.; Liang, Y.; and Ma, T. 2017. A simple but tough-to-beat baseline for sentence embeddings. In ICLR.
- [Ba, Kiros, and Hinton2016] Ba, L. J.; Kiros, R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
- [Bahdanau, Cho, and Bengio2016] Bahdanau, D.; Cho, K.; and Bengio, Y. 2016. Neural machine translation by jointly learning to align and translate. In ICLR.
- [Bojanowski et al.2017] Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching word vectors with subword information. TACL 5:135–146.
- [Bowman et al.2015] Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP, 632–642.
- [Collobert et al.2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12:2493–2537.
- [Conneau et al.2017] Conneau, A.; Kiela, D.; Schwenk, H.; Barrault, L.; and Bordes, A. 2017. Supervised learning of universal sentence representations from natural language inference data. In EMNLP, 670–680.
- [Dong et al.2015] Dong, D.; Wu, H.; He, W.; Yu, D.; and Wang, H. 2015. Multi-task learning for multiple language translation. In ACL, 1723–1732.
- [Faruqui and Dyer2014] Faruqui, M., and Dyer, C. 2014. Improving vector space word representations using multilingual correlation. In EACL, 462–471.
- [Fernández, Esuli, and Sebastiani2016] Fernández, A. M.; Esuli, A.; and Sebastiani, F. 2016. Distributional correspondence indexing for cross-lingual and cross-domain sentiment classification. JAIR 55:131–163.
- [Firat, Cho, and Bengio2016] Firat, O.; Cho, K.; and Bengio, Y. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In NAACL: HLT, 866–875.
- [Gouws, Bengio, and Corrado2015] Gouws, S.; Bengio, Y.; and Corrado, G. 2015. BilBOWA: Fast bilingual distributed representations without word alignments. In ICML - Volume 37, 748–756.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR, 770–778.
- [He et al.2017] He, L.; Lee, K.; Lewis, M.; and Zettlemoyer, L. 2017. Deep semantic role labeling: What works and what’s next. In ACL, 473–483.
- [Hermann and Blunsom2014] Hermann, K. M., and Blunsom, P. 2014. Multilingual models for compositional distributed semantics. In ACL, 58–68.
- [Hill et al.2017] Hill, F.; Cho, K.; Jean, S.; and Bengio, Y. 2017. The representational geometry of word meanings acquired by neural machine translation models. Machine Translation 31(1-2):3–18.
- [Hill, Cho, and Korhonen2016] Hill, F.; Cho, K.; and Korhonen, A. 2016. Learning distributed representations of sentences from unlabelled data. In NAACL: HLT, 1367–1377.
- [Huh, Agrawal, and Efros2016] Huh, M.; Agrawal, P.; and Efros, A. A. 2016. What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614.
- [Johnson et al.2017] Johnson, M.; Schuster, M.; Le, Q.; Krikun, M.; et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL 5:339–351.
- [Junczys-Dowmunt, Dwojak, and Sennrich2016] Junczys-Dowmunt, M.; Dwojak, T.; and Sennrich, R. 2016. The AMU-UEDIN submission to the WMT16 news translation task: Attention-based NMT models as feature functions in phrase-based SMT. In Proceedings of the First Conference on Machine Translation, 319–325.
- [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. In ICLR.
- [Kiros et al.2015] Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In NIPS 28. 3294–3302.
- [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS 25. 1097–1105.
- [Lee, Cho, and Hofmann2017] Lee, J.; Cho, K.; and Hofmann, T. 2017. Fully character-level neural machine translation without explicit segmentation. TACL 5:365–378.
- [Liu et al.2018] Liu, X.; Shen, Y.; Duh, K.; and Gao, J. 2018. Stochastic answer networks for machine reading comprehension. In ACL, 1694–1704.
- [Lu et al.2016] Lu, J.; Yang, J.; Batra, D.; and Parikh, D. 2016. Hierarchical question-image co-attention for visual question answering. In NIPS 29. 289–297.
- [Luong et al.2016] Luong, M.; Le, Q. V.; Sutskever, I.; Vinyals, O.; and Kaiser, L. 2016. Multi-task sequence to sequence learning. In ICLR.
- [Luong, Pham, and Manning2015] Luong, T.; Pham, H.; and Manning, C. D. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of Workshop on Vector Space Modeling for NLP, 151–159.
- [McCann et al.2017] McCann, B.; Bradbury, J.; Xiong, C.; and Socher, R. 2017. Learned in translation: Contextualized word vectors. In NIPS 30. 6294–6305.
- [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS 26. 3111–3119.
- [Nallapati, Xiang, and Zhou2016] Nallapati, R.; Xiang, B.; and Zhou, B. 2016. Sequence-to-sequence RNNs for text summarization. In ICLR Workshop.
- [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In EMNLP, 1532–1543.
- [Peters et al.2017] Peters, M.; Ammar, W.; Bhagavatula, C.; and Power, R. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL, 1756–1765.
- [Peters et al.2018] Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL: HLT, 2227–2237.
- [Prettenhofer and Stein2010] Prettenhofer, P., and Stein, B. 2010. Cross-language text classification using structural correspondence learning. In ACL, 1118–1127.
- [Schuster and Nakajima2012] Schuster, M., and Nakajima, K. 2012. Japanese and korean voice search. In ICASSP, 5149–5152.
- [Schuster and Paliwal1997] Schuster, M., and Paliwal, K. K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11):2673–2681.
- [Sennrich, Haddow, and Birch2016] Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In ACL, 1715–1725.
- [Socher et al.2013] Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C. D.; Ng, A.; and Potts, C. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 1631–1642.
- [Søgaard et al.2015] Søgaard, A.; Agić, v.; Martínez Alonso, H.; Plank, B.; Bohnet, B.; and Johannsen, A. 2015. Inverted indexing for cross-lingual nlp. In ACL and IJCNLP, 1713–1722.
- [Szegedy et al.2016] Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR, 2818–2826.
- [Tai, Socher, and Manning2015] Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL and IJCNLP, 1556–1566.
- [Vulić and Moens2016] Vulić, I., and Moens, M.-F. 2016. Bilingual distributed word representations from document-aligned comparable data. Artificial Intelligence Research 55(1):953–994.
- [Wieting et al.2016] Wieting, J.; Bansal, M.; Gimpel, K.; and Livescu, K. 2016. Charagram: Embedding words and sentences via character n-grams. In EMNLP, 1504–1515.
- [Wu et al.2016] Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv: 1609.08144.
- [Yosinski et al.2014] Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In NIPS 27. 3320–3328.
- [Zoph and Knight2016] Zoph, B., and Knight, K. 2016. Multi-source neural translation. In NAACL: HLT, 30–34.
- [Zou et al.2013] Zou, W. Y.; Socher, R.; Cer, D.; and Manning, C. D. 2013. Bilingual word embeddings for phrase-based machine translation. In EMNLP, 1393–1398.