Modelling Latent Translations for Cross-Lingual Transfer

07/23/2021 · Edoardo Maria Ponti, et al. · Google, University of Cambridge, Montréal Institute for Learning Algorithms

While achieving state-of-the-art results in multiple tasks and languages, translation-based cross-lingual transfer is often overlooked in favour of massively multilingual pre-trained encoders. Arguably, this is due to its main limitations: 1) translation errors percolating to the classification phase and 2) the insufficient expressiveness of the maximum-likelihood translation. To remedy this, we propose a new technique that integrates both steps of the traditional pipeline (translation and classification) into a single model, by treating the intermediate translations as a latent random variable. As a result, 1) the neural machine translation system can be fine-tuned with a variant of Minimum Risk Training where the reward is the accuracy of the downstream task classifier. Moreover, 2) multiple samples can be drawn to approximate the expected loss across all possible translations during inference. We evaluate our novel latent translation-based model on a series of multilingual NLU tasks, including commonsense reasoning, paraphrase identification, and natural language inference. We report gains for both zero-shot and few-shot learning setups, up to 2.7 accuracy points on average, which are even more prominent for low-resource languages (e.g., Haitian Creole). Finally, we carry out in-depth analyses comparing different underlying NMT models and assessing the impact of alternative translations on the downstream performance.


1 Introduction

Cross-lingual knowledge transfer supports the development of natural language technology for many of the world’s languages (Ruder et al., 2019; Ponti et al., 2019, inter alia). The approach currently predominant for cross-lingual transfer relies on massively multilingual pre-trained encoders (Conneau et al., 2020; Xue et al., 2020; Liu et al., 2020) that are fine-tuned on a source language and perform zero-shot (Wu and Dredze, 2019; Ponti et al., 2021) or few-shot (Lauscher et al., 2020; Zhao et al., 2020) prediction in a target language.

An alternative approach to cross-lingual transfer, translate test, is based on translating the evaluation set into the source language and leveraging a monolingual classifier instead (Banea et al., 2008; Durrett et al., 2012; Conneau et al., 2018). This approach is currently under-investigated and usually relegated to the role of a baseline due to its lower flexibility, e.g., it is not suitable for sequence labelling tasks. Yet, it achieves state-of-the-art results on most benchmarks for multilingual Natural Language Understanding and Question Answering tasks (Hu et al., 2020; Ponti et al., 2020; Ruder et al., 2021, inter alia). Moreover, the availability of off-the-shelf translation models for multiple languages (Wu et al., 2016; Tiedemann and Thottingal, 2020; Liu et al., 2020) provides coverage for transfer to a large number of target languages. Indeed, very recent preliminary results suggest that translation-based transfer might even outperform monolingual pre-trained models in languages other than English (Isbister et al., 2021).

Translation-based transfer, however, currently suffers from two main limitations. First, translation errors accumulate along the pipeline: sentences that may be unfaithful to the target-language original and/or ungrammatical in the source language are fed to the classifier, degrading its performance. Second, only the maximum-likelihood translation is usually retrieved, which may not capture the precise meaning of the original sentence or its most relevant features for the downstream task.

In this work, we propose a method to address these limitations and further enhance translation-based transfer. In particular, by treating the previously separate components for translation and classification as an integrated system, we re-interpret the traditional pipeline as a single model with an intermediate latent translation between the target text and its classification label. As a consequence of this change, 1) the machine translation component receives a feedback signal from the downstream loss and can be fine-tuned to better adapt to a specific task; 2) multiple translations can be sampled from the latent variable to perform ensemble prediction. Crucially, this method is sample-efficient, as both components can be initialised with pre-trained models and deployed in a zero-shot or few-shot learning setting.

Naïvely training the machine translation system via gradient descent, however, is often impossible due to the incompatibility of the token vocabularies of the two components. Therefore, we devise a universal method for fine-tuning that is suitable for any pair of pre-trained translator and classifier. We propose an optimisation scheme based on Minimum Risk Training (MRT; Och, 2003; Smith and Eisner, 2006; Shen et al., 2016, inter alia): it only requires the gradient of the translation scores and a reward based on downstream classification metrics.

Our evaluation is conducted on all multilingual Natural Language Understanding (NLU) tasks that are part of the popular XTREME (Hu et al., 2020) and XTREME-R (Ruder et al., 2021) cross-lingual transfer benchmarks. These include PAWS-X (Yang et al., 2019) for paraphrase identification, XCOPA (Ponti et al., 2020) for commonsense reasoning, and XNLI (Conneau et al., 2018) for natural language inference. Our model improves over standard translation-based methods in zero-shot and few-shot scenarios, up to 2.6 accuracy points on average, with peaks of 5.6 points for resource-poor languages like Haitian Creole.

As an additional contribution, we also examine for the first time the impact of the translation quality (as measured by BLEU) and multilingual coverage of several models on downstream classification performance. In particular, we compare the Google Cloud Translation API (https://cloud.google.com/translate), Marian MT (Tiedemann and Thottingal, 2020; Junczys-Dowmunt et al., 2018), and mBART (Liu et al., 2020; Tang et al., 2020), revealing substantial differences among these models. We release our code publicly at github.com/McGill-NLP/latent-translation.

2 Latent Translation Model

In the present work, we are concerned with the problem of performing zero/few-shot inference in any given target language t by transferring knowledge from a source language s. Specifically, we focus on classification tasks with data of the form (x_t, y), where x_t is a discrete sequence of tokens from the target-language vocabulary and y is a label index. We further assume that a parallel corpus between s and t is available. This enables translation-based transfer (Banea et al., 2008; Durrett et al., 2012), which comes in two flavours: either the evaluation set can be translated into the source language (translate test), or the training set can be translated into the target language (translate train). We opt for the former, as it is both more efficient (the evaluation set is typically much smaller than the training set) and more effective (Conneau et al., 2018; Hu et al., 2020).

‘Translate test’ transfer relies on two main components: 1) a classifier p_φ(y | x_s), parameterised by φ and trained on source-language labelled data, and 2) a translator p_θ(x_s | x_t), parameterised by θ and trained on parallel data. These are deployed for predictive inference sequentially in the following pipeline: first, the i-th target-language sentence(s) x_t are mapped to their translation x_s in the source language. Afterwards, x_s is fed to the classifier to produce the label ŷ.
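A minimal sketch of this baseline pipeline is given below. It is illustrative rather than the paper's exact implementation: the checkpoint names, the language pair, and the assumption that the classifier has already been fine-tuned on the English task data are all placeholders.

```python
# Illustrative 'translate test' baseline: translate the target-language input into
# English with an off-the-shelf NMT model, then classify the maximum-likelihood
# translation with a monolingual English classifier.
import torch
from transformers import (MarianMTModel, MarianTokenizer,
                          RobertaForSequenceClassification, RobertaTokenizer)

nmt_name = "Helsinki-NLP/opus-mt-tr-en"        # example pair: Turkish -> English
nmt_tok = MarianTokenizer.from_pretrained(nmt_name)
nmt = MarianMTModel.from_pretrained(nmt_name)

clf_name = "roberta-large"                     # assumed already fine-tuned on the English task
clf_tok = RobertaTokenizer.from_pretrained(clf_name)
clf = RobertaForSequenceClassification.from_pretrained(clf_name, num_labels=2)

def translate_test(sentence_target_lang: str) -> int:
    # 1) maximum-likelihood translation into the source language (English)
    batch = nmt_tok([sentence_target_lang], return_tensors="pt")
    translation_ids = nmt.generate(**batch, num_beams=4)
    sentence_en = nmt_tok.batch_decode(translation_ids, skip_special_tokens=True)[0]
    # 2) classify the translation
    inputs = clf_tok(sentence_en, return_tensors="pt")
    with torch.no_grad():
        logits = clf(**inputs).logits
    return int(logits.argmax(dim=-1))
```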

However, this pipeline is arguably encumbered by two main limitations. Firstly, there is no information flow between the translator and the classifier; therefore, the errors in translation cannot be corrected in the subsequent step. Secondly, there may exist multiple correct translations, each reflecting different facets of the original sentence. Therefore, a single maximum-likelihood translation may not be representative of the underlying distribution, which is conceivably multi-modal.

Figure 1: A Bayesian graph of the generative model for latent translation cross-lingual transfer.

Therefore, to grapple with both these problems, we propose to integrate both the translator and the classifier components into a unified model. This amounts to treating the translations x_s as a latent random variable situated between the target-language sentences x_t and the label y. From a Bayesian perspective, this is equivalent to the graphical model shown in Figure 1. Hence, if we assume the conditional independence y ⊥ x_t | x_s, posterior inference over the neural parameters θ and φ given the observed pairs (x_t, y) requires estimating:

p(φ, θ | x_t, y) ∝ p(φ) p(θ) Σ_{x_s} p(y | x_s; φ) p(x_s | x_t; θ)    (1)

In other words, the latent variable x_s must be integrated out. By virtue of Equation 1, the estimate for the translator parameters θ is influenced by the label y. Hence, out-of-the-box translation models can be adapted to domain- or task-specific cues based on the feedback that downstream classification provides for any translation they generate. Moreover, the entire space of possible translations is explored rather than just the maximum-likelihood sequence. Thus, the multi-faceted semantics of the input sentence in the target language is better preserved.

This formulation, however, poses additional challenges. First, the domain of x_s (the set of all possible source-language token sequences) is countably infinite. Therefore, integrating over this space, and estimating the full probability distribution, is virtually impossible. Second, since the latent translations are discrete sequences, the model is not fully differentiable. Hence, training it via gradient descent is not trivial. In what follows, we propose approximate solutions in order to perform inference under our model.

2.1 Monte Carlo Sampling of Translations

While it may not be feasible to integrate over the space of all possible translations, we can approximate the likelihood term of Equation 1 through a finite set of K Monte Carlo samples:

p(y | x_t; φ, θ) = Σ_{x_s} p(y | x_s; φ) p(x_s | x_t; θ) ≈ (1/K) Σ_{k=1}^{K} p(y | x_s^(k); φ),  with x_s^(k) ∼ p(x_s | x_t; θ)    (2)

In practice, this amounts to performing an ensemble prediction,² where the K candidate outputs of the translator are fed to the classifier. The predictive distributions yielded by the classifier given each candidate are then averaged:

L_CLS = −(1/N) Σ_{i=1}^{N} log [ (1/K) Σ_{k=1}^{K} σ(f_φ(x_s^(i,k)))_{y_i} ]    (3)

where (·)_{y_i} indexes the probability of the gold label of the i-th example and σ is the softmax function.

² In addition to Monte Carlo sampling, we also experimented with weighted averages, where the weight was the normalised sample probability. However, we found this to be detrimental to downstream performance.
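The sketch below makes the Monte Carlo ensemble of Equations 2-3 concrete. It assumes `classifier` and `tokenizer` are a fine-tuned English sequence classifier and its tokenizer (e.g., RoBERTa Large); the function name is illustrative.

```python
# Monte Carlo ensemble: the K candidate translations of one example are classified
# independently and their predictive distributions are averaged.
import torch

def ensemble_predict(classifier, tokenizer, translations: list[str]) -> torch.Tensor:
    """Return the averaged label distribution p(y | x_t) ≈ (1/K) Σ_k p(y | x_s^(k))."""
    inputs = tokenizer(translations, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = classifier(**inputs).logits   # shape: (K, num_labels)
    probs = torch.softmax(logits, dim=-1)      # p(y | x_s^(k); φ) for each sample k
    return probs.mean(dim=0)                   # average over the K Monte Carlo samples

# Usage: label = int(ensemble_predict(clf, clf_tok, candidate_translations).argmax())
```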

2.2 Minimum Risk Training

The second difficulty of our integrated model lies in the fact that the latent translations are discrete sequences, which implies a non-differentiable hard decision boundary. This could be easily addressed by relaxing the tokens that are part of the translation into continuous variables (Maddison et al., 2017; Jang et al., 2017) or adopting straight-through estimators (Bengio et al., 2013; Raiko et al., 2014). Nonetheless, this may still be impractical: it is common to instantiate both the translator and the classifier with pre-trained neural models, and in this case the output vocabulary of the former often does not coincide with the input vocabulary of the latter, due to different sub-word tokenisation strategies (Rust et al., 2020). Therefore, the domain of the translator output and that of the classifier input may not correspond.

In order to make our method as general as possible and match any pair of translator and classifier with arbitrary vocabularies, we resort instead to a reinforcement learning technique to fine-tune the translator parameters. In particular, we adopt a version of Minimum Risk Training (MRT; Shen et al., 2016): its key idea is minimising the risk, expressed as a negative reward weighted by its probability. MRT is typically harnessed for NMT as a downstream task; the reward is thus BLEU or a similar metric. In our setting, however, we propose to use classification accuracy as the reward. Let s_θ(x_s) represent the score (i.e., the unnormalised probability) of a translation. The loss can be formulated as follows:

L_MRT = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ exp s_θ(x_s^(i,k)) / Σ_{k'=1}^{K} exp s_θ(x_s^(i,k')) ] · r(x_s^(i,k))    (4)

where N is the number of training inputs (here: the few-shot examples), K the number of translation samples generated for each input,³ x_s^(i,k) a latent translation, and θ the MT model parameters. The downstream task reward is r(x_s^(i,k)) = log p_φ(y_i | x_s^(i,k)), the log-likelihood of the classifier prediction based on the k-th individual translation.

³ Larger sample sizes encourage the MT model to explore more alternative translations and yield closer approximations to the true expected risk, but are more expensive to compute.

We optimise the parameters θ through gradient descent, where the gradient of the loss in Equation 4 with respect to the j-th weight is computed as:

∂L_MRT/∂θ_j = −(1/N) Σ_{i=1}^{N} ( E_q[ r(x_s) ∂s_θ(x_s)/∂θ_j ] − E_q[ r(x_s) ] · E_q[ ∂s_θ(x_s)/∂θ_j ] )    (5)

where q denotes the renormalised distribution over the K samples from Equation 4, and the expectations are computed by explicitly enumerating the Monte Carlo samples.
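As a minimal sketch of Equation 4, the snippet below computes the MRT loss for a single example, assuming the translator scores of the K sampled translations and the classifier log-likelihoods of the gold label are already available; variable names are illustrative. Because gradients flow only through the translation scores, the translator and classifier vocabularies never need to match, and the gradient of Equation 5 can be obtained by autograd.

```python
import torch

def mrt_loss(translation_scores: torch.Tensor,  # (K,) scores s_θ(x_s^(k)), requires_grad=True
             reward_log_liks: torch.Tensor      # (K,) log p_φ(y | x_s^(k)) for the gold label
             ) -> torch.Tensor:
    # Renormalise the scores over the K samples to obtain q(x_s^(k) | x_t)
    q = torch.softmax(translation_scores, dim=-1)
    # Risk = expected negative reward under q; minimising it raises the probability of
    # translations that the downstream classifier gets right.
    return -(q * reward_log_liks.detach()).sum()

# During few-shot learning, mrt_loss is averaged over the N training inputs and its
# gradient w.r.t. the translator parameters (Equation 5) is taken via loss.backward().
```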

2.3 MAP Inference

Combining the objectives outlined in Section 2.1 and Section 2.2, we finally obtain the maximum-a-posteriori (MAP) approximation to posterior inference over the graphical model in Figure 1. This is expressed in the following objective, which we use for fine-tuning the classifier and translator parameters during few-shot learning:

L_MAP = L_CLS + L_MRT + Ω(φ) + Ω(θ)    (6)

where L_CLS is taken from Section 2.1 (Equation 3) and L_MRT from Equation 4. Note that Equation 6 contains two regularisers, Ω(φ) and Ω(θ), which correspond to the prior terms in Equation 1.

In zero-shot setups, as well as after the parameters have been fine-tuned in few-shot setups, we perform predictive inference on new data points in the evaluation set through the ensemble of Monte Carlo samples, as described in Equation 2.
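To make the combination concrete, here is a hedged sketch of a single few-shot update under Equation 6. The regulariser terms and all names are illustrative placeholders, not the paper's implementation; the optimiser split follows Section 3 (Adam for the classifier, SGD for the translator).

```python
import torch

def few_shot_step(cls_loss: torch.Tensor, mrt: torch.Tensor,
                  reg_phi: torch.Tensor, reg_theta: torch.Tensor,
                  clf_optimizer: torch.optim.Optimizer,
                  nmt_optimizer: torch.optim.Optimizer) -> float:
    # Equation 6: ensemble classification loss + MRT loss + regularisers (prior terms)
    loss = cls_loss + mrt + reg_phi + reg_theta
    clf_optimizer.zero_grad()
    nmt_optimizer.zero_grad()
    loss.backward()            # gradients flow to both the classifier (φ) and the translator (θ)
    clf_optimizer.step()
    nmt_optimizer.step()
    return float(loss.detach())
```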

3 Experimental Setup

Evaluation Tasks and Data. We conduct experiments on three established cross-lingual transfer datasets for natural language understanding tasks. 1) PAWS-X (Yang et al., 2019) for paraphrase identification: given a pair of sentences, a binary label specifies whether they express the same meaning; 2) XCOPA (Ponti et al., 2020) for commonsense causal reasoning: given a premise, a question, and a pair of (cause, effect) hypotheses, the model must determine which of the two is correct; 3) XNLI (Conneau et al., 2018) for natural language inference: a pair of sentences is classified as either an entailment, a contradiction, or a neutral relationship. Together, these 3 tasks cover a wide variety of typologically diverse languages (22 distinct ones in addition to English).

Following prior work (Hu et al., 2020; Ruder et al., 2021), English is the source language in all experiments. In all three tasks, the English training set is used to train the classifier, the English development set for hyper-parameter selection, the development sets in other languages for few-shot learning (i.e., for fine-tuning both the classifier and the translator), and the test sets of the target languages for evaluation.

Machine Translation Systems. In order to assess the impact of the underlying translation model on downstream performance, we compare three established NMT systems: 1) a closed-source system, the Google Cloud Translation API.⁴ Moreover, we consider two open-source systems: 2) mBART (Liu et al., 2020; Tang et al., 2020), a multilingual model covering 50 languages, pre-trained on a denoising objective and fine-tuned on parallel data; and 3) Marian MT (Junczys-Dowmunt et al., 2018), a set of hundreds of pair-wise models trained directly on parallel data from OPUS (Tiedemann and Thottingal, 2020).⁵

⁴ As of 6 October 2020 for XCOPA and 30 April 2021 for PAWS-X and XNLI.
⁵ Pre-trained models are sourced from github.com/huggingface/transformers: Helsinki-NLP/opus-mt-{src}-{tgt} for Marian and facebook/mbart-large-50-many-to-one-mmt for mBART.

mBART is an encoder-decoder model where both the encoder and the decoder have 12 Transformer layers with 16 attention heads per layer; the hidden dimension is 1,024, whereas the FFN inner dimension is 4,096. Marian MT is a lighter model, where all these values are exactly halved (6 layers, 8 heads, hidden dimension 512, FFN dimension 2,048). Further, Marian MT differs slightly from mBART in other regards: it employs static (sinusoidal) positional embeddings and does not perform layer normalisation.

Several languages in our set of cross-lingual tasks are not covered by some of the translation systems: no system currently covers qu;⁶ mBART also lacks bg, el, and ht, while Marian MT lacks el, sw, and ta.⁷

⁶ We refer to languages through their ISO 639-1 codes.
⁷ Artetxe et al. (2020) noted that jointly translating all sentences that belong to the same example (e.g., premise and hypothesis) might be beneficial, as this retains their lexical overlap. Despite this, in our implementation they are translated separately, as Marian and mBART do not handle multiple inputs simultaneously.
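The sketch below shows one way of drawing the 12 most likely translations, together with their scores, from an off-the-shelf Marian checkpoint via beam search. The checkpoint name follows the Helsinki-NLP/opus-mt-{src}-{tgt} convention from footnote 5; the language pair and input sentence are illustrative.

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-it-en"              # e.g. Italian -> English
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tok(["Il ragazzo ha bevuto troppo alla festa."], return_tensors="pt")
out = model.generate(
    **batch,
    num_beams=12,
    num_return_sequences=12,       # the 12-best list used for ensemble prediction
    output_scores=True,
    return_dict_in_generate=True,
)
candidates = tok.batch_decode(out.sequences, skip_special_tokens=True)
beam_scores = out.sequences_scores  # length-normalised log-probabilities of each candidate
# Note: for MRT fine-tuning, differentiable scores s_θ(x_s) would be recomputed with a
# standard forward pass over the sampled sequences rather than taken from generate().
```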

Classifier. As a classifier over the output of all MT systems, we use RoBERTa Large (Liu et al., 2019), a 24-layer monolingual pre-trained encoder for English, with a 2-layer perceptron (MLP) head. The encoder's hidden size is 1,024, whereas the inner dimension of the MLPs (both in the encoder and in the head) is 4,096.

In order to establish another common (and not translation-based) baseline in all evaluation tasks, we also fine-tune a multilingual encoder with a configuration identical to RoBERTa's, XLM-R Large (Conneau et al., 2020). In this case, the target-language text is fed directly to the classifier, without requiring translation. We label this approach ME (multilingual encoder), as opposed to ‘translate test’ (TT).⁸

⁸ We also experimented with using massively multilingual NMT models, such as mBART, as encoders for ME-style transfer (Eriguchi et al., 2018; Siddhant et al., 2020). However, their scores significantly lag behind XLM-R Large. For brevity, we report them in Table 3 in the Appendix.

Optimisation. Both during fine-tuning on English and during few-shot learning on the target language, we train all models for a fixed number of epochs. The classifier's parameters are optimised through Adam (Kingma and Ba, 2014), whereas the translator's parameters through SGD, each with its own learning rate. We apply dropout during fine-tuning and clip the gradient norm, and we fix the maximum sequence length and the batch size. Finally, for translation sampling we select the most likely sequences through beam search with a fixed beam size and temperature. We verified empirically that probabilistic sampling performs worse (cf. Figure 3).

(a) XCOPA: zero-shot learning
      | ME      | TT: RoBERTa +
Lang  | XLM-R L | mBART (1) | Google (1) | Marian (1) | Marian (12)
en    | 85.4    | 89.8      | 89.8       | 89.8       | 89.8
et    | 72.6    | 69.8      | 82.2       | 83.4       | 84.4
ht    | 75.4    | —         | —          | 56.0       | 61.6
id    | 81.8    | 80.6      | 83.8       | 82.2       | 85.2
it    | 77.0    | 74.0      | 85.8       | 80.6       | 79.8
qu    | —       | —         | —          | —          | —
sw    | 62.8    | 50.2      | 76.6       | —          | —
ta    | 71.2    | 69.8      | 81.8       | —          | —
th    | 72.4    | 62.6      | 76.4       | 74.6       | 77.2
tr    | 71.6    | 74.6      | 83.4       | 79.6       | 83.0
vi    | 77.6    | 76.2      | 83.0       | 76.0       | 79.2
zh    | 80.2    | 82.8      | 85.2       | 82.4       | 85.2
avg   | 74.1    | 71.2      | 81.4       | 76.9       | 79.5
(b) XCOPA: few-shot learning
      | ME      | TT: RoBERTa +
Lang  | XLM-R L | mBART (1) | Google (1) | Marian (1) | Marian (12) | Marian (12) + MRT
en    | 87.4    | 89.6      | 89.6       | 89.6       | 89.6        | 89.6
et    | 73.8    | 69.8      | 82.0       | 85.4       | 86.4        | 86.6
ht    | 79.0    | —         | —          | 56.2       | 61.4        | 61.0
id    | 83.0    | 80.8      | 84.4       | 86.2       | 87.2        | 87.4
it    | 77.0    | 78.6      | 86.2       | 85.2       | 84.2        | 86.2
qu    | —       | —         | —          | —          | —           | —
sw    | 60.8    | 47.8      | 77.4       | —          | —           | —
ta    | 72.8    | 72.2      | 80.0       | —          | —           | —
th    | 75.0    | 62.0      | 82.6       | 76.2       | 78.4        | 79.8
tr    | 73.0    | 75.4      | 82.4       | 81.4       | 83.0        | 83.8
vi    | 76.2    | 77.2      | 82.0       | 79.0       | 78.0        | 81.2
zh    | 80.8    | 81.6      | 85.6       | 84.8       | 84.0        | 85.2
avg   | 74.7    | 71.7      | 82.2       | 79.3       | 80.3        | 81.4
(c) PAWS-X: zero-shot learning
      | ME      | TT: RoBERTa +
Lang  | XLM-R L | mBART (1) | Google (1) | Marian (1) | Marian (12)
en    | 95.75   | 95.99     | 95.99      | 95.99      | 95.99
de    | 90.60   | 89.54     | 91.25      | 91.05      | 91.40
es    | 91.60   | 87.79     | 92.05      | 91.45      | 91.80
fr    | 92.30   | 89.94     | 92.20      | 91.40      | 91.90
ja    | 81.59   | 77.49     | 81.09      | 72.89      | 74.54
ko    | 83.04   | 74.59     | 81.49      | 73.04      | 73.24
zh    | 84.34   | 82.04     | 85.24      | 82.44      | 82.64
avg   | 87.24   | 83.57     | 87.22      | 83.71      | 84.25
(d) PAWS-X: few-shot learning
      | ME      | TT: RoBERTa +
Lang  | XLM-R L | mBART (1) | Google (1) | Marian (1) | Marian (12) | Marian (12) + MRT
en    | 97.10   | 96.85     | 96.85      | 96.85      | 96.85       | 96.85
de    | 92.85   | 91.95     | 93.55      | 93.05      | 92.55       | 93.40
es    | 93.20   | 91.45     | 93.25      | 92.80      | 94.10       | 93.60
fr    | 93.35   | 92.20     | 93.55      | 93.55      | 93.70       | 93.55
ja    | 85.19   | 83.04     | 82.94      | 79.54      | 81.19       | 81.39
ko    | 85.69   | 79.39     | 85.54      | 80.34      | 80.49       | 80.54
zh    | 87.19   | 85.34     | 88.44      | 86.44      | 87.04       | 87.49
avg   | 89.58   | 87.23     | 89.55      | 87.62      | 88.18       | 88.33
(e) XNLI: zero-shot learning
      | ME      | TT: RoBERTa +
Lang  | XLM-R L | mBART (1) | Google (1) | Marian (1) | Marian (12)
en    | 88.84   | 91.24     | 91.24      | 91.24      | 91.24
ar    | 79.58   | 72.83     | 82.27      | 78.60      | 79.98
bg    | 83.21   | —         | 85.21      | 84.43      | 85.01
de    | 82.97   | 82.71     | 85.45      | 84.49      | 85.37
el    | 82.03   | —         | 84.09      | —          | —
es    | 84.27   | 78.56     | 86.88      | 85.73      | 86.44
fr    | 82.95   | 82.35     | 85.33      | 84.51      | 85.09
hi    | 76.48   | 73.25     | 77.26      | 63.71      | 65.14
ru    | 79.34   | 77.98     | 82.23      | 79.68      | 81.13
sw    | 72.19   | 34.66     | 75.12      | —          | —
th    | 76.92   | 45.28     | 77.40      | 74.07      | 75.34
tr    | 78.94   | 74.15     | 81.57      | 79.88      | 80.57
ur    | 72.57   | 60.55     | 71.79      | 55.30      | 55.44
vi    | 80.12   | 75.86     | 81.79      | 77.70      | 78.80
zh    | 80.00   | 78.34     | 81.73      | 79.02      | 79.78
avg   | 80.03   | 71.37     | 81.96      | 78.33      | 79.18
(f) XNLI: few-shot learning
      | ME      | TT: RoBERTa +
Lang  | XLM-R L | mBART (1) | Google (1) | Marian (1) | Marian (12) | Marian (12) + MRT
en    | 89.54   | 91.24     | 91.24      | 91.24      | 91.24       | 91.24
ar    | 81.83   | 79.78     | 84.73      | 82.61      | 83.41       | 83.57
bg    | 85.49   | —         | 87.36      | 86.46      | 87.20       | 87.20
de    | 84.59   | 85.55     | 87.20      | 86.11      | 86.86       | 87.50
el    | 83.95   | —         | 86.42      | —          | —           | —
es    | 85.77   | 81.41     | 87.66      | 87.68      | 88.08       | 88.26
fr    | 84.87   | 84.97     | 86.84      | 86.42      | 86.62       | 86.76
hi    | 79.64   | 78.84     | 80.67      | 72.73      | 74.67       | 74.97
ru    | 82.85   | 82.15     | 84.52      | 83.43      | 84.15       | 84.51
sw    | 76.12   | 41.63     | 79.70      | —          | —           | —
th    | 80.52   | 53.30     | 81.07      | 79.84      | 80.83       | 80.95
tr    | 81.35   | 79.50     | 84.28      | 83.75      | 84.31       | 84.33
ur    | 76.30   | 69.93     | 77.14      | 63.63      | 64.22       | 65.56
vi    | 83.03   | 81.07     | 84.11      | 81.85      | 83.47       | 82.97
zh    | 82.35   | 82.45     | 84.25      | 82.67      | 84.19       | 84.13
avg   | 82.05   | 75.05     | 84.00      | 81.43      | 82.34       | 82.55
Table 1: Results (Accuracy × 100) in zero-shot (left tables) and few-shot (right tables) scenarios for XCOPA (top), PAWS-X (centre), and XNLI (bottom). A dash (—) indicates that the MT system does not cover the target language. The numbers in parentheses after each MT system refer to the number of translation samples (see Section 3).

4 Results and Discussion

The main results on XCOPA, PAWS-X, and XNLI, both in zero-shot and few-shot transfer scenarios, are summarised in Table 1. In addition to the baselines, we report results with Monte Carlo sampling and MRT only for the best-performing open-source model, Marian.⁹ The scores offer multiple axes of comparison, discussed in what follows.

⁹ Moreover, these results are not available for Google MT because the API does not allow for multiple translations or fine-tuning.

Multilingual Encoders versus Translate Test. Inspecting the global trends in Table 1, the results mostly corroborate the received wisdom from prior work (Hu et al., 2020; Ruder et al., 2021; Ponti et al., 2020): translate test coupling Google MT with a monolingual English encoder yields stronger cross-lingual transfer performance on average than using a state-of-the-art massively multilingual encoder such as XLM-R Large. However, while this finding holds when using Google MT, we note that 1) it does not hold across the board with other NMT systems, and 2) ME-based transfer with XLM-R Large is confirmed as a strong non-MT transfer baseline. For instance, 1-best mBART falls behind XLM-R Large in all three tasks, and likewise for 1-best Marian MT on PAWS-X and XNLI.

A Comparison of NMT Models (1-best). Task performance varies dramatically according to the chosen MT system; what is more, it can serve as a (non-ideal) proxy of MT system quality. Google translations are by far the best (compare the average scores at the bottom of Table 1), with pronounced gains over the two competitors in all three tasks, especially in zero-shot setups. Marian MT also displays significantly better results than mBART across the board, especially on XCOPA and XNLI, despite its smaller parameter count. Arguably, this is caused by the fact that Marian MT has separate models available for each language pair, whereas mBART is massively multilingual. This effect is known as the ‘curse of multilinguality’ (Conneau et al., 2020). Finally, while Marian MT reduces the gap to Google in few-shot transfer, Google remains the strongest alternative in this setup, too, for all three tasks.

Multiple Samples and MRT. Our latent-translation approach yields consistent gains over the base 1-best MT system in all three tasks: the improvements of 12-best Marian over its 1-best variant are observed in all zero-shot and few-shot runs. The further inclusion of MRT in few-shot setups results in additional small but consistent boosts, again in all evaluation tasks. This confirms that both components, as discussed in Section 2, indeed mitigate distinct limitations of the standard translation-based approach, and thus offer complementary benefits to the final task performance. Using multiple samples with MRT, an initially weaker NMT system can recover several points of performance, even outperforming an initially stronger NMT system for some tasks and languages. For instance, MRT with 12 samples is the peak-scoring variant for tr by 2 points, despite “starting” 2 points behind the Google 1-best baseline. We observe similar trends, among others, for es and fr on PAWS-X, or for ru and zh on XNLI. In sum, this implies that latent translation transfer can enhance existing models even with zero or few examples available, at the expense of a modest increase in inference time (cf. Section 2.3).¹⁰

¹⁰ In addition to MRT, we considered fine-tuning the translator through self-training (Sennrich et al., 2016) and learning a re-ranker for weighted ensemble prediction (Dong et al., 2017). Compared to MRT, both yield sub-par results, which are reported in Table 3 in the Appendix.

Performance across Languages. The scores for individual target languages of all translate-test variants also reveal ample headroom below English performance in all three tasks, due to information lost to imperfect translation. As expected, larger gaps are observed for target languages more dissimilar to English (e.g., compare the scores of ja, ko, and zh versus es, fr, and de on PAWS-X), and for lower-resource languages with smaller amounts of parallel data (e.g., ht in XCOPA, sw in XCOPA and XNLI).

Figure 2: BLEU scores of the 1-best translation for 3 MT models (Google, Marian, and mBART) on the development sets of 3 tasks (XCOPA, PAWS-X, XNLI). Each language is a horizontal line.
Figure 3: Average cross-lingual performance of ensemble prediction based on different numbers of samples.

Performance across Tasks. Finally, a cross-task comparison reveals that the largest benefits of the translation-based approaches are observed on the (arguably most complex) XCOPA task. Here, the gains of Google 1-best and Marian 1-best over the ME approach are pronounced in both zero-shot and few-shot setups. Moreover, the use of multiple samples plus MRT yields further performance gains, the highest across all tasks. The benefits of the TT approach compared to ME are smaller on XNLI, and absent on PAWS-X, which is the most saturated and least linguistically diverse task (Ruder et al., 2021). However, the results across all three tasks do confirm the benefits of the proposed latent translation-based approach, which always improves over the base MT system.

5 In-Depth Analysis

Figure 4: MT quality vs accuracy gains when using 12 rather than 1 translation samples across tasks and languages.

Number of Samples. We plot the effect of varying the number of translation samples from 1 to 12 on XCOPA accuracy in Figure 3. Accuracy increases monotonically with the number of samples, and the lack of a plateau in the considered interval suggests that larger sample sizes could improve it even further. Moreover, note that k-best deterministic sampling (beam search) appears superior to probabilistic sampling for all values of k.

Translation Quality. In Figure 2, we report the BLEU scores¹¹ of the 1-best translation of all NMT systems on the development sets of all languages. We source the gold references from the English datasets (COPA, PAWS, and SNLI) from which XCOPA, PAWS-X, and XNLI, respectively, were manually translated. The violin plot reveals large gaps in BLEU based on the distance from English. This is most evident in PAWS-X, with ja, ko, and zh at the bottom and de, es, fr at the top. Moreover, BLEU levels vary by task: while XCOPA has the shortest sentences, it is also the most typologically diverse. This makes the dataset easier to translate in some respects, but harder in others.

¹¹ Evaluated through sacrebleu (Post, 2018).
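For reference, the snippet below shows the kind of corpus-level BLEU computation mentioned in the footnote, using sacrebleu (Post, 2018); the sentences are placeholders, not data from the benchmarks.

```python
import sacrebleu

hypotheses = ["He drank too much at the party."]          # 1-best MT outputs
references = [["The man drank too much at the party."]]   # one stream of gold English references
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)
```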

Given these scores, one might wonder about the relationship between 1-best BLEU and the gains due to multiple samples. Figure 4 depicts these two quantities. For XCOPA and XNLI, there is a trend for higher gains with lower translation quality: ht is the clearest example, with only 12.6 BLEU and a gain of around 5 accuracy points. However, for PAWS-X we observe almost no linear correlation, which might be caused by the overall gains being much smaller. There might also be a minimum translation quality below which multiple samples cannot bring benefits, which would explain the outlier ur in XNLI. On the other extreme, in a few languages with the best-quality MT, such as it for XCOPA and de for PAWS-X, multiple samples seem to mislead the classifier in few-shot transfer.

Translation Ranking. The fact that lower-scoring translations contribute positively during ensemble prediction is counter-intuitive. Yet, we verify that among the k-best translations, higher-ranking ones are not necessarily associated with better classification accuracy, even when their log-probability is significantly greater (see Figure 5 in the Appendix). Table 2 shows an example for tr in XCOPA, where lower-ranking translations turn the ensemble decision towards the correct label when the highest-ranked translation leads to the wrong classification. In this case, they resolve an incorrect disambiguation of gender. However, in other cases it is hard to interpret how minor lexical changes affect the ensemble prediction.

NLL    | Premise                              | Hyp. 1                             | Hyp. 2                              | Pred.
Source | Adam partide çok içti.               | Ertesi gün başı ağrıdı.            | Ertesi gün burnu aktı.              |
-1.49  | He drank too much at the party.      | She had a headache the next day.   | The next day, he had a runny nose.  | 1
-2.07  | The guy drank too much at the party. | He had a headache the next day.    | The next day, his nose leaked.      | 0
-2.26  | The guy drank a lot at the party.    | He got a headache the next day.    | The next day, his nose ran.         | 0

Table 2: Marian MT translations from ranks 1-3 for XCOPA tr, with model scores (NLL) and classifier predictions (Pred.). The prompt for this example is: “What happened as a result?”, and the correct label is 0.

These insights offer a provocative question for future work: can similar application-oriented evaluations of MT systems reach beyond standard intrinsic evaluation protocols such as BLEU (Papineni et al., 2002) or METEOR (Banerjee and Lavie, 2005)? In other words, assessing how well NMT models support cross-lingual transfer might provide additional empirical evidence on their translation abilities.

6 Conclusion and Future Work

We proposed a new method to perform translation-based cross-lingual transfer, by treating the translation of the input text in a target language as a latent random variable. This unifies under a single model both components (a translator and a classifier) of the traditional pipeline, which were previously learned separately and deployed in consecutive steps. As a consequence, in our model, 1) multiple translations can be generated with Monte Carlo sampling to better render the original meaning, and 2) the translator can be adapted to the downstream task and correct its errors based on the feedback from the classifier, through a variant of Minimum Risk Training. We demonstrate the effectiveness of our method on several benchmarks for natural language understanding, including commonsense causal reasoning, paraphrase identification, and natural language inference. Furthermore, we find that classification performance varies dramatically according to the translation quality of the underlying translator model, whereas its internal ranking of k-best translations plays almost no role. We hope that our findings will provide an incentive to improve the language coverage and quality of (especially public) NMT models to support a wide array of multilingual NLP applications.

References

Appendix A Translation Quality Across Ranks

Figure 5 shows that, in XNLI, the share of correct predictions is relatively stable across translations of different ranks within each language. For some languages the top-ranking translation is the best candidate (e.g., es, ur), but for others a lower-ranked translation may be (e.g., the 3rd best for hi or the 8th best for th).

Appendix B Additional Results

      | ME      | TT: RoBERTa + Marian
Lang  | mBART   | + Self-train | + Re-ranker
en    | 77.0    | 89.6         | 89.6
et    | 56.2    | 86.6         | 70.4
ht    | —       | 58.6         | 59.6
id    | 72.8    | 86.6         | 59.4
it    | 61.2    | 85.4         | 64.6
qu    | —       | —            | —
sw    | 54.4    | —            | —
ta    | 61.4    | —            | —
th    | 67.8    | 78.4         | 59.4
tr    | 64.6    | 84.8         | 77.2
vi    | 67.6    | 78.0         | 60.6
zh    | 67.4    | 85.4         | 64.8
avg   | 63.7    | 79.3         | 64.5

Table 3: Additional few-shot learning results on XCOPA for an alternative multilingual encoder pre-trained on NMT (mBART, used ME-style) and alternative auxiliary objectives for the Marian-based TT approach (self-training and re-ranking). A dash (—) indicates that the model does not cover the target language.

Appendix C Related Work

Minimum Risk Training (MRT) is a technique for tuning MT models towards given evaluation metrics. It was introduced for statistical MT models (Och, 2003; Arun et al., 2010; Smith and Eisner, 2006) and adapted for NMT (Shen et al., 2016; Edunov et al., 2018; Wieting et al., 2019; Wang and Sennrich, 2020; Saunders et al., 2020). The advantages over classic maximum-likelihood training with reference translations are that it mitigates exposure bias and addresses the loss-evaluation mismatch (Ranzato et al., 2016; Wiseman and Rush, 2016; Wang and Sennrich, 2020) by incorporating the evaluation metric directly into the loss, rewarding high-scoring model outputs and penalising lower-scoring ones. This metric does not need to be differentiable (e.g., BLEU), and gradients are approximated with Monte Carlo sampling. MRT has been found effective not only in NMT, but also in other sequence-to-sequence NLP tasks such as abstractive summarisation (Edunov et al., 2018; Makino et al., 2019), string transduction (Makarov and Clematide, 2018), and referring expression generation (Panagiaris et al., 2020). While each task comes with its own evaluation metrics, the rewards can also come from other neural models (He et al., 2016; Wieting et al., 2019), user feedback (Kreutzer et al., 2018), or a downstream task such as the successful execution of a semantic parse (Misra et al., 2018; Jehl et al., 2019) or cross-lingual information retrieval (Sokolov et al., 2014). The latter works are similar to ours in that we leverage downstream task signals to adapt the MT model.

Appendix D Details for Reproducibility

We carried out all our experiments on a single 48GB RTX 8000 GPU with Turing architecture. On average, runtime is on the order of hours for fine-tuning on the English training set, and of minutes per language both for fine-tuning on the target-language development set (few-shot learning) and for evaluation.