
FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

by   Parker Riley, et al.

We present FRMT, a new dataset and evaluation benchmark for Few-shot Region-aware Machine Translation, a type of style-targeted translation. The dataset consists of professional translations from English into two regional variants each of Portuguese and Mandarin Chinese. Source documents are selected to enable detailed analysis of phenomena of interest, including lexically distinct terms and distractor terms. We explore automatic evaluation metrics for FRMT and validate their correlation with expert human evaluation across both region-matched and mismatched rating scenarios. Finally, we present a number of baseline models for this task, and offer guidelines for how researchers can train, evaluate, and compare their own models. Our dataset and evaluation code are publicly available.




1 Introduction

Machine translation (MT) has made rapid advances in quality in recent years, and has achieved impressive performance for many language pairs, especially high-resource pairs for which parallel data is widely available. Typically, MT inputs and outputs are only specified at the coarse level of a language, such as Spanish or Hindi. However, some prior work has explored finer-grained distinctions, such as between regional language varieties of Arabic Zbib et al. (2012), or specific levels of politeness in German Sennrich et al. (2016). Many more such regional and stylistic distinctions could be made. Unfortunately, most approaches to style-targeted translation thus far rely on large datasets with the relevant varieties explicitly labeled Zbib et al. (2012); Lakew et al. (2018); Costa-jussà et al. (2018); Honnet et al. (2018); Sajjad et al. (2020); Wan et al. (2020); Kumar et al. (2021), and in many cases these resources are unavailable or expensive to create.

Figure 1: FRMT requires a machine translation model to adapt its output to be appropriate for a specific region, such as Brazil (left) or Portugal (right). Because only a few exemplars are provided to convey the target region, methods that perform well on FRMT can likely extend to other regions and styles.

In this paper, we explore a setting for MT where training data not labeled for region is plentiful for the desired language pair, but only a few examples (10 or 100) are annotated for the target varieties. As a specific use-case, we examine translation into regional varieties: Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin. While these varieties are mutually intelligible, they often exhibit significant lexical, syntactic, or orthographic differences, which can negatively affect an MT user’s experience.

As today’s MT systems do not typically utilize region or style labels, they tend to be biased towards varieties that are more prevalent within their training data. We observe this bias in a widely used proprietary MT system, with measurable negative effects for speakers of minority varieties (see section §6.2). To encourage more equity and access to language technologies for these groups, and to catalyze further NLP research around regional language varieties, we make the following contributions: (1) We construct and release FRMT, a new dataset for evaluating few-shot region-aware translation from English to Brazilian/European Portuguese and Mainland/Taiwan Mandarin. (2) We gather predictions from a number of existing and custom-trained baseline systems on the FRMT task. (3) We conduct detailed human evaluations of gold and model-based translations on FRMT, under all combinations of rater region and target region. (4) We propose automatic evaluation metrics for FRMT that correlate with human evaluations, as well as a new targeted metric for lexical accuracy.

2 Related Work

Previous work on textual “style transfer” or “attribute rewriting” is related in trying to control fine-grained stylistic features of generated text. Earlier work in this space leverages supervised parallel data Jhamtani et al. (2017). Later work assumes labeled but non-parallel training data Shen et al. (2017); Li et al. (2018); Niu et al. (2018a), or foregoes training-time labels entirely, as in our setting, relying only on few-shot exemplars provided at inference time Xu et al. (2020); Riley et al. (2021); Garcia et al. (2021). However, there is broad agreement that style transfer evaluation protocols are far from ideal Pang and Gimpel (2019); Briakou et al. (2021); Hu et al. (2022), due to the underspecification of stylistic attributes (e.g. formality, sentiment) and the lack of standardization across studies. Region-aware translation addresses these issues, providing an excellent test-bed for exploring few-shot attribute control—MT evaluation methods are relatively mature, and many regional language varieties can be sufficiently delineated for the task.

Previous work has explored many sub-types of variety-targeted MT. Region-aware MT targets specific regions or dialects Zbib et al. (2012); Costa-jussà et al. (2018); Honnet et al. (2018); Lakew et al. (2018); Sajjad et al. (2020); Wan et al. (2020); Kumar et al. (2021). Formality-aware MT targets different formality levels Niu et al. (2017, 2018b); Wang et al. (2019). Personalized MT aims to match an individual’s specific style Michel and Neubig (2018); Vincent (2021). However, with few exceptions (e.g. Garcia et al. 2021), these works assume the availability of large-scale datasets containing parallel or monolingual examples with the target varieties explicitly labeled. In the present work, we emphasize the value of few-shot adaptability, and design a benchmark with this in mind. Although our dataset is limited to four regions of two languages, the few-shot setup means that approaches performing well on FRMT can be expected to generalize well to other languages, other regions, and other stylistic attributes.

Several existing parallel corpora cover regional language varieties, but have limitations that motivate us to construct a new high-quality dataset that is targeted and readily accessible. e-PACT Barreiro and Mota (2017) comprises translations from English books into Portuguese variants, but is small and not easily accessible. OpenSubtitles Lison et al. (2018) skews toward shorter utterances and is noisy due to automatic alignment. WIT3 Cettolo et al. (2012) provides translations of TED-talk transcripts into many languages, but relies on volunteer translators, which may limit quality.

Popular shared tasks have not included region-targeted translation either: The Workshop on Machine Translation (WMT) has included translation between similar languages (e.g. Akhbardeh et al., 2021), while the Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial) focuses mainly on classification and not translation (e.g. Zampieri et al., 2021).

Furthermore, we are not aware of previous work that (i) measures deltas in human evaluation metrics between the region-matched and -mismatched setting, (ii) correlates these with automated metrics, (iii) offers tailored sub-tasks targeting region-differentiated lexical items and region-biased distractors, or (iv) defines targeted metrics testing region-appropriateness.

3 FRMT Dataset

We introduce the FRMT dataset for evaluating the quality of few-shot region-aware machine translation. The dataset covers two regions each for Portuguese (Brazil and Portugal) and Mandarin (Mainland and Taiwan). Our overall approach for constructing the dataset is to sample English sentences from Wikipedia and acquire professional human translations into the target regional varieties. Final quality control is done through manual evaluation by an independent set of translators, using the MQM protocol Freitag et al. (2021a) that we also employ to evaluate system translation quality.

3.1 Data sampling method

FRMT seeks to capture region-specific linguistic differences, as well as potential distractors in the form of entities that are strongly associated with one region, without necessarily having a different linguistic rendering (e.g., Lisbon vs. São Paulo). To this end, we divide the dataset into three buckets (lexical, entity, random), each containing human translations of sentences extracted from different sets of English Wikipedia articles. [1]

[1] As the Wikipedia data source, we use the training split of wiki40b (v1.3.0) by Guo et al. (2020).

Lexical: From various web sources, we manually collected English lexical items for which the best translation into the target language differs depending on the target region. Given a triple (t, t_A, t_B) consisting of an English term t and its translations into regional varieties A and B, a native speaker of region A confirmed that t_A is the commonly used term for their region while t_B is not, and vice versa for region B. This was done independently for Mandarin and Portuguese as target languages, yielding term lists of 15 and 23 items, respectively. We then extract up to 100 sentences from the beginning of each English Wikipedia article with term t as its title.


Entity: We manually select a balanced number of entities per language region such that they are strongly associated with one region, relying primarily on world knowledge and following Wikipedia hyperlinks. Selection is done within each of a few broad entity types defined at the outset: people, locations, organizations, attractions/infrastructure, and other. To mitigate availability bias in picking entities, we also took inspiration from automatically extracted entity mentions most distinctive (by log-odds) of either top-level web domain across the mC4 corpus Xue et al. (2021), for .pt vs. .br and .cn vs. .tw. The final selection comprises 38 Mandarin-focused and 34 Portuguese-focused entities. Again, we extract up to 100 source sentences from the beginning of the English Wikipedia article about each selected entity.

Random: To include more naturally distributed phenomena, we also sampled 100 articles at random from the combined set of 28k articles appearing in Wikipedia’s “featured” and “good” article collections (as of 2021-12-15). We accept the inherent selection bias of such curated collections in exchange for improved text quality and factual accuracy. Here, we can extract less text from more articles, taking up to 20 contiguous sentences from the start of a randomly chosen section within each article. The latter step counteracts the somewhat formulaic article structure common to the Wikipedia genre. Unlike the other two buckets, this one features one common set of sentences to be translated into all four target variants.
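The random-bucket sampling step above (a randomly chosen section per article, up to 20 contiguous sentences from its start) can be sketched as follows; the function name and data layout are illustrative assumptions, not the actual FRMT tooling:

```python
import random

def sample_random_bucket(article_sections, max_sentences=20, seed=0):
    """Choose one section of an article at random and return up to
    `max_sentences` contiguous sentences from its start."""
    rng = random.Random(seed)
    section = rng.choice(article_sections)  # each section: list of sentences
    return section[:max_sentences]
```

Starting from a random section, rather than always from the article lead, is what counteracts the formulaic Wikipedia article structure noted above.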

3.2 Human translation

Fourteen paid professionals translated the selected English texts into the four target language variants: four translators per Mandarin region and three per Portuguese region. Each sentence was translated by one translator. Translation was done one sentence at a time, in the order of the original text. Translators were instructed to keep the original wording and orthography on the few occasions where the English source text includes phrases in a non-English language.

A small number of source sentences were rejected by one or more translators; reasons include excessive amounts of non-English text and incomplete sentences. We filtered out all sentences that were rejected by at least one translator; these are not included in our released dataset.

3.3 Corpus statistics

For each bucket, we split our data into exemplar, development (dev), and test data. The exemplars are intended to be the only pairs where the region label is shown to the model, such as via few-shot or in-context learning Brown et al. (2020).

Table 1 reports the number of released sentence pairs for each combination of bucket, split, and language. For each document in our dataset, all sentences from that document appear only in a single split—this ensures that a system cannot “cheat” by memorizing word-region associations from the exemplars, or by overfitting to words and entities while hill-climbing on the validation set.

Bucket   | Split    | Portuguese | Mandarin
Lexical  | Exemplar |  118 |  173
         | Dev      |  848 |  524
         | Test     |  874 |  538
Entities | Exemplar |  112 |  104
         | Dev      |  935 |  883
         | Test     |  985 |  932
Random   | Exemplar |  111 |  111
         | Dev      |  744 |  744
         | Test     |  757 |  757
Total    | Exemplar |  341 |  388
         | Dev      | 2527 | 2151
         | Test     | 2616 | 2227
Table 1: Number of sentence pairs by bucket, split, and language, as well as cross-bucket totals. Note that the random bucket contains the same English source sentences across the Portuguese and Mandarin sets.
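The document-level split assignment described above (all sentences of a document land in a single split) can be sketched as below; the function names and split fractions are illustrative assumptions, not the ones used to build FRMT:

```python
import hashlib

def assign_split(doc_id, exemplar_frac=0.05, dev_frac=0.475):
    """Deterministically map a document to one split, so every sentence
    drawn from that document lands in the same split."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if u < exemplar_frac:
        return "exemplar"
    if u < exemplar_frac + dev_frac:
        return "dev"
    return "test"

def split_corpus(sentence_pairs):
    """sentence_pairs: iterable of (doc_id, sentence) tuples."""
    splits = {"exemplar": [], "dev": [], "test": []}
    for doc_id, sentence in sentence_pairs:
        splits[assign_split(doc_id)].append(sentence)
    return splits
```

Hashing the document id (rather than the sentence) is what prevents the “cheating” scenario: no word–region association visible in the exemplars can leak into dev or test sentences from the same article.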

4 Evaluation Metrics

While human judgments are the gold standard for evaluating machine-generated text, collecting them can be difficult, time-consuming, and expensive. Therefore, the best practical method for fast iteration on benchmarks is generally to use automatic metrics that are shown to correlate well with human judgments. We hypothesize that common reference-based MT evaluation metrics may have differing sensitivities to regional differences, so we conduct a human evaluation of several baseline models (see §6.1) and compute the correlation of several automatic metrics with the human judgments. We also explore two additional metrics specifically for region-awareness: a modified version of our human metric, and an automatic lexical accuracy metric.

4.1 Human evaluation

To obtain the highest fidelity human ratings, we use the expert-based Multidimensional Quality Metrics (MQM) evaluation framework proposed by Freitag et al. (2021a). This is the same framework recommended by the WMT’21 Evaluation Campaign (Freitag et al., 2021b). For our evaluation, expert raters are shown a chunk of 10 contiguous English sentences from our test set with the corresponding translations from one combination of model (or human) and target region. Raters then identify errors in the translations, assigning a category and severity to each; see Freitag et al. (2021a) for details. Due to cost constraints, we evaluate 25% of the test set, evenly distributed across our three evaluation buckets. Each chunk is rated by three raters.

Each output is shown to raters of both regions of the corresponding language; for example, all European Portuguese outputs are shown to speakers of Brazilian Portuguese and to speakers of European Portuguese. All Mandarin outputs are first automatically transliterated into the regional Han script (Mainland: simplified, Taiwan: traditional).

4.2 Automatic translation quality metrics

We evaluate the following automatic, reference-based metrics:

BLEU (Papineni et al., 2002): This metric is based on token n-gram precision. We use sacrebleu.corpus_bleu from Post (2018).

chrF (Popović, 2015): This metric is based on F1 over character n-grams. We use sacrebleu.corpus_chrf from Post (2018).

BLEURT (Sellam et al., 2020): This is a learned, model-based metric. While the original paper showed that BLEURT has good correlation with human judgments of translation quality, we note that BLEURT has not, to our knowledge, been evaluated with respect to human judgments of region-specific translation quality. We use the authors’ released checkpoint.

BLEURT-D{3,6,12} (Sellam et al., 2020): These are distilled versions of BLEURT that are less resource-intensive to run. They have 3, 6, and 12 layers, respectively. As above, we use checkpoints released by the authors.
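As a rough illustration of the character n-gram metric above, here is a toy chrF-style scorer. It is a simplification of Popović (2015) (default-like settings: n = 1..6, β = 2, whitespace ignored); reported results should use the sacrebleu implementation named above:

```python
from collections import Counter

def char_ngrams(text, n):
    text = text.replace(" ", "")  # ignore whitespace, as chrF does by default
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Toy chrF-style score: average character n-gram F_beta over n=1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
            continue
        b2 = beta * beta
        scores.append((1 + b2) * prec * rec / (b2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

An identical hypothesis and reference score 100; any divergence in character n-grams lowers the score.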

4.3 Correlation

For computing correlation, each data point corresponds to a score on a 10-sentence chunk of model output, covering the three models discussed in §6.1. For MQM, this is the average of 30 weighted ratings: one per sentence per rater, with category/severity weights as described in Freitag et al. (2021a). For BLEU and chrF, which are corpus-level metrics, we treat the 10 input/output sentence pairs as the ‘corpus’. For BLEURT, we use the average sentence-level score.
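The chunk-level correlation computation can be sketched in a few lines of stdlib Python. The scores below are invented for illustration, and the final negation mirrors the sign convention used in the paper's Table 2 (MQM is lower-is-better, the automatic metrics are higher-is-better):

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def kendall_tau(xs, ys):
    # Tau-a: (concordant - discordant) / total pairs; ties ignored.
    conc = disc = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        conc += s > 0
        disc += s < 0
    return (conc - disc) / (len(xs) * (len(xs) - 1) / 2)

# Invented chunk-level scores: MQM (lower is better) vs. a metric where
# higher is better. Negate so a well-behaved metric scores positively.
mqm = [1.2, 0.4, 2.5, 0.9]
metric = [60.0, 80.0, 40.0, 71.0]
print(round(-100 * kendall_tau(mqm, metric), 1),
      round(-100 * pearson(mqm, metric), 1))  # prints: 100.0 99.0
```

In practice one would use a statistics package (e.g. scipy.stats), which also handles ties and reports significance.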

Table 2 presents the correlation results. Note that because high quality corresponds to a low MQM score but a high score in each automated metric, we report negative correlations for intuitiveness. We also multiply each value by 100.

Metric     | Kendall’s τ | Pearson’s r
chrF       | 43.5 | 48.3
BLEU       | 44.3 | 56.6
BLEURT-D3  | 49.5 | 61.7
BLEURT-D6  | 50.4 | 62.8
BLEURT-D12 | 51.0 | 63.7
BLEURT     | 52.2 | 65.2
Table 2: Coefficient of correlation between human MQM ratings and several automated metrics. chrF has the lowest correlation, with BLEU performing slightly better. All BLEURT models outperform the non-learned metrics, with the full-size model achieving higher correlation than the smaller distillations thereof.

We can make a few observations from Table 2. Firstly, the learned, BERT-based metrics outperform the non-learned metrics, validating the observation from Sellam et al. (2020) that neural methods outperform n-gram-based methods at this task. Secondly, the teacher model (BLEURT) outperforms the distilled student models, and student size makes a relatively modest difference after the initial loss from distilling at all.

4.4 Reweighted MQM

MQM breaks errors down into a number of fine-grained categories, but the framework does not aim to explicitly detect errors deriving from region-specific differences in the target language. As such, aggregate MQM scores serve as a measure of overall translation quality, but they may not be ideal as a targeted measure of how well a particular model localizes its outputs to a region. We will see in §6 that gold translations matching the target region do score better on MQM than those from a mismatched region. However, a strong translation model that entirely ignores regional differences may still outscore a weaker model that is region-aware.

To derive a score that is more sensitive to the features of regional varieties, we filter MQM error categories to those most indicative of region mismatch. Specifically, for each language (Portuguese and Mandarin), we train a logistic regression model over ratings of gold translations to predict whether the rater’s region is matched (0) or mismatched (1) to the translator’s region. The model uses error types as features, and assigns positive coefficients to errors that are associated with region mismatch. We designate features as “mismatch predictive” if they have both a positive coefficient and a significant p-value (<0.05). [3] Having identified these error types for each language, we run a second regression restricted to just these features. We calculate “Reweighted MQM” (R-MQM) as the probability this model assigns to the mismatch category, giving scores in the range [0, 1], as opposed to raw MQM scores, which are unbounded above if arbitrarily many errors are found.

[3] We believe a priori that no error type should be robustly indicative of matching region, as having a matched translator and rater should ideally result in no errors.

R-MQM provides a more targeted signal as to whether a model is capturing region-specific features in its predictions. A strongly region-aware model should have large deltas in R-MQM between region-matched and region-mismatched raters, similar to the R-MQM scores of gold translations. Conversely, if R-MQM is similar across the matched/mismatched settings, the model is not making progress on region-aware translation.
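A minimal sketch of the second-stage regression behind R-MQM, using a hand-rolled logistic regression on toy error-type counts. The data and feature names are invented for illustration; a real analysis would use a statistics package that also reports the p-values needed for the feature-selection stage:

```python
import math

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Logistic regression via batch gradient descent.
    X: rows of error-type counts; y: 1 = rater/translator regions mismatched."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, xi)) + b)))
            err = p - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def r_mqm(w, b, error_counts):
    """R-MQM: model probability of the 'mismatch' class, bounded in [0, 1]."""
    z = sum(wj * xj for wj, xj in zip(w, error_counts)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy data: counts of two hypothetical "mismatch predictive" error types
# per rating of a gold translation.
X = [[0, 0], [1, 0], [0, 1], [2, 1], [3, 2], [0, 0], [1, 1], [2, 0]]
y = [0, 0, 0, 1, 1, 0, 1, 1]
w, b = train_logreg(X, y)
```

A translation whose ratings contain many mismatch-predictive errors receives an R-MQM near 1, while a clean translation receives a score near the base rate.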

4.5 Lexical accuracy

To assess a model’s ability to select lexical forms appropriate to the target region, we define a lexical accuracy metric. As discussed in section §3.1, sentences in our lexical bucket are chosen from Wikipedia articles containing specific words we expect to have distinct translations in the regions in question. For instance, we include source sentences from the English Wikipedia article “Bus” in the Portuguese lexical bucket, as the word for bus is distinct in Brazil and Portugal (ônibus vs. autocarro). As the expected output forms are known ahead of time, we can directly measure the rate at which a model selects region-appropriate variants.

Starting from the list of terms used to select Wikipedia articles for the lexical bucket, we remove the terms selected for the exemplars split in order to test generalization to unseen terms. This results in 18 terms in Portuguese and 13 in Mandarin; each term has two variants, one for each region.

We calculate the metric over all model outputs for the lexical bucket, covering both regional outputs of a given language. For each term, we count the number of sentences containing a matched variant and the number containing a mismatched variant; summing these counts over all terms gives n_match and n_mismatch, respectively. From this, we calculate the model’s lexical accuracy (LA) score for the given language as:

LA = n_match / (n_match + n_mismatch)
Note that sentences containing both a matched and mismatched variant are counted in both tallies, and that sentences containing neither are not counted at all. We chose to count the number of sentences containing (mis)matched variants of each term instead of the total number of variant occurrences because we observed rare cases of degenerate model predictions that repeated a single term variant many times, which would skew the metric.

To account for Portuguese inflection, we considered matching lemmatized forms rather than surface forms, but found little difference in the resulting scores. We thus report results using naive surface matching, which avoids a dependency on a specific lemmatizer and improves reproducibility.

In Mandarin, there is an added complexity of script choice. Mainland Mandarin is typically written in the “simplified” Han script, while Taiwan Mandarin is written in “traditional” script. To disentangle lexical choice from script choice, we define lexical accuracy in a script-agnostic manner. For example, for the word pineapple, if the target region is Taiwan, we count both the (expected) traditional and (unexpected) simplified forms of the Taiwan variant fènglí (鳳梨 and 凤梨) as correct, while counting both script forms of the Mainland variant bōluó (菠萝 and 菠蘿) as incorrect. This ensures that even if a model is outputting the wrong script, it will be rewarded or penalized based on its lexical choices. This also prevents the possibility of “gaming” our metric by only using the lexical forms and script of a single region.
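Putting the counting rules and the script-agnostic matching together, lexical accuracy can be sketched as follows. The variant table is a one-term toy example using the pineapple case above, and the function name and data layout are assumptions:

```python
def lexical_accuracy(outputs, term_variants, target_region):
    """LA = n_match / (n_match + n_mismatch), counted per sentence per term.
    term_variants maps term -> {region: surface forms}; for Mandarin each
    set holds both the simplified and traditional script forms, which makes
    the metric script-agnostic. Assumes exactly two regions per language."""
    n_match = n_mismatch = 0
    for sentence in outputs:
        for term, by_region in term_variants.items():
            other = next(r for r in by_region if r != target_region)
            # A sentence containing both a matched and a mismatched variant
            # counts in both tallies; one containing neither is not counted.
            n_match += int(any(f in sentence for f in by_region[target_region]))
            n_mismatch += int(any(f in sentence for f in by_region[other]))
    total = n_match + n_mismatch
    return n_match / total if total else 0.0

# One-term toy example: "pineapple" in Mandarin, target region Taiwan.
variants = {"pineapple": {"TW": {"鳳梨", "凤梨"}, "CN": {"菠萝", "菠蘿"}}}
```

Counting sentences rather than raw occurrences implements the safeguard described above against degenerate outputs that repeat a single variant many times.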

We would like to emphasize that lexical choice is only one facet of region-aware translation. Regional varieties may also differ in morphology, syntax, and other ways. Thus, our lexical accuracy metric should not be viewed as a holistic assessment of quality. Nevertheless, it is easy to calculate automatically, and covers one challenging aspect of the problem, so we believe it will be a useful measure to monitor and iterate on. At a minimum, if a model scores poorly on lexical accuracy, we can safely say that it has not solved region-aware translation.

4.6 Reporting FRMT results

For the FRMT task (as opposed to the dataset), we stipulate a key “few-shot” restriction: candidate models may not be intentionally exposed to any region-labeled data at any point during training. This restriction covers both region-labeled monolingual data and region-labeled parallel translation data. [4] While it may not be difficult to obtain region labels for Brazil/Portugal or Mainland/Taiwan (e.g. by filtering web pages on top-level web domain), we intend for FRMT to serve as a measure of few-shot generalization to arbitrary regions and language varieties, for which obtaining labels may be much harder.

[4] Models may train on multilingual web-crawl data, as is common practice, as long as supervised region labels are not provided. We allow that some implicit or explicit region labels may appear within the unsupervised data.

Researchers sharing FRMT results should report per-bucket BLEU and lexical accuracy metrics on the test set, as shown in Tables 3 and 4. These metrics are efficient to calculate with our provided evaluation scripts.

We also recommend reporting BLEURT scores, but recognize that this may not always be possible, as it requires significantly more computational resources. Similarly, we encourage human evaluation using MQM as a gold standard, but do not wish to promote this as a community metric, as it is impractical for many researchers, and may not be comparable across research groups due to differences in rater pools and training.

Finally, for any model candidate, it is important to report how many exemplars were supplied at inference time. This number might be 0 for some approaches. To improve comparability, we recommend using either 0, 10, or 100 exemplars per region.

5 Baseline Models

We evaluate a handful of new and existing academic MT models that claim some ability to provide few-shot controllable region-aware translations. We also evaluate a commercial MT system that does not distinguish between these regional varieties.

We evaluate a number of models based on the model of Siddhant et al. (2022), which we abbreviate as “M4” (standing for massively multilingual, massive machine translation). Our models are trained on a mixture of monolingual and parallel data from 112 languages mined from the web, and we follow Arivazhagan et al. (2019) by upsampling low-resource language pairs via temperature sampling.

The first model we evaluate is based on the Universal Rewriter of Garcia et al. (2021), which we abbreviate as UR. It is designed to support multilingual style transfer and translation, making it a natural fit for our benchmark. As in that work, it is initialized from an mT5-XL checkpoint (Xue et al., 2021) and finetuned on a combination of monolingual and parallel data from mC4 and OPUS, respectively. The only difference is that we train a version with a different sequence length, to be directly comparable to our other models.

Our second baseline model is called M4 UR and is the same as our UR model but is finetuned from a pretrained M4 model instead of an mT5 model. We hypothesize that initializing from a pretrained model explicitly designed for translation (i.e. M4) will outperform one trained as a general language model (i.e. mT5).

Our third baseline model is like our M4 UR model, but instead of using the exemplar-based finetuning technique to introduce regional-variety control, we adopt an inference-time prompting technique: we bias the model’s generation by attaching a specific natural-language prefix (prompt) to each input, despite never explicitly training the model to distinguish regional varieties. This is based on earlier work showing the effectiveness of this technique for large language models (Brown et al., 2020; Wei et al., 2022; Sanh et al., 2022), and more recent work applying it to region-aware MT (Garcia and Firat, 2022). We call this model M4 Prompts. We select the prompts manually based on experimentation with data from the dev split of our evaluation set; an example prompt is “A Brazilian would write it like this:”. The translated version of the prompt is removed from the model’s output prior to evaluation. Note that this technique does not use exemplars.

Our fourth baseline model is a version of the M4 Prompts model that underwent a finetuning stage on a version of the M4 data in which the source-side language tags used to indicate the target language were replaced with prompts of the form “Translate to X:”, where “X” is either “Portuguese” or “Chinese”. This model, called M4 Prompts Finetuned, is designed to explicitly introduce prompting behavior. At inference time, we test the model’s ability to generalize to unseen language (variety) names by replacing “X” with the variety name (e.g. “Brazilian Portuguese”). As with the M4 Prompts model, the prompts are manually selected based on experimentation with dev data, and exemplars are not used.

Our next three baseline models are different-sized versions of PaLM (Chowdhery et al., 2022), a large language model that has demonstrated remarkable zero-shot and few-shot performance on a variety of tasks, often without any fine-tuning. Because it was pre-trained on a multilingual corpus, we include it as a representative of the class of recent large decoder-only causal language models that includes GPT-3 (Brown et al., 2020) and OPT-175B (Zhang et al., 2022). We call these models PaLM 540B, PaLM 62B, and PaLM 8B, referring to their approximate parameter counts. The prompt for these models is larger than for the M4 Prompts models and is exemplar-based. It begins with “Translate the following texts from English to X”, where “X” is the name of the language variety. This is followed by ten exemplars selected randomly from the lexical bucket, [5] where each exemplar spans two lines: the first contains the English text, prefixed by “English:”, and the second contains the translation in the target variety, prefixed by the variety’s name. At the end of the prompt, we show the model the input text and the language-variety prefix, then decode from the model, taking everything up to the next newline as the output (or the entire decoded output if no newline is generated).

[5] The model has a fixed input sequence length, which includes the prompt, and a fixed output sequence length. Through rejection sampling, we ensure that the ten exemplars are short enough to leave at least 128 tokens for the input text, matching the 128 tokens allotted to the output.
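The prompt construction for the PaLM baselines can be sketched as below. Whitespace token counts stand in for the model's actual tokenizer, and the budget value and helper name are illustrative assumptions:

```python
import random

def build_fewshot_prompt(exemplar_pool, source_text, variety,
                         n_exemplars=10, budget=512, seed=0):
    """Assemble the exemplar-based prompt, rejection-sampling exemplars
    until the prompt leaves room in the length budget for the input."""
    rng = random.Random(seed)
    header = f"Translate the following texts from English to {variety}"
    while True:
        lines = [header]
        for en, target in rng.sample(exemplar_pool, n_exemplars):
            lines.append(f"English: {en}")
            lines.append(f"{variety}: {target}")
        if len("\n".join(lines).split()) + len(source_text.split()) <= budget:
            break  # accepted: exemplars leave enough room for the input
    lines.append(f"English: {source_text}")
    lines.append(f"{variety}:")  # the model's continuation is the translation
    return "\n".join(lines)
```

The decoded continuation would then be truncated at the first newline, per the procedure described above.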

Finally, we examine a publicly-available commercial MT model, which we refer to as Online. It does not support regional varieties for Portuguese or Mandarin. We evaluate this system primarily to test the hypothesis that systems which do not distinguish regional varieties will exhibit a strong bias toward the majority variety, as determined by text availability online.

6 Baseline Model Performance

6.1 Human evaluation results

Figure 2: MQM (top) and Reweighted MQM (bottom) scores (lower is better) for gold translations and model predictions in Portuguese (left) and Mandarin (right). Thick “match” bars show scores from raters in the target region; thin “mismatch” bars show scores from raters in the opposite region. In all conditions, raters prefer region-matched gold translations. PaLM is rated best among our baselines, but still has room for improvement, particularly in Mandarin.

We select three baseline models for human evaluation: M4-UR, M4-Prompts, and PaLM 540B. These were selected to cover a diverse range of modeling techniques and scales.

Figure 2 presents human evaluation metrics for our baseline models on the 25% sample of our test set described in §4.1. We first consider gold translations: to validate that our process for collecting reference translations and human ratings elicited the sought-after linguistic variance, we examine whether translations by translators in one speaker group were consistently rated higher by raters in that same group (the “matched” case) than by raters in the alternative speaker group (the “mismatched” case). Indeed, this is the case: on average, matched reference/rater pairs received a significantly lower (better) MQM penalty than mismatched pairs, as measured by a paired t-test. This effect is strongest in the lexical bucket, presumably due to the high rate of region-distinct terms in these sentences. [6]

[6] We note that ratings tend to be somewhat better in the entity bucket, across both matched and mismatched settings. This could indicate that professional translators are less prone to errors when the source is “culturally relevant”, e.g. when translating sentences about an entity native to the area where the target language is spoken.

Moving to model predictions, in Portuguese we find that all models perform better in the region-matched setting, indicating that each model has some ability to localize to Brazil and Portugal. However, in Mandarin, apart from PaLM’s lexical bucket, region match does not lead to MQM gains, suggesting that none of our baseline models is able to use knowledge of the target region to produce a superior translation for Mainland or Taiwan.

Reweighted MQM tells a slightly different story. Recall that this score focuses specifically on errors that correlate with region mismatch in the gold translations. Here, we find that all models score better in the region-matched setting. This indicates that, across all regions, each of our baseline models is at least pushing in the right direction: each is judged as making fewer localization errors when the rater is in the targeted region.

Comparing across models, we find that PaLM performs the best, followed by M4-Prompts and then M4-UR, consistent across both Portuguese and Mandarin. PaLM performs particularly well in the lexical bucket, suggesting that larger models may be better suited to the task of memorizing region-specific lexical variants.

For Mandarin (on the full 25% sample), a large gap remains between expert translations and our baselines. Our results indicate that better region handling will be needed to close this gap: our baselines are not effectively leveraging region information (as evidenced by their small match vs. mismatch deltas), and even the best-case expert translation scores poorly in the mismatched region.⁷

⁷Note that this ignores orthographic differences, as we transliterate outputs to the rater’s script. See §6.5 for analysis of script differences.

For Portuguese, while PaLM gives impressive results, a meaningful gap with expert translation remains, indicating headroom on our task. There is also the important question of whether competitive performance can be achieved with smaller models, which are better suited to production use cases.

Figure 3: MQM (top) and Reweighted MQM (bottom) scores for gold translations and model predictions, broken down by rater region and target region. Lower is better for both metrics. For example, “BR rates PT” indicates Brazilian raters scoring sentences targeted to Portugal.

Figure 3 breaks down scores by rater and target region. As before, in each setting, raters prefer region-matched over mismatched gold translations. For Portuguese, we find that our pt-PT raters were “harder graders” than pt-BR raters, assigning consistently higher MQM penalties in both matched and mismatched settings; by contrast, our Mandarin raters are well calibrated across regions.

One noteworthy finding is that when pt-BR raters rate pt-PT outputs, two models are preferred over the gold translations: M4-Prompts scores best, followed by PaLM, followed by Gold. However, these “good” scores should not be celebrated, as the opposite ranking emerges when the same outputs are presented to pt-PT raters. Thus, the preference for model outputs over gold is almost certainly due to M4-Prompts and (to a lesser extent) PaLM mistakenly using Brazil-specific localizations even when the target is pt-PT.

6.2 Automated metric results

Lexical Entities Random FRMT
Model pt-BR pt-PT pt-BR pt-PT pt-BR pt-PT pt
UR 37.4 (69.9) 32.7 (68.0) 46.7 (76.3) 40.8 (73.6) 39.8 (70.7) 35.3 (69.2) 38.7 (71.3)
M4-UR 46.7 (74.5) 32.7 (69.7) 53.5 (79.9) 45.4 (77.5) 43.1 (70.9) 32.9 (68.4) 42.0 (73.5)
M4-Prompts 54.1 (77.1) 36.9 (72.1) 56.9 (81.1) 47.3 (78.4) 56.1 (77.5) 41.0 (73.7) 48.2 (76.6)
M4-Prompts FT 45.5 (70.1) 32.5 (67.4) 48.6 (73.8) 40.7 (72.8) 48.1 (70.5) 36.9 (69.0) 41.7 (70.6)
PaLM 8B 38.6 (69.8) 26.7 (65.8) 45.9 (75.9) 38.0 (73.6) 39.3 (69.4) 32.1 (67.8) 36.5 (70.4)
PaLM 62B 49.5 (75.9) 36.7 (72.4) 55.4 (80.1) 46.1 (77.8) 50.3 (75.2) 41.5 (73.5) 46.3 (75.8)
PaLM 540B 53.7 (77.1) 40.1 (73.9) 59.0 (81.2) 49.5 (79.0) 54.8 (76.9) 45.6 (75.5) 50.2 (77.3)
Online 56.2 (78.7) 35.6 (72.3) 56.3 (81.2) 46.9 (78.3) 65.2 (80.5) 42.9 (75.0) 49.8 (77.6)
zh-CN zh-TW zh-CN zh-TW zh-CN zh-TW zh
UR 22.6 (58.5) 13.8 (56.0) 26.7 (67.1) 19.5 (65.3) 26.4 (62.1) 20.4 (61.0) 21.3 (61.7)
M4-UR 33.3 (65.0) 18.9 (58.2) 43.2 (73.0) 31.4 (70.4) 40.8 (65.4) 30.8 (63.6) 32.5 (65.9)
M4-Prompts 33.3 (64.9) 18.3 (57.6) 44.2 (72.5) 32.0 (68.7) 43.7 (67.0) 32.2 (63.4) 33.3 (65.6)
M4-Prompts FT 33.8 (65.7) 18.8 (59.0) 44.8 (73.2) 31.6 (69.8) 42.7 (66.7) 31.5 (64.0) 33.2 (66.4)
PaLM 8B 17.6 (55.7) 13.3 (52.3) 28.1 (65.7) 24.4 (63.9) 21.6 (56.3) 18.2 (56.1) 20.4 (58.3)
PaLM 62B 29.2 (62.2) 20.4 (59.8) 40.2 (71.8) 33.0 (69.9) 34.5 (64.0) 26.0 (63.1) 30.3 (65.1)
PaLM 540B 34.8 (66.5) 24.6 (63.3) 44.9 (74.7) 35.2 (72.5) 40.0 (67.8) 29.6 (66.0) 34.5 (68.4)
Online 39.7 (68.0) 21.9 (61.8) 50.4 (75.0) 37.0 (72.2) 56.1 (72.0) 39.9 (68.7) 40.1 (69.6)
Table 3: FRMT per-bucket test set results, in the format BLEU (BLEURT). The “FRMT” score is the geometric mean across regions of the arithmetic mean across buckets.

Table 3 shows the performance of our baseline models on the automated metrics BLEU and BLEURT. The “FRMT” metric is a summary of per-language performance, calculated as the geometric mean across regions of the arithmetic mean across buckets. Recall that while BLEU scores serve as a lightweight point of comparison, we recommend additionally reporting the more costly BLEURT where possible, as it correlates more strongly with human judgments.
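As a concrete illustration, the summary score can be computed as follows. The bucket-level BLEU values are taken from the PaLM 540B Portuguese rows of Table 3; the helper name is our own.

```python
import math

def frmt_score(per_region_bucket_scores):
    """Arithmetic mean over buckets per region, then geometric mean
    across regions, as in the 'FRMT' columns of Table 3."""
    region_means = [sum(s) / len(s) for s in per_region_bucket_scores.values()]
    return math.prod(region_means) ** (1.0 / len(region_means))

# Bucket-level BLEU (lexical, entities, random) for PaLM 540B, Portuguese:
scores = {"pt-BR": [53.7, 59.0, 54.8], "pt-PT": [40.1, 49.5, 45.6]}
print(round(frmt_score(scores), 1))  # 50.2, matching Table 3
```

The geometric mean across regions penalizes models that do well on only the majority region, relative to a simple average.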

As mentioned at the outset, we observe that region-agnostic models have a strong bias toward the “majority” region, i.e. the one with the larger population and presence in web-crawled corpora. This is especially apparent in the lexical bucket, where Online has a +20.6 BLEU gap between pt-BR and pt-PT and a +17.8 gap between zh-CN and zh-TW.

Within the lexical bucket, we note that PaLM outperforms the public Online model in the minority regions (pt-PT and zh-TW), despite being trained in a fully unsupervised manner and not being designed for translation. This highlights that even with minimal region-labeled data (a handful of exemplars), it is possible to make meaningful progress over region-agnostic approaches.

Model pt zh
Gold 98.6 94.4
UR 50.4 50.6
M4-UR 51.2 50.9
M4-Prompts 66.7 50.0
M4-Prompts FT 66.7 51.0
PaLM 8B 85.0 69.0
PaLM 62B 90.4 70.8
PaLM 540B 93.2 83.6
Online 50.0 50.0
Table 4: Lexical accuracy on FRMT test. PaLM outperforms other approaches, while region-agnostic models like Online are guaranteed 50%.

Table 4 shows lexical accuracy performance, assessing whether specific terms receive region-appropriate translations. Here, the PaLM models outperform the alternatives by a wide margin. As even the smallest PaLM model has more than twice the parameters of our trained baselines (3.7B parameters each), this suggests that model capacity is a key ingredient for learning to use region-specific terminology in a few-shot manner. However, there is still a significant gap compared to human performance.
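A minimal sketch of how such a lexical-accuracy check might work is below. The term list and scoring details are illustrative assumptions (the “bus” pair is a standard pt-BR/pt-PT example), not the paper’s exact implementation; the sketch also shows why a region-agnostic model lands at exactly 50%.

```python
# Hypothetical list of region-distinct term pairs for the lexical bucket.
TERMS = {"bus": {"pt-BR": "ônibus", "pt-PT": "autocarro"}}

def lexical_accuracy(outputs_by_region):
    """outputs_by_region: {region: {source_term: translated_sentence}}.

    Scores the fraction of (term, region) pairs whose region-appropriate
    variant appears in the corresponding output sentence.
    """
    hits, total = 0, 0
    for region, outputs in outputs_by_region.items():
        for term, sentence in outputs.items():
            total += 1
            if TERMS[term][region] in sentence:
                hits += 1
    return hits / total

# A region-agnostic model emits the same sentence for both targets,
# so it matches exactly one variant of each pair: 50% accuracy.
same = {"bus": "O ônibus está atrasado."}
print(lexical_accuracy({"pt-BR": same, "pt-PT": same}))  # 0.5
```

Under this scoring, a model that always picks one region’s variant cannot exceed 50%, matching the guarantee noted in the Table 4 caption.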

Interestingly, while the smaller PaLM models outperform our UR and M4 baselines on lexical accuracy, they underperform them on BLEU and BLEURT. This highlights that using region-appropriate terminology is only a small part of the translation task, and that, especially at smaller sizes, models designed specifically for translation clearly have the advantage.

6.3 Mismatched references

All else being equal, a model that has perfectly solved the FRMT task should achieve higher performance when evaluated against a reference in the target region than a reference from a different region. To measure the extent to which this holds for our baseline models, we show the delta between matched and mismatched reference on the test set in Table 5.

Lexical Entities Random FRMT
Model pt-BR pt-PT pt-BR pt-PT pt-BR pt-PT pt
UR 8.4 (2.7) 7.7 (…) 6.9 (1.4) 6.3 (…) 5.2 (0.3) 4.3 (…) 0.2 (0.1)
M4-UR 13.9 (5.2) 13.0 (…) 8.2 (2.5) 7.9 (…) 9.7 (2.9) 9.3 (…) 0.2 (0.1)
M4-Prompts 19.5 (5.7) 13.6 (…) 10.5 (2.9) 7.8 (…) 15.6 (3.4) 12.6 (…) 1.9 (0.6)
M4-Prompts FT 14.8 (4.9) 9.8 (…) 8.7 (2.4) 6.4 (…) 11.8 (2.8) 9.2 (…) 1.6 (0.5)
PaLM 8B 13.6 (5.1) 5.4 (…) 8.6 (2.7) 3.3 (…) 7.6 (1.7) 2.9 (…) 2.8 (0.9)
PaLM 62B 18.0 (6.2) 0.3 (0.5) 11.5 (3.4) 0.3 (…) 11.5 (2.5) 0.9 (…) 6.5 (1.9)
PaLM 540B 20.7 (6.4) 0.2 (0.9) 13.4 (3.7) 0.2 (…) 13.1 (2.9) 0.1 (0.0) 7.7 (2.2)
Online 20.6 (6.4) 20.6 (…) 9.5 (2.8) 9.5 (…) 22.3 (5.5) 22.3 (…) 0.0 (0.0)
zh-CN zh-TW zh-CN zh-TW zh-CN zh-TW zh
UR 4.4 (1.1) 3.4 (…) 5.0 (1.8) 4.9 (…) 3.8 (0.9) 3.0 (…) 0.3 (0.0)
M4-UR 14.5 (7.0) 14.4 (…) 11.7 (2.5) 11.5 (…) 10.0 (1.9) 10.1 (…) 0.0 (0.0)
M4-Prompts 14.4 (5.6) 14.4 (…) 11.7 (2.6) 10.9 (…) 10.7 (2.1) 10.1 (…) 0.1 (0.0)
M4-Prompts FT 14.2 (5.8) 13.5 (…) 11.9 (2.7) 11.1 (…) 9.9 (1.5) 9.6 (…) 0.1 (…)
PaLM 8B 6.2 (4.1) 2.3 (…) 4.0 (1.0) 0.7 (…) 4.5 (0.5) 1.0 (0.1) 1.7 (0.4)
PaLM 62B 12.3 (3.8) 2.9 (…) 8.5 (1.8) 3.0 (…) 8.7 (1.5) 2.6 (0.2) 3.3 (0.9)
PaLM 540B 15.0 (5.0) 0.5 (1.0) 10.2 (2.2) 3.3 (…) 9.9 (2.1) 1.4 (0.2) 4.7 (1.6)
Online 17.8 (6.2) 17.8 (…) 13.4 (2.8) 13.4 (…) 16.3 (3.3) 16.3 (…) 0.0 (0.0)
Table 5: FRMT test set deltas between matched and mismatched references, in the format BLEU (BLEURT). Negative numbers in a minority-region column indicate that the model achieved a higher score when evaluated with respect to the majority region, despite being asked to output for the minority region. The last column shows deltas between FRMT scores evaluated with respect to matched vs. mismatched references.

We observe that, with the exception of PaLM in some evaluation settings, outputs targeting the minority region achieve a higher score when evaluated against references from the majority region. One plausible explanation is that the models overgenerate features of the majority region, struggling to overcome the frequency bias of their training data; overcoming this bias is necessary to fully solve the task.

6.4 Effect of exemplars

Lexical Entities Random FRMT
Exemplars pt-BR pt-PT pt-BR pt-PT pt-BR pt-PT pt
0 0.0 (-1.0) 0.0 (-1.2) 0.0 (-1.4) 0.0 (-1.6) 0.1 (-0.8) 0.0 (-1.3) 0.0 (1.2)
1 51.3 (77.8) 38.2 (74.4) 56.5 (80.8) 47.5 (78.5) 53.7 (76.6) 45.1 (73.8) 48.4 (77.0)
2 51.8 (77.7) 38.8 (75.0) 57.3 (80.8) 46.9 (78.5) 52.9 (76.1) 45.0 (74.4) 48.5 (77.1)
5 52.8 (77.9) 39.6 (75.0) 57.9 (80.8) 46.8 (78.5) 54.5 (76.4) 45.2 (74.6) 49.1 (77.2)
10 54.0 (78.7) 40.6 (75.3) 58.3 (80.9) 48.2 (78.9) 54.6 (76.5) 46.2 (74.9) 50.0 (77.5)
zh-CN zh-TW zh-CN zh-TW zh-CN zh-TW zh
0 0.0 (0.9) 0.0 (3.2) 0.0 (0.8) 0.1 (6.5) 0.1 (1.6) 0.3 (4.3) 0.1 (2.3)
1 37.2 (70.1) 25.7 (67.5) 47.4 (74.2) 37.7 (73.0) 39.7 (67.0) 32.3 (65.5) 36.3 (69.5)
2 37.7 (70.5) 27.0 (67.8) 48.3 (74.8) 38.5 (73.3) 39.3 (67.1) 32.2 (65.7) 36.9 (69.8)
5 37.2 (70.2) 26.4 (67.2) 48.4 (74.7) 39.0 (73.2) 39.9 (66.8) 32.4 (65.4) 36.9 (69.6)
10 37.6 (70.7) 25.5 (67.2) 48.1 (74.9) 38.6 (72.6) 38.9 (66.9) 30.4 (64.7) 36.2 (69.5)
Table 6: FRMT dev set results of PaLM 540B when varying the number of exemplars, in the format BLEU (BLEURT). Across both languages, even one exemplar is sufficient for strong results. In Portuguese, increasing to 10 exemplars gives marginal additional gains.

To test sensitivity to the number and choice of exemplars, we evaluate PaLM 540B while varying the set of exemplars used. Table 6 shows the effect of ablating the number of exemplars over {0, 1, 2, 5, 10}. We observe that a single exemplar is sufficient to achieve strong results, and that gains from additional exemplars are marginal.
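For concreteness, a few-shot prompt for this setting might be assembled as below. The paper does not publish its exact template, so the wording, labels, and helper function are illustrative assumptions.

```python
# Hypothetical few-shot prompt template for region-targeted translation.
def build_prompt(exemplars, source, region_name):
    """exemplars: list of (english, regional_translation) pairs."""
    lines = []
    for src, tgt in exemplars:
        lines.append(f"English: {src}")
        lines.append(f"{region_name}: {tgt}")
    lines.append(f"English: {source}")
    lines.append(f"{region_name}:")  # the model completes this line
    return "\n".join(lines)

prompt = build_prompt(
    [("The bus is late.", "O autocarro está atrasado.")],
    "Where is the train station?",
    "European Portuguese",
)
```

Naming the regional variety in the prompt is what signals the target region; the ablation above corresponds to varying how many (source, translation) pairs precede the final query.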

To measure the variance in performance across exemplar choice, we re-run the PaLM 540B evaluation three times each using either 1 or 10 exemplars, resampling the exemplars on each run. We find that the choice of exemplars has a relatively small effect: in each language, the standard deviations of FRMT-BLEU and FRMT-BLEURT across the four runs remained small, even with just one exemplar.

6.5 Script analysis

Target:zh-CN Target:zh-TW
Model simp trad amb mix simp trad amb mix
Gold 99.8 0.0 0.2 0.0 0.0 99.4 0.4 0.2
UR 98.1 0.1 0.6 1.2 69.7 14.8 0.9 14.7
PaLM 8B 98.0 0.1 0.7 1.3 0.0 98.7 0.9 0.4
PaLM 62B 98.4 0.4 0.5 0.6 0.1 98.8 0.5 0.6
PaLM 540B 99.1 0.0 0.4 0.5 0.0 99.2 0.4 0.4
Table 7: Frequency of script usage (%) of various outputs on the Mandarin portion of the FRMT test set, as detected by the hanzidentifier library (1.0.2). trad: Traditional script. simp: Simplified script. amb: Ambiguous between the two scripts. mix: Mixed use of the two scripts.

We intentionally side-step the script difference between Mainland Mandarin (simplified script) and Taiwan Mandarin (traditional script) by transliterating outputs into the target region’s script. This allows raters to evaluate uniformly in their native script, and allows us to benchmark models even if they were only designed to handle a single script.

Our core FRMT metrics thus ignore script, but we are still interested in testing the ability of models to output the correct target script. Table 7 presents script accuracy results, obtained by passing model outputs to the hanzidentifier Python library. We exclude the M4-based models, as their training data was limited to simplified script.
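The four categories in Table 7 can be illustrated with a toy classifier in the spirit of hanzidentifier; the character sets below are tiny hand-picked samples for illustration, not the library’s actual inventory.

```python
# Tiny illustrative samples of characters unique to each script.
SIMPLIFIED_ONLY = set("软体汉说见国马")
TRADITIONAL_ONLY = set("軟體漢說見國馬")

def classify_script(text):
    """Return 'simp', 'trad', 'mix', or 'amb', mirroring Table 7's columns."""
    has_simp = any(c in SIMPLIFIED_ONLY for c in text)
    has_trad = any(c in TRADITIONAL_ONLY for c in text)
    if has_simp and has_trad:
        return "mix"
    if has_simp:
        return "simp"
    if has_trad:
        return "trad"
    return "amb"  # only characters shared by (or absent from) both scripts

print(classify_script("软件"), classify_script("軟體"))  # simp trad
```

Short sentences composed entirely of characters shared by both scripts fall into the “amb” column, which is why even gold translations are not 100% in the target script.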

As expected, gold translations are nearly 100% in the target script, with the rare exceptions being short sentences that are script-ambiguous, or sentences that explicitly describe how a word is rendered in both scripts. The UR model shows some ability to regulate script, but is heavily biased toward simplified, even when the target is traditional. The PaLM models achieve much higher script accuracy, approaching the gold translations. This indicates that script transfer is likely within the capability of models that have captured other region-specific distinctions.

6.6 Qualitative analysis

Model Target:pt-BR Target:pt-PT
Gold A legalização do casamento entre pessoas do mesmo sexo em Portugal ocorreu em 17 de maio de 2010. O casamento entre pessoas do mesmo sexo foi legalizado em Portugal a 17 de maio de 2010.
PaLM O casamento entre pessoas do mesmo sexo em Portugal foi legalizado em 17 de maio de 2010. O casamento entre pessoas do mesmo sexo em Portugal foi legalizado a 17 de Maio de 2010.
M4-Prompts O casamento entre pessoas do mesmo sexo em Portugal foi legalizado em 17 de maio de 2010. O casamento entre pessoas do mesmo sexo em Portugal foi legalizado a 17 de maio de 2010.
M4-UR O casamento homoafetivo em Portugal foi legalizado em 17 de Maio de 2010. O casamento homoafetivo em Portugal foi legalizado a 17 de Maio de 2010.
Table 8: Gold and model outputs for the source: Same-sex marriage in Portugal was legalized on 17 May 2010. Phenomena of interest are bolded.
Model Target:zh-CN Target:zh-TW
Gold 并非所有的软件缺陷都是由代码错误导致的。 並非所有軟體缺陷都是因程式碼錯誤所導致。
PaLM 并不是所有的软件缺陷都是由编码错误造成的。 並不是所有的軟體缺陷都是由程式錯誤所造成。
M4-Prompts 并非所有的软件缺陷都是由编码错误引起的。 並非所有的軟件缺陷是由編碼錯誤引起的。
M4-UR 并非所有的软件缺陷都是由编码错误引起的。 並非所有的軟件缺陷都是由編碼錯誤引起的。
Table 9: Gold and model outputs for the source: Not all software defects are caused by coding errors. Phenomena of interest are bolded, and region-specific errors are in red. Note, M4-based model zh-TW outputs have been transliterated to traditional script, matching our evaluation setting.

To provide additional insight into regional differences and model behavior, we manually inspect dev set gold translations and model outputs across the models sent to human evaluation. In both languages, we observe regional differences beyond just the lexical items underlying our lexical bucket. For instance, in Table 8 and similar examples, we find that English “on <date>” phrases tend to be translated with differing prepositions: “em” in pt-BR and “a” in pt-PT. As another example, in Table 9, we observe that both gold and PaLM outputs use the term 程式 (chéngshì, en: program) only in zh-TW when translating the phrase “coding errors”.

In many cases, PaLM uses the expected region-specific lexical forms, as already reflected in our lexical accuracy metrics. By contrast, we observe the M4-based models are more prone to use terms from the “web majority” region (pt-BR and zh-CN) irrespective of the target. For example, in Table 9, PaLM matches gold translations in using the region-specific terms for software—zh-CN: 软件 (ruǎnjiàn), zh-TW: 軟體 (ruǎntǐ)—while the M4-based models use the zh-CN term throughout (simplified: 软件, traditional: 軟件).

7 Conclusion

In this paper, we introduced FRMT, a new benchmark for evaluating few-shot region-aware machine translation. Our dataset covers two regional varieties each of Portuguese and Mandarin, and enables fine-grained comparison across region-matched and mismatched conditions, and across different targeted buckets (lexical, entity, random).

While we found the large-scale generalist model PaLM 540B to show impressive few-shot region control, there is still significant room for improvement. None of the models we evaluated match human performance, and the gap is particularly large in Mandarin. Additionally, there is an open research question as to whether robust few-shot regional control can be achieved at more modest model scales.

We are eager to see progress on FRMT, as methods that do well in this few-shot setting are likely to be easily extensible to other regions and styles. We anticipate that the flexibility to adapt to new output styles in the absence of extensive labeled data will be a key factor in making generative text models more useful, inclusive and equitable.


Acknowledgments

For helpful discussion and comments, we thank Jacob Eisenstein, Noah Fiedel, Macduff Hughes and Mingfei Lau. For feedback around regional differences, we thank Andre Araujo, Chung-Ching Chang, Andreia Cunha, Filipe Gonçalves, Nuno Guerreiro, Mandy Guo, Luis Miranda, Vitor Rodrigues and Linting Xue.