
Multilingual Relation Classification via Efficient and Effective Prompting

by   Yuxuan Chen, et al.

Prompting pre-trained language models has achieved impressive performance on various NLP tasks, especially in low data regimes. Despite the success of prompting in monolingual settings, applying prompt-based methods in multilingual scenarios has been limited to a narrow set of tasks, due to the high cost of handcrafting multilingual prompts. In this paper, we present the first work on prompt-based multilingual relation classification (RC), by introducing an efficient and effective method that constructs prompts from relation triples and involves only minimal translation for the class labels. We evaluate its performance in fully supervised, few-shot and zero-shot scenarios, and analyze its effectiveness across 14 languages, prompt variants, and English-task training in cross-lingual settings. We find that in both fully supervised and few-shot scenarios, our prompt method beats competitive baselines: fine-tuning XLM-R_EM and null prompts. It also outperforms the random baseline by a large margin in zero-shot experiments. Our method requires little in-language knowledge and can be used as a strong baseline for similar multilingual classification tasks.





1 Introduction

Relation classification (RC) is a crucial task in information extraction (IE), aiming to identify the relation between entities in a text alt_2019_fine. Extending RC to multilingual settings has recently received increased interest zou_2018_adversarial; kolluru_2022_alignment, but the majority of prior work still focuses on English baldini-soares_2019_matching; lyu-chen_2021_relation. A main bottleneck for multilingual RC is the lack of supervised resources, comparable in size to large English datasets riedel_2010_modeling; zhang_2017_tacred. The SMiLER dataset seganti_2021_smiler provides a starting point to test fully supervised and more efficient approaches due to different resource availability for different languages. Previous studies have shown the promising performance of prompting PLMs compared to the data-hungry fine-tuning, especially in low-resource scenarios gao_2021_making; le-scao-rush_2021_many; lu_2022_fantastically. Multilingual pre-trained language models conneau_2020_unsupervised; xue_2021_mt5 further enable multiple languages to be represented in a shared semantic space, thus making prompting in multilingual scenarios feasible. However, the study of prompting for multilingual tasks so far remains limited to a small range of tasks such as text classification winata_2021_language and natural language inference lin_2022_few. To our knowledge, the effectiveness of prompt-based methods for multilingual RC is still unexplored. To analyse this gap, we pose two research questions for multilingual RC with prompts:
RQ1. What is the most effective way to prompt? We investigate whether prompting should be done in English or the target language and whether to use soft prompt tokens.
RQ2. How well do prompts perform in different data regimes and languages? We investigate the effectiveness of our prompting approach in three scenarios: fully supervised, few-shot and zero-shot. We explore to what extent the results are related to the available language resources.

Figure 1: Overview of our approach. Given a plain text $x$ containing head entity $e_h$ and tail entity $e_t$ from language $L$, we first apply the template ‘‘$x$ $e_h$ ____ $e_t$’’ and yield the prompt input $x_{\text{prompt}}$ with a blank. Then the PLM aims to fill in the relation at the blank. In code-switch prompting, the target sequence is the English relation verbalization. In in-language prompting, the target is the relation name translated into $L$.

We present an efficient and effective prompt method for multilingual RC (see Figure 1) that derives prompts from relation triplets (see Section 3.1). The derived prompts include the original sentence and entities and are supposed to be filled with the relation label. We evaluate the prompts with three variants, two of which require no translation, and one of which requires minimal translation, i.e., of the relation labels only. We find that our method outperforms fine-tuning and a strong task-agnostic prompt baseline in fully supervised and few-shot scenarios, especially for relatively low-resource languages. Our method also improves over the random baseline in zero-shot settings, and achieves promising cross-lingual performance. The main contributions of this work hence are:

  • We propose a simple but efficient prompt method for multilingual RC, which is, to the best of our knowledge, the first work to apply prompt-based methods to multilingual RC (Section 3).

  • We evaluate our method on the largest multilingual RC dataset, SMiLER seganti_2021_smiler, and compare our method with strong baselines in all three scenarios. We also investigate the effects of different prompt variants, including insertion of soft tokens, prompt language, and the word order of prompting (Sections 4 and 5).

2 Preliminaries

We first give a formal definition of the relation classification task, and then introduce fine-tuning and prompting paradigms to perform RC.

2.1 Relation Classification Task Definition

Relation classification is the task of classifying the relationship, such as date_of_birth, founded_by or parents, between pairs of entities in a given context. Formally, given a relation set $\mathcal{R}$ and a text $x = [x_1, \dots, x_n]$ (where $x_1, \dots, x_n$ are tokens) with two disjoint spans $e_h$ and $e_t$ denoting the head and tail entity, RC aims to predict the relation $r \in \mathcal{R}$ between $e_h$ and $e_t$, or give a no_relation prediction if no relation in $\mathcal{R}$ holds. RC is a multilingual task if the token sequences come from different languages.
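To make the task definition concrete, the sketch below shows one possible in-memory representation of an RC instance. The class name `RCExample`, its field names, the half-open token-span convention, and the label string `has-author` are illustrative assumptions and not part of the paper or of the SMiLER data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RCExample:
    """Hypothetical container for one relation classification instance."""
    tokens: List[str]              # the text x as a token sequence
    head: Tuple[int, int]          # half-open token span [start, end) of the head entity
    tail: Tuple[int, int]          # half-open token span [start, end) of the tail entity
    relation: str = "no_relation"  # gold label, or no_relation if none holds
    lang: str = "en"               # language of the text

    def __post_init__(self):
        (hs, he), (ts, te) = self.head, self.tail
        n = len(self.tokens)
        if not (0 <= hs < he <= n and 0 <= ts < te <= n):
            raise ValueError("entity span out of range")
        # The definition requires the two spans to be disjoint.
        if hs < te and ts < he:
            raise ValueError("head and tail spans must be disjoint")

    def span_text(self, span: Tuple[int, int]) -> str:
        """Return the surface string covered by a token span."""
        return " ".join(self.tokens[span[0]:span[1]])


# The German example used in Table 1: Faust (head) has author Goethe (tail).
example = RCExample(
    tokens=["Goethe", "schrieb", "Faust", "."],
    head=(2, 3),
    tail=(0, 1),
    relation="has-author",  # hypothetical label name
    lang="de",
)
```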

2.2 Fine-tuning for Relation Classification

In fine-tuning, a task-specific linear classifier is added on top of the PLM. Fine-tuning hence introduces a different scenario from pre-training, since language model (LM) pre-training is usually formalized as a cloze-style task to predict target tokens at [MASK] devlin_2019_bert; liu_2019_roberta or a corrupted span raffel_2020_t5; lewis_2020_bart. For the RC task, the classifier aims to predict the target class at [CLS] or at the entity spans denoted by Marker  baldini-soares_2019_matching.

2.3 Prompting for Relation Classification

| Method | Prompt input | Target | Example input | Example target |
|---|---|---|---|---|
| null prompts | $x$ ____ | $v(r)$ | Goethe schrieb Faust. ____ | has author |
| CS | $x$ $e_h$ ____ $e_t$ | $v(r)$ | Goethe schrieb Faust. Faust ____ Goethe | has author |
| SP | $x$ [v1]$e_h$ [v2]____ [v3]$e_t$ | $v(r)$ | Goethe schrieb Faust. [v1]Faust [v2]____ [v3]Goethe | has author |
| IL | $x$ $e_h$ ____ $e_t$ | $v_L(r)$ | Goethe schrieb Faust. Faust ____ Goethe | hat Autor |

Table 1: Overview of the prompts, including null prompts (baseline), and ours with its variants. For each prompt or its variant, we list (1) the prompt input and the target; (2) an example based on the plain text in German ‘‘Goethe schrieb Faust.’’ [vi]: learnable soft tokens. $v(r)$: the original (English) relation verbalization. $v_L(r)$: the relation verbalization translated into the target language $L$.

Prompting is proposed to bridge the gap between pre-training and fine-tuning liu_2021_pre-train; gu_2022_ppt. The essence of prompting is, by appending extra text to the original text according to a task-specific template $T$, to reformulate the downstream task to an LM pre-training task such as masked language modeling (MLM), and apply the same training objective during the task-specific training. For the RC task, to identify the relation between ‘‘Angela Merkel’’ and ‘‘Joachim Sauer’’ in the text ‘‘Angela Merkel’s current husband is quantum chemist Joachim Sauer,’’ an intuitive template for prompting can be ‘‘The relation between Angela Merkel and Joachim Sauer is [MASK],’’ and the LM is supposed to assign a higher likelihood to the term couple than to e.g. friends or colleagues at [MASK]. This ‘‘fill-in-the-blank’’ paradigm is well aligned with the pre-training scenario, and enables prompting to better coax the PLMs for pre-trained knowledge petroni_2019_language.

3 Methods

We now present our method, as shown in Figure 1. We introduce its template and verbalizer, and propose several variants of the prompt. Lastly, we explain the training and inference process.

3.1 Template

For prompting liu_2021_pre-train, a prompt often consists of a template $T$ and a verbalizer $v$. Given a plain text $x$, the template $T$ adds task-related instruction to $x$ to yield the prompt input

$$x_{\text{prompt}} = T(x).$$

Following chen_2021_knowledge and han_2021_ptr, we treat relations as predicates and use the cloze ‘‘$e_h$ {relation} $e_t$’’ for the LM to fill in. Our template is formulated as

$$T(x) := x \;\; e_h \;\; \_\_\_\_ \;\; e_t.$$

In the template $T(x)$, $x$ is the original text and the two entities $e_h$ and $e_t$ come from $x$. Therefore, our template does not introduce extra tokens, thus involves no translation at all.
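The template can be illustrated with a few lines of string handling. This is a minimal sketch: the function names and the literal `____` blank are assumptions made for illustration, and with a T5-style model such as mT5 the blank would presumably be replaced by a sentinel token such as `<extra_id_0>`.

```python
BLANK = "____"  # placeholder for the slot the PLM should fill


def apply_template(text: str, head: str, tail: str, blank: str = BLANK) -> str:
    """Append 'head ____ tail' to the original text, reusing only words from the text."""
    return f"{text} {head} {blank} {tail}"


def null_prompt(text: str, blank: str = BLANK) -> str:
    """Task-agnostic baseline prompt: the text followed directly by the blank."""
    return f"{text} {blank}"


prompt = apply_template("Goethe schrieb Faust.", head="Faust", tail="Goethe")
baseline = null_prompt("Goethe schrieb Faust.")
print(prompt)
print(baseline)
```

Because the head and tail strings are copied from the input text, the prompt stays in the language of the text and nothing has to be translated to build it.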

3.2 Verbalizer

After being prompted by $x_{\text{prompt}}$, the PLM predicts the masked text $y$ at the blank. To complete an NLP classification task, a verbalizer $v$ is required to bridge the set of labels $\mathcal{R}$ and the set of predicted texts (verbalizations $\mathcal{V}$). For the simplicity of our prompt, we use the one-to-one verbalizer:

$$v: \mathcal{R} \to \mathcal{V}, \quad r \mapsto v(r),$$

where $r$ is a relation, and $v(r)$ is the simple verbalization of $r$. $v(\cdot)$ normally only involves splitting $r$ by ‘‘-’’ or ‘‘_’’ and replacing abbreviations such as org with organization. E.g., the relation org-has-member corresponds to the verbalization ‘‘organization has member’’. Then the prediction is formalized as

$$p(r \mid x) = \frac{p(y = v(r) \mid x_{\text{prompt}}; \theta)}{\sum_{r' \in \mathcal{R}} p(y = v(r') \mid x_{\text{prompt}}; \theta)},$$

where $\theta$ denotes the parameters of model $\mathcal{M}$. $p(r \mid x)$ is normalized by the likelihood sum over all relations.
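The verbalizer described above is simple enough to sketch directly. The abbreviation table below contains only the single `org` example given in the text; any further entries, as well as the function names, would be assumptions. The `normalize` helper mirrors the statement that the prediction is normalized by the likelihood sum over all relations.

```python
import re
from typing import Dict

# Only 'org' is taken from the text; other abbreviations would have to be added.
ABBREVIATIONS = {"org": "organization"}


def verbalize(relation: str) -> str:
    """Split a label on '-' or '_' and expand known abbreviations."""
    words = [w for w in re.split(r"[-_]", relation) if w]
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)


def normalize(likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Divide each relation's likelihood by the sum over all relations."""
    total = sum(likelihoods.values())
    return {rel: value / total for rel, value in likelihoods.items()}


print(verbalize("org-has-member"))
print(normalize({"has-author": 0.6, "org-has-member": 0.2}))
```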

3.3 Variants

| Task | Dataset | #Class | Verbalizations | # Tokens in verb. (mean) | # Tokens in verb. (std.) |
|---|---|---|---|---|---|
| LA | CoLA warstadt_2019_cola | 2 | correct, incorrect. gao_2021_making | 1 | 0 |
| NER | CoNLL03 sang_2003_conll | 5 | location, person, not an, ... cui_2021_template | 1.2 | 0.4 |
| NLI | MNLI williams_2018_mnli | 3 | yes, no, maybe. fu_2022_polyglot | 1 | 0 |
| NLI | XNLI conneau_2018_xnli | 3 | yes, no, maybe; Evet, ... zhao_2021_discrete | 1 | 0 |
| PI | PAWS-X yang_2019_paws | 2 | yes, no. qi_2022_enhancing | 1 | 0 |
| TC | MARC keung_2020_marc | 2 | good, {average, bad}. huang_2022_zero | 1 | 0 |
| RC | TACRED zhang_2017_tacred | 42 | founded by, city of birth, country of death, ... | 3.23 | 1.99 |
| RC | SemEval hendrickx_2010_semeval | 10 | cause effect, entity origin, product producer, ... | 2.50 | 0.81 |
| RC | NYT riedel_2010_modeling | 24 | ethnicity, major shareholder of, religion, ... | 2.10 | 1.01 |
| RC | SciERC luan_2018_scierc | 6 | conjunction, feature of, part of, used for, ... | 2.17 | 0.69 |
| RC | SMiLER (EN) seganti_2021_smiler | 36 | birth place, starring, won award, ... | 2.58 | 0.68 |
| RC | SMiLER (ALL) seganti_2021_smiler | 36 | hat Genre, chef d’organisation, del país, ... | 3.66 | 1.44 |

Table 2: Statistics of the lengths of the verbalizations over several classification tasks. The lengths for non-RC tasks depend on the tokenizers from the respective PLMs in the cited work. The lengths for RC tasks are based on the mT5_Base tokenizer. Mean and std. show that the label space of the RC task is more complex than most few-class classification tasks. The verbalizations of RC datasets are listed in Appendix B. For SemEval, the two possible directions of a relation are combined. For NYT, we use the version from zeng_2018_extracting. For SMiLER, "EN" is the English split; "ALL" contains all data from 14 languages.

To find the optimal way to prompt, we investigate three variants as follows.

Hard prompt vs soft prompt (SP)   Hard prompts (a.k.a. discrete prompts) liu_2021_pre-train are entirely formulated in natural language. Soft prompts (a.k.a. continuous prompts) consist of learnable tokens lester_2021_power that are not contained in the PLM vocabulary. Following han_2021_ptr, we insert soft tokens before entities and blanks as shown for SP in Table 1.

Code-switch (CS) vs in-language (IL)   Relation labels are in English across almost all RC datasets. Given a prompt input in a non-English language $L$ with a blank, the recovered text is code-mixed after being completed with an English verbalization, corresponding to code-switch prompting. It is probably more reasonable for the PLM to fill in the blank in language $L$. Inspired by lin_2022_few and zhao_2021_discrete, we machine-translate the English verbalizers into the other languages. (See Appendix B for more examples of translated verbalizations. To translate the verbalizer of the SMiLER dataset, we use DeepL by default and Google Translate when the target language is not supported by DeepL, which is the case for AR, FA, KO and UK.) Table 1 visualizes both code-switch (CS) and in-language (IL) prompting. For English, CS and IL prompting are equivalent, since $L$ is English itself.

Word order of prompting   For the RC task, head-relation-tail triples involve three elements. Therefore, deriving natural language prompts from them requires deciding where to put the predicate (relation). In the case of SOV languages, filling in a relation that occurs between $e_h$ and $e_t$ seems less intuitive. Therefore, to investigate whether the word order of prompting affects prediction accuracy, we swap the entities and the blank in the SVO-template ‘‘$x$ $e_h$ ____ $e_t$’’ and get ‘‘$x$ $e_h$ $e_t$ ____’’ as the SOV-template.
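The three variants can be sketched as options of a single prompt builder. Everything below is illustrative: the function names, the `TRANSLATIONS` lookup, and in particular the placement of soft tokens in the SOV order are assumptions, since the text only shows soft tokens for the SVO order (Table 1).

```python
from typing import Dict


def build_prompt(text: str, head: str, tail: str,
                 soft: bool = False, order: str = "SVO",
                 blank: str = "____") -> str:
    """Build the prompt input for the hard/soft and SVO/SOV variants."""
    if order not in ("SVO", "SOV"):
        raise ValueError("order must be 'SVO' or 'SOV'")
    h, b, t = head, blank, tail
    if soft:
        # Soft tokens [v1], [v2], [v3] precede the head, the blank and the tail.
        h, b, t = f"[v1]{h}", f"[v2]{b}", f"[v3]{t}"
    parts = [h, b, t] if order == "SVO" else [h, t, b]
    return " ".join([text] + parts)


# Hypothetical translation table; 'hat Autor' is the German example from Table 1.
TRANSLATIONS: Dict[str, Dict[str, str]] = {"de": {"has author": "hat Autor"}}


def target_verbalization(verbalization_en: str, lang: str,
                         translations: Dict[str, Dict[str, str]],
                         in_language: bool) -> str:
    """Code-switch keeps the English target; in-language uses the translation."""
    if in_language and lang != "en":
        return translations[lang][verbalization_en]
    return verbalization_en


text = "Goethe schrieb Faust."
print(build_prompt(text, "Faust", "Goethe"))
print(build_prompt(text, "Faust", "Goethe", soft=True))
print(build_prompt(text, "Faust", "Goethe", order="SOV"))
print(target_verbalization("has author", "de", TRANSLATIONS, in_language=True))
```

Note that only the target changes between CS and IL prompting; the prompt input is identical, which is why the translation effort is limited to the label verbalizations.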

3.4 Training and Inference

The training and inference setups depend on the employed model. Prompting autoencoding language models requires the verbalizations to be of fixed length, since the length of masks, which is identical with verbalization length, is unknown during inference. Encoder-decoders can handle verbalizations of varying length by nature han-2022-genpt; du_2021_all. han_2021_ptr adjust all the verbalizations in TACRED to a length of 3, to enable prompting with RoBERTa for RC. We argue that for multilingual RC, this fix is largely infeasible, because: (1) in case of in-language prompting on SMiLER, the standard deviation of the verbalization length increases from 0.68 to 1.44 after translation (see Table 2), and surpasses that of most of the listed monolingual RC datasets (SemEval, NYT and SciERC), making it harder to unify the length; (2) adjusting the translated prompts requires manual effort per target language, making it much more expensive than adjusting only English verbalizations. Therefore, we suggest using an encoder-decoder PLM for prompting song_2022_clip.

Training objective   For an encoder-decoder PLM $\mathcal{M}$, given the prompt input $T(x)$ and the target sequence $y$ (i.e. label verbalization), we denote the output sequence as $\hat{y}$. The probability of an exact-match decoding is calculated as follows:

$$P(\hat{y} = y \mid T(x); \theta) = \prod_{t=1}^{|y|} P(\hat{y}_t = y_t \mid \hat{y}_{<t}, T(x); \theta),$$

where $\hat{y}_t$, $y_t$ denote the $t$-th token of $\hat{y}$ and $y$, respectively. $\hat{y}_{<t}$ denotes the decoded sequence on the left. $\theta$ represents the set of all the learnable parameters, including those of the PLM $\theta_{\mathcal{M}}$, and those of the soft tokens $\theta_{\text{sp}}$ in case of variant ‘‘soft prompt’’. Hence, the final objective over the training set $\mathcal{D}$ is to minimize the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{(x, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log P(\hat{y}_t = y_t \mid \hat{y}_{<t}, T(x); \theta).$$

Inference   We collect the output logits of the decoder, $\mathbf{Z} \in \mathbb{R}^{N \times m}$, where $N$ is the vocabulary size of $\mathcal{M}$, and $m$ is the maximum decode length. For each relation $r$, its score is given by han-2022-genpt:

$$s(r) = \frac{1}{|v(r)|} \sum_{t=1}^{|v(r)|} P_t(v_t(r)),$$

where we compute $P_t(v_t(r))$ by looking up $v_t(r)$, the $t$-th token of the verbalization $v(r)$, in the $t$-th column of $\mathbf{Z}$ and applying softmax at each time step $t$. We aggregate by addition to encourage partial matches as well, instead of enforcing exact matches. The score is normalized by the length of verbalization in order to avoid predictions favoring longer relations. Finally, we select the relation with the highest score as prediction.
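The inference rule can be sketched with NumPy. This is an illustrative reading of the description, not the authors' implementation: the function names, the toy logits, and the token ids are made up, and the logits are assumed to be laid out as (vocabulary size, decode length).

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = 0) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def relation_scores(logits: np.ndarray, verbalization_ids: dict) -> dict:
    """Length-normalized sum of per-step token probabilities for each relation.

    logits: array of shape (vocab_size, max_decode_len).
    verbalization_ids: maps a relation to the token ids of its verbalization.
    """
    probs = softmax(logits, axis=0)  # softmax over the vocabulary at each step
    scores = {}
    for relation, ids in verbalization_ids.items():
        steps = np.arange(len(ids))
        # Pick the probability of the t-th verbalization token at step t.
        scores[relation] = float(probs[ids, steps].sum() / len(ids))
    return scores


def predict(logits: np.ndarray, verbalization_ids: dict) -> str:
    """Return the relation with the highest score."""
    scores = relation_scores(logits, verbalization_ids)
    return max(scores, key=scores.get)


# Toy setup: vocabulary of 5 tokens, decode length 3.
toy_logits = np.zeros((5, 3))
toy_logits[1, 0] = 5.0  # the decoder strongly prefers token 1 at step 0
toy_logits[2, 1] = 5.0  # and token 2 at step 1
toy_verbalizations = {"rel_a": [1, 2], "rel_b": [3], "rel_c": [1, 4]}

toy_scores = relation_scores(toy_logits, toy_verbalizations)
print(predict(toy_logits, toy_verbalizations))
```

In this toy case `rel_c` shares only its first token with the preferred output and still scores above `rel_b`, which shares none. That is the partial-match behaviour that aggregation by addition is meant to encourage.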

4 Experiments

We implement our experiments using the Hugging Face Transformers library wolf_2020_transformers, Hydra yadan_2019_hydra and PyTorch paszke_2019_pytorch. We make our code publicly available for better reproducibility.

We use micro-F1 as the evaluation metric, as the SMiLER paper seganti_2021_smiler suggests. To measure the overall performance over multiple languages, we report the macro average across languages, following zhao_2021_discrete and lin_2022_few. We also group the languages by their available resources in both pre-training and fine-tuning datasets for additional aggregate results. Details of the dataset, the models, and the experimental setups are as follows. Further experimental details are listed in Appendix A.
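The two aggregation levels, micro-F1 within a language and a macro average across languages, can be sketched as follows. The optional `negative_label` argument, which excludes a class such as no_relation from the positive classes, is an assumption borrowed from common RC practice; the text does not state how SMiLER treats that class.

```python
from statistics import mean
from typing import Dict, Optional, Sequence


def micro_f1(gold: Sequence[str], pred: Sequence[str],
             negative_label: Optional[str] = None) -> float:
    """Micro-averaged F1 in percent over all classes except `negative_label`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != negative_label)
    n_pred = sum(1 for p in pred if p != negative_label)
    n_gold = sum(1 for g in gold if g != negative_label)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)


def macro_average(per_language: Dict[str, float]) -> float:
    """Unweighted mean of per-language scores."""
    return mean(per_language.values())


gold = ["a", "b", "a", "c"]
pred = ["a", "b", "c", "c"]
print(micro_f1(gold, pred))
print(micro_f1(gold, pred, negative_label="c"))
print(macro_average({"de": 80.0, "fa": 60.0}))
```

When every instance receives exactly one label and no class is excluded, micro-F1 coincides with accuracy. The macro average then weights each language equally, regardless of its test set size.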

| Lang. | #Class | #Train (K) | Max. length | mT5 pre-train tokens (B) | XLM-R pre-train tokens (B) |
|---|---|---|---|---|---|
| AR | 9 | 9.3 | 74 | 57 | 2.9 |
| DE | 22 | 51.5 | 84 | 347 | 10.3 |
| EN | 36 | 267.6 | 110 | 2733 | 55.6 |
| ES | 21 | 11.1 | 70 | 433 | 9.4 |
| FA | 8 | 2.6 | 93 | 52 | 13.3 |
| FR | 22 | 60.9 | 83 | 318 | 9.8 |
| IT | 22 | 74.0 | 86 | 162 | 5.0 |
| KO | 28 | 18.7 | 95 | 26 | 5.6 |
| NL | 22 | 38.9 | 76 | 73 | 5.0 |
| PL | 21 | 16.8 | 86 | 130 | 6.5 |
| PT | 22 | 43.3 | 82 | 146 | 8.4 |
| RU | 8 | 6.4 | 69 | 713 | 23.4 |
| SV | 22 | 4.5 | 84 | 45 | 0.08 |
| UK | 7 | 1.0 | 65 | 41 | 0.006 |

Table 3: Statistics of the 14 languages in the SMiLER dataset, including the number of classes, the number of training examples (in thousands), and the maximum text length over train and test splits. Appended to the table are the sizes (in billion tokens) of pre-training corpora of the referred languages for mT5 and XLM-R, respectively.

4.1 Dataset

We conduct an experimental evaluation of our multilingual prompt methods on the SMiLER seganti_2021_smiler dataset, which contains 1.1M annotated texts across 14 languages. (Note that SMiLER contains 3 versions of the English split: en (268K training examples), en-small (36K) and en-full (744K). We use the en version by default, unless specified otherwise.) Table 3 lists the main statistics of the different languages in the SMiLER dataset. Note that languages have a varying number of relations, mostly related to how many samples are present. We do not evaluate other datasets because the only prior multilingual RC dataset that fits our task, RELX koksal-ozgur_2020_relx, contains only 502 parallel examples in 5 languages.

Figure 2: Pre-training and fine-tuning dataset size by language. Four language groups are distinguishable: English (green) has by far the largest dataset, and many other European languages (orange) have large datasets for pre-training and fine-tuning. The three non-European languages (blue) have either less pre-training or less fine-tuning data, and the lowest-resource languages are Swedish and Ukrainian (yellow).

Grouping of the languages   We visualize the languages in Figure 2 based on the sizes of RC training data, but include the pre-training data as well, to give a more comprehensive overview of the availability of resources for each language. We divide the 14 languages into 4 groups, according to the detectable clusters in Figure 2 and language origins.

4.2 Model

For prompting, we use mT5_Base xue_2021_mt5, an encoder-decoder PLM that supports 101 languages, including all languages in SMiLER. mT5_Base has about 580M parameters.

4.3 Baselines

EN(B) seganti_2021_smiler   EN(B) is the baseline proposed together with the SMiLER dataset. They fine-tune BERT_Base on the English training split and report the micro-F1 on the English test split. BERT_Base has 110M parameters.

XLM-R_EM   To provide a fine-tuning baseline, we re-implement BERT_EM baldini-soares_2019_matching with the Entity Start variant. (We also open-source our implementation of XLM-R_EM.) In this method, the top-layer representations at the starts of the two entities are concatenated for linear classification. To adapt BERT_EM to multilingual tasks, we change the PLM from BERT to a multilingual autoencoder, XLM-R_Base conneau_2020_unsupervised, and refer to this model as XLM-R_EM. XLM-R_Base has 125M parameters.

Null prompts logan_2021_cutting   To better verify the effectiveness of our method, we implement null prompts as a strong task-agnostic prompt baseline. Null prompts involve minimal prompt engineering by directly asking the LM about the relation, without giving any task instruction (see Table 1). logan_2021_cutting show that null prompts surprisingly achieve on-par performance with handcrafted prompts on many tasks. For best comparability, we use the same PLM mT5_Base.
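The classification head of the fine-tuning baseline, as described above, concatenates the top-layer vectors at the two entity start positions and applies a linear layer. The NumPy sketch below uses random stand-ins for the encoder output and the classifier weights; the function name, shapes, and positions are illustrative assumptions rather than the released implementation.

```python
import numpy as np


def entity_start_logits(hidden: np.ndarray, head_start: int, tail_start: int,
                        weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Linear classification over the concatenated entity-start representations.

    hidden: (seq_len, d) top-layer encoder states.
    weight: (num_relations, 2 * d); bias: (num_relations,).
    """
    pair = np.concatenate([hidden[head_start], hidden[tail_start]])  # shape (2 * d,)
    return weight @ pair + bias


rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))  # stand-in for encoder output: 6 tokens, d = 4
weight = rng.normal(size=(3, 8))  # 3 relations, input size 2 * d
bias = np.zeros(3)

logits = entity_start_logits(hidden, head_start=1, tail_start=4, weight=weight, bias=bias)
predicted_class = int(np.argmax(logits))
print(logits.shape, predicted_class)
```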

4.4 Fully Supervised Setup

We evaluate the performance of XLM-R_EM, null prompts, and our method on each of the 14 languages, after training on the full train split from that language. The prompt input and target of null prompts and our prompts are listed in Table 1. We employ the randomly generated seed 319 for all the evaluated methods. For XLM-R_EM, we follow baldini-soares_2019_matching and set the batch size to be 64, the optimizer to be Adam with the learning rate

and the number of epochs to be 5. For null prompts and ours, we use AdamW as the optimizer with the learning rate

, as zhang_2022_downstream suggest for most of the sequence-to-sequence tasks, the number of epochs to 5, and batch size to 16. The maximum sequence length is 256 for all methods.

4.5 Few-shot Setup

Few-shot learning is normally cast as a $K$-shot problem, where $K$ labelled examples per class are available. We follow chen_2021_knowledge and han_2021_ptr, and evaluate on 8, 16 and 32 shots. The few-shot training set is generated by randomly sampling $K$ instances per relation from the training split. The test set is the original test split from that language. We follow gao_2021_making and sample another $K$-shot set from the English train split as the validation set. We tune hyperparameters on this validation set for the English task, and apply these to all languages. We evaluate the same methods as in the fully supervised scenarios, but repeat 5 runs as suggested in gao_2021_making, and report the mean and standard deviation of micro-F1. We use a fixed set of random seeds {13, 36, 121, 223, 319} for data generation and training across the 5 runs. For

XLM-R_EM, we use the same hyperparameters as baldini-soares_2019_matching, a batch size of 256, and a learning rate of . For null prompts and our prompts, we set the learning rate to , batch size to 16, and the number of epochs to 20.
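The K-shot sampling procedure can be sketched with the standard library. The seed list is the one given in the text; the function name, the (text, relation) tuple format, and the choice to keep all examples of a relation that has fewer than K instances are assumptions.

```python
import random
from collections import defaultdict
from typing import List, Tuple

SEEDS = [13, 36, 121, 223, 319]  # the five seeds listed in the text

Example = Tuple[str, str]  # (text, relation); a simplified, hypothetical format


def sample_k_shot(examples: List[Example], k: int, seed: int) -> List[Example]:
    """Draw up to k examples per relation, reproducibly for a given seed."""
    by_relation = defaultdict(list)
    for example in examples:
        by_relation[example[1]].append(example)
    rng = random.Random(seed)
    subset = []
    for relation in sorted(by_relation):  # fixed iteration order for determinism
        pool = by_relation[relation]
        # Assumption: relations with fewer than k examples contribute all of them.
        subset.extend(rng.sample(pool, min(k, len(pool))))
    return subset


train = (
    [(f"text {i}", "birth place") for i in range(10)]
    + [(f"text {i}", "starring") for i in range(10, 20)]
    + [(f"text {i}", "won award") for i in range(20, 22)]
)
runs = {seed: sample_k_shot(train, k=8, seed=seed) for seed in SEEDS}
print({seed: len(subset) for seed, subset in runs.items()})
```

Each seed yields one few-shot training set; the reported mean and standard deviation would then be computed over the five resulting runs.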

4.6 Zero-shot Setup

We consider two scenarios for zero-shot multilingual relation classification.

Zero-shot in-context learning   Following kojima_2022_large, we investigate whether PLMs are also decent zero-shot reasoners for RC. This scenario does not require any samples or training. We test the out-of-the-box performance of the PLM by directly prompting it with the prompt input $T(x)$. Zero-shot in-context learning does not specify further hyperparameters since it is training-free.

Zero-shot cross-lingual transfer   In this scenario, following krishnan_2021_multilingual, we fine-tune the model with in-language prompting on the English train split, and then conduct zero-shot in-context tests with this fine-tuned model on other languages using code-switch prompting. Through this setting, we want to verify if task-specific pre-training in a high-resource language such as English helps in other languages. In zero-shot cross-lingual transfer, we use the same hyperparameters and random seed to fine-tune on the English task.

5 Results and Discussion

We first present the results in fully supervised, few-shot and zero-shot scenarios, and then discuss the main findings for answering the research questions in Section 1.

| Method | AR | DE | EN | ES | FA | FR | IT | KO | NL | PL | PT | RU | SV | UK | EN (group) | H | M | L | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EN(B) | - | - | 94.9 | - | - | - | - | - | - | - | - | - | - | - | 94.9 | - | - | - | - |
| XLM-R_EM | 98.4 | 95.7 | 95.9 | 27.9 | 0.0 | 82.6 | 98.9 | 64.6 | 92.2 | 97.4 | 97.4 | 96.9 | 2.2 | 5.1 | 95.9 | 86.1 | 54.3 | 3.7 | 68.2 |
| null prompts | 85.5 | 81.6 | 84.7 | 59.8 | 71.2 | 82.6 | 84.2 | 63.3 | 71.4 | 49.4 | 12.9 | 84.9 | 48.9 | 46.2 | 84.7 | 65.8 | 73.3 | 47.6 | 66.2 |
| CS | 95.1 | 95.4 | 96.0 | 74.7 | 69.2 | 97.2 | 98.3 | 82.1 | 96.9 | 94.8 | 95.3 | 87.6 | 48.9 | 46.2 | 96.0 | 92.5 | 82.1 | 47.6 | 84.1 |
| SP | 95.1 | 88.5 | 96.1 | 81.1 | 65.4 | 97.0 | 97.1 | 83.1 | 59.9 | 95.6 | 96.9 | 87.3 | 63.0 | 51.3 | 96.1 | 87.9 | 81.2 | 57.2 | 82.7 |
| IL | 94.1 | 94.0 | 96.0 | 70.5 | 73.1 | 97.2 | 97.0 | 83.2 | 93.5 | 93.0 | 85.2 | 83.3 | 58.7 | 71.8 | 96.0 | 89.2 | 83.5 | 65.2 | 85.0 |

Table 4: Fully-supervised results in micro-F1 (%) on the SMiLER dataset. The evaluated methods are the proposed baseline EN(B) seganti_2021_smiler, XLM-R_EM, null prompts, and ours. EN (group), H, M, L: macro average across the languages within the respective group. Avg.: macro average across all 14 languages. Our variants match or outperform all baselines on every group average, while XLM-R_EM has good results for many high-resource languages. Overall, in-language prompting performs best, especially for lower-resource languages.

5.1 Fully Supervised Results

Table 4 presents the experimental results in the fully supervised scenario, for different methods, languages, and language groups. We see that all three variants of our method beat the fine-tuning baseline XLM-R_EM and the prompting baseline null prompts, according to the macro-averaged performance across 14 languages. In-language prompting delivers the most promising result, achieving an average of 85.0, which is higher than XLM-R_EM (68.2) and null prompts (66.2). The other two variants, code-switch prompting without and with soft tokens, achieve scores of 84.1 and 82.7, respectively, only 0.9 and 2.3 lower than in-language. All three prompt variants are hence effective in fully supervised scenarios. On a per-group basis, we find that the lower-resourced a language is, the greater the advantage prompting enjoys over fine-tuning. In particular, in-language prompts show better robustness than XLM-R_EM in low-resource languages. Both yield scores of 95.9-96.0 for English, but XLM-R_EM decreases to 54.3 and 3.7 in Group-M and -L, while in-language prompting still delivers 83.5 and 65.2.
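The group averages can be re-derived from the per-language scores. The sketch below does this for the XLM-R_EM row of Table 4. The group membership is an assumption read off the Figure 2 caption (English; Swedish and Ukrainian as lowest-resource; the three non-European languages AR, FA and KO; the remaining European languages as high-resource), and the variable names are illustrative.

```python
from statistics import mean

# Assumed grouping, inferred from the Figure 2 caption.
GROUPS = {
    "EN": ["EN"],
    "H": ["DE", "ES", "FR", "IT", "NL", "PL", "PT", "RU"],
    "M": ["AR", "FA", "KO"],
    "L": ["SV", "UK"],
}

# XLM-R_EM per-language micro-F1 (%) as reported in Table 4.
XLMR_EM = {
    "AR": 98.4, "DE": 95.7, "EN": 95.9, "ES": 27.9, "FA": 0.0, "FR": 82.6,
    "IT": 98.9, "KO": 64.6, "NL": 92.2, "PL": 97.4, "PT": 97.4, "RU": 96.9,
    "SV": 2.2, "UK": 5.1,
}


def aggregate(scores: dict) -> dict:
    """Macro-average per group and over all languages."""
    result = {group: mean(scores[lang] for lang in langs)
              for group, langs in GROUPS.items()}
    result["Avg."] = mean(scores.values())
    return result


aggregated = aggregate(XLMR_EM)
for name, value in aggregated.items():
    print(f"{name}: {value:.2f}")
```

Under this grouping the recomputed values agree with the reported 95.9, 86.1, 54.3, 3.7 and 68.2 up to rounding of the per-language scores, which supports the assumed membership.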

5.2 Few-shot Results

Table 5 presents the per-group results in few-shot experiments. All the methods benefit from larger $K$. Similarly, in-language prompting still turns out to be the best contender, ranking 1st in 8- and 32-shot, and 2nd in 16-shot. We see that in-language outperforms XLM-R_EM for all $K$, while code-switch achieves comparable or even lower scores than XLM-R_EM for small $K$ (32.0 vs. 33.7 at 8 shots), suggesting that the choice of prompt affects the few-shot performance greatly and thus needs careful consideration. On a per-group basis, we find that in-language prompting outperforms other methods for middle- and low-resourced languages. Similar observations can also be drawn from fully supervised results. We conclude that, with sufficient supervision, in-language is the optimal variant to prompt rather than code-switch. We hypothesize that this is due to the pre-training scenario, where the PLM rarely sees code-mixed text santy_2021_bertologicomix.

| Shots | Method | EN | H | M | L | Avg. |
|---|---|---|---|---|---|---|
| 8 | XLM-R_EM | 31.8 | 43.0 | 27.5 | 6.6 | 33.7 |
| 8 | null prompts | 37.4 | 27.6 | 26.6 | 37.4 | 29.5 |
| 8 | CS | 42.2 | 30.6 | 27.8 | 38.4 | 32.0 |
| 8 | SP | 45.4 | 27.8 | 17.9 | 33.6 | 27.4 |
| 8 | IL | 42.2 | 40.5 | 38.3 | 43.4 | 40.6 |
| 16 | XLM-R_EM | 56.4 | 56.9 | 34.1 | 10.4 | 45.3 |
| 16 | null prompts | 42.1 | 31.6 | 34.3 | 49.7 | 35.5 |
| 16 | CS | 50.5 | 50.1 | 41.9 | 53.9 | 48.9 |
| 16 | SP | 53.7 | 46.7 | 38.4 | 49.0 | 45.8 |
| 16 | IL | 50.5 | 45.2 | 42.1 | 54.6 | 46.3 |
| 32 | XLM-R_EM | 73.2 | 62.4 | 44.4 | 6.5 | 51.3 |
| 32 | null prompts | 56.0 | 36.4 | 47.7 | 53.9 | 42.7 |
| 32 | CS | 80.9 | 57.0 | 65.1 | 59.4 | 60.8 |
| 32 | SP | 61.2 | 53.5 | 46.3 | 63.1 | 53.9 |
| 32 | IL | 80.9 | 63.6 | 64.2 | 67.4 | 65.5 |

Table 5: Few-shot results by group in micro-F1 (%) on the SMiLER seganti_2021_smiler dataset averaged over five runs. We macro-average results for each language group (see Figure 2) and over all languages (Avg.). In-language prompting performs best in most settings and language groups. Our variants are especially strong for medium- and lower-resource language groups. See Table 7 in Appendix C for detailed results with mean and std. for each language.
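| Scenario | Setting | EN | AR | DE | ES | FA | FR | IT | KO | NL | PL | PT | RU | SV | UK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random baseline | Random | 2.8 | 11.1 | 4.6 | 4.8 | 12.5 | 4.6 | 4.6 | 3.6 | 4.6 | 4.8 | 4.6 | 12.5 | 4.6 | 14.3 |
| Zero-shot in-context learning | SVO, CS | 5.5 | 69.9 | 10.4 | 12.7 | 38.5 | 13.3 | 11.2 | 10.0 | 12.4 | 14.0 | 8.1 | 52.3 | 27.2 | 51.3 |
| Zero-shot in-context learning | SVO, IL | - | 2.2 | 5.2 | 1.8 | 5.3 | 9.2 | 1.3 | 3.6 | 7.6 | 9.0 | 1.7 | 7.1 | 5.4 | 25.6 |
| Zero-shot in-context learning | SOV, CS | 4.8 | 68.4 | 10.0 | 13.2 | 36.9 | 12.3 | 12.6 | 5.0 | 11.8 | 13.4 | 10.3 | 52.6 | 29.4 | 51.3 |
| Zero-shot in-context learning | SOV, IL | - | 3.8 | 5.0 | 3.6 | 59.8 | 7.7 | 1.3 | 3.1 | 10.0 | 7.9 | 1.4 | 6.0 | 4.5 | 25.6 |
| Zero-shot cross-lingual transfer | EN (268K) | - | 94.0 | 94.9 | 91.7 | 91.1 | 96.0 | 97.5 | 78.2 | 97.5 | 93.3 | 95.2 | 93.8 | 97.8 | 94.7 |
| Zero-shot cross-lingual transfer | EN-small (36K) | - | 45.9 | 64.7 | 73.1 | 70.3 | 82.2 | 77.5 | 30.8 | 79.9 | 59.0 | 67.3 | 76.1 | 77.2 | 54.1 |

Table 6: Zero-shot results in micro-F1 (%) on the SMiLER dataset. "SVO" and "SOV": word order of prompting. The IL rows list no separate EN score, since CS and IL prompting are equivalent for English. Overall, code-switch prompting performs the best in the zero-shot in-context scenario. In cross-lingual transfer experiments, English-task training greatly improves the performance on all the other 13 languages.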

5.3 Zero-shot Results

Table 6 presents the per-language results in zero-shot scenarios. We consider the random baseline for comparison zhao_2021_discrete; winata_2021_language. We notice that the performance of the random baseline varies a lot across languages, since the languages have different numbers of classes in the dataset (cf. Table 3), with English being the hardest task. For zero-shot in-context, code-switch prompting always outperforms the random baseline by a large margin, in both word orders, while in-language prompting performs worse than the random baseline in 6 languages. Code-switch prompting outperforms in-language prompting across all the 13 non-English languages when using the SVO-template. We assume that, without in-language training, the PLM understands the task best when prompted in English. The impressive performance of code-switch shows that the PLM is able to transfer its pre-trained knowledge in English to other languages. We also find that the performance is strongly related to the number of classes, with the worst scores achieved in EN, KO and PT (36, 28 and 22 classes), and the best scores in AR, RU and UK (9, 8 and 7 classes). In addition, we observe that word order does not play a significant role for most languages, except for FA, which is an SOV language and gains 54.5 points from in-language prompting with an SOV-template. For zero-shot cross-lingual transfer, we see that non-English tasks benefit from English in-domain prompt-based fine-tuning, and the gain improves with the English data size. For 5 languages (ES, FA, NL, SV, and UK), zero-shot transfer after training on 268K English examples delivers even better results than in-language fully supervised training (cf. Table 4). sanh_2022_t0 show that including RC-specific prompt input in English during pre-training can help in other languages.
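The dependence of the random baseline on the number of classes can be checked with simple arithmetic. The sketch below assumes that the random baseline corresponds to uniform guessing, whose expected accuracy is 100/C percent for C classes; this is an assumption, since the text does not say how the Random row of Table 6 was obtained. Class counts are taken from Table 3.

```python
# Number of relation classes per language (Table 3).
NUM_CLASSES = {
    "AR": 9, "DE": 22, "EN": 36, "ES": 21, "FA": 8, "FR": 22, "IT": 22,
    "KO": 28, "NL": 22, "PL": 21, "PT": 22, "RU": 8, "SV": 22, "UK": 7,
}

# Random-baseline scores (%) as reported in Table 6.
REPORTED_RANDOM = {
    "AR": 11.1, "DE": 4.6, "EN": 2.8, "ES": 4.8, "FA": 12.5, "FR": 4.6, "IT": 4.6,
    "KO": 3.6, "NL": 4.6, "PL": 4.8, "PT": 4.6, "RU": 12.5, "SV": 4.6, "UK": 14.3,
}


def uniform_random_score(num_classes: int) -> float:
    """Expected accuracy (%) of guessing uniformly among `num_classes` labels."""
    return 100.0 / num_classes


for lang in sorted(NUM_CLASSES):
    expected = uniform_random_score(NUM_CLASSES[lang])
    print(f"{lang}: uniform guess {expected:.2f}, reported {REPORTED_RANDOM[lang]}")
```

The uniform-guess values track the reported row to within about 0.1 points, which is consistent with English (36 classes) being the hardest task and UK (7 classes) the easiest for a random guesser.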

5.4 Discussion

Based on the results above, we answer the research questions from Section 1.

RQ1. Which is the most effective way to prompt? In the fully-supervised and few-shot scenario, in-language prompting displays the best results. This appears to stem from a solid performance across all languages in both settings. Its worst performance is 31.8 for Polish 8-shot (see Table 7 in Appendix C). All other methods have results lower than 15.0 for some language. This indicates that with little supervision mT5 is able to perform the task when prompted in the language of the original text. However, zero-shot results strongly prefer code-switch prompting. It could follow that, without fine-tuning, the model’s understanding of this task is much better in English.

RQ2. How well does our method perform in different data regimes and languages? Averaged over all languages, all our variants outperform the baselines, except for 8-shot. For some high-resource languages, XLM-R_EM is able to outperform our method. On the other hand, for low-resource languages null prompts are a better baseline, which we consistently outperform. This could indicate that prompting the underlying mT5 model is better suited for multilingual RC on SMiLER. Overall, the results suggest that minimal translation can be very helpful for multilingual relation classification.

6 Related Work

Multilingual relation classification   Previous work in multilingual RC has primarily focused on traditional methods rather than prompting PLMs. faruqui-kumar_2015_multilingual machine-translate non-English full text to English to deal with multilinguality. akbik_2016_multilingual employ a shared semantic role labeler to get language-agnostic abstraction and apply rule-based methods to classify the unified abstractions. lin_2017_neural employ convolutional networks to extract relation embeddings from texts, and propose cross-lingual attention between relation embeddings to model cross-lingual information consistency. sanh_2019_hierarchical leverage the embeddings from BiLSTM, which is trained with a set of selected semantic tasks to help (multilingual) relation extraction. koksal-ozgur_2020_relx fine-tune (multilingual) BERT, classifying the embedding at [CLS]. To take entity-related embeddings into consideration as well, nag_2021_data add an extra summarization layer on top of a multilingual BERT to collect and pool the embeddings at both [CLS] and entity starts.

Multilingual prompting   Multilingual prompting is a new yet fast-growing topic. winata_2021_language reduce handcrafting efforts by reformulating general classification tasks into binary classification with answers restricted to true or false for all languages. huang_2022_zero propose a unified multilingual prompt by introducing a so-called ‘‘two-tower’’ encoder, with the template tower producing language-agnostic prompt representation, and the context tower encoding text information. fu_2022_polyglot manually translate prompts and suggest multilingual multitask training to boost the performance for a target downstream task.

7 Conclusion

In this paper, we present a first, simple yet efficient and effective prompt method for multilingual relation classification, which translates only the relation labels. Our prompting outperforms fine-tuning and null prompts in fully supervised and few-shot experiments. With supervised data, in-language prompting achieves the best performance, while in zero-shot scenarios prompting in English is preferable. We attribute the good performance of our method to how well it fits RC: the entity1-relation-entity2 prompts are derived directly from relation triples. We would like to see our method extended to similar tasks, such as semantic role labeling, in which the structure between concepts can be described in natural language.


Limitations

We acknowledge that the main limitation of this work is that we only experiment on one dataset with 14 languages. Multilingual RC datasets prior to SMiLER are limited in their coverage of languages or in the number of unique training examples. It would be interesting to see how our method performs on other multilingual RC datasets, especially for underrepresented languages winata_2022_nusax.

We restrict the target language to those supported by the underlying PLM. The popular multilingual PLMs mT5 and mBART cover 101 and 25 languages in pre-training, respectively. Because we rely on these PLMs, we do not study true low-resource languages that are not represented in them aji_2022_one.

It is noticeable that in the fully supervised scenario, for 7 out of the 14 languages, at least one method achieves a micro-F1 score above 0.95. We hypothesize that this is due to high homogeneity within and between the train and test splits. If so, the dataset itself might not be challenging, and the results may mostly measure how (quickly) the model is able to fit a few indicators.

Like most other prompt methods, ours requires the label names to be natural-language expressions that are indicative of the class. Our method would therefore suffer if the labels were non-descriptive.

Ethics Statement

We use automated machine translation by Google Translate and DeepL for our method. These MT systems contain biases regarding, e.g., gender (“has-author”: “hat Autor”), where gender-neutral English nouns are translated to gendered nouns in the target languages. In this work we evaluate on SMiLER seganti_2021_smiler, which is crawled from Wikipedia. Its authors do not state any measures that prevent the collection of sensitive text. Therefore, we cannot rule out the risk of sensitive content in the data. The PLMs involved in this paper are BERT_Base for EN(B), XLM-R_Base for XLM-R_EM, and mT5_Base for null prompts and our method. BERT_Base is pre-trained on the BooksCorpus (zhu2015aligning) and English Wikipedia. XLM-R_EM is pre-trained on a CommonCrawl corpus. mT5_Base is pre-trained on mC4, a filtered CommonCrawl corpus. All our published models may have inherited biases from these corpora.


Acknowledgements

We would like to thank Nils Feldhus and the anonymous reviewers for their valuable comments and feedback on the paper. This work has been supported by the German Federal Ministry for Economic Affairs and Climate Action as part of the project PLASS (01MD19003E), and by the German Federal Ministry of Education and Research as part of the projects CORA4NLP (01IW20010) and BBDC2 (01IS18025E).


Appendix A Experimental Details

A.1 Hyperparameter Search

We investigated the following possible hyperparameters for the few-shot settings. For the fully supervised setting, we take hyperparameters from the literature (see Section 4.4). Number of epochs: ; Learning rate: . Batch size: , not tuned but selected based on available GPU VRAM. We manually tune these hyperparameters based on the micro-F1 score on the validation set.
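Since model selection above relies on the micro-F1 score, the following minimal sketch recalls how that metric can be computed for single-label classification. It is not the evaluation code used in the paper: the function name `micro_f1` and the optional `ignore` parameter are illustrative assumptions, and this appendix does not state whether any class (for example no_relation) is excluded from scoring.

```python
def micro_f1(gold, pred, ignore=()):
    """Micro-averaged F1 over single-label predictions.

    `ignore` is a hypothetical option: labels listed there are not counted
    as true positives, false positives, or false negatives.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g not in ignore:
                tp += 1          # correct prediction of a scored class
        else:
            if p not in ignore:
                fp += 1          # predicted a scored class wrongly
            if g not in ignore:
                fn += 1          # missed a scored gold class
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with SMiLER label names (the instances are made up).
gold = ["has-author", "birth-place", "no_relation", "has-child"]
pred = ["has-author", "has-child", "no_relation", "has-child"]
score_all = micro_f1(gold, pred)
score_without_none = micro_f1(gold, pred, ignore=("no_relation",))
```

When every class is scored, each wrong prediction counts once as a false positive and once as a false negative, so micro-F1 coincides with accuracy (0.75 in the toy example).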

A.2 Computing Infrastructure

Fully supervised experiments are conducted on a single A100-80GB GPU. Few-shot and zero-shot experiments are conducted on a single A100 GPU.

A.3 Average Running Time

Fully supervised   One training run with mT5_Base and a prompt method (null prompts, CS, SP, or IL) takes 5 hours, either on English alone or on all other languages combined. With XLM-R_EM the running time is 3 hours.

Few-shot   One run with mT5_Base and a prompt method over all languages takes 20 minutes (8-shot), 26 minutes (16-shot), or 36 minutes (32-shot). With XLM-R_EM the running time is 8 minutes.

Zero-shot   For zero-shot in-context experiments, one run with mT5_Base and a prompt method over all languages takes 6 minutes. For zero-shot cross-lingual transfer, the running time equals the English training time (5 hours) plus the inference-only time (6 minutes).

Appendix B Verbalizers for SMiLER

  • EN    "birth-place": "birth place", "eats": "eats", "event-year": "event year", "first-product": "first product", "from-country": "from country", "has-author": "has author", "has-child": "has child", "has-edu": "has education", "has-genre": "has genre", "has-height": "has height", "has-highest-mountain": "has highest mountain", "has-length": "has length", "has-lifespan": "has lifespan", "has-nationality": "has nationality", "has-occupation": "has occupation", "has-parent": "has parent", "has-population": "has population", "has-sibling": "has sibling", "has-spouse": "has spouse", "has-tourist-attraction": "has tourist attraction", "has-type": "has type", "has-weight": "has weight", "headquarters": "headquarters", "invented-by": "invented by", "invented-when": "invented when", "is-member-of": "is member of", "is-where": "located in", "loc-leader": "location leader", "movie-has-director": "movie has director", "no_relation": "no relation", "org-has-founder": "organization has founder", "org-has-member": "organization has member", "org-leader": "organization leader", "post-code": "post code", "starring": "starring", "won-award": "won award";

  • DE    "birth-place": "Geburtsort", "event-year": "Veranstaltungsjahr", "from-country": "vom Land", "has-author": "hat Autor", "has-child": "hat Kind", "has-edu": "hat Bildung", "has-genre": "hat Genre", "has-occupation": "hat Beruf", "has-parent": "hat Elternteil", "has-population": "hat Bevölkerung", "has-spouse": "hat Ehepartner", "has-type": "hat Typ", "headquarters": "Hauptsitz", "is-member-of": "ist Mitglied von", "is-where": "gelegen in", "loc-leader": "Standortleiter", "movie-has-director": "Film hat Regisseur", "no_relation": "keine Beziehung", "org-has-founder": "Organisation hat Gründer", "org-has-member": "Organisation hat Mitglied", "org-leader": "Organisationsleiter", "won-award": "gewann eine Auszeichnung";

  • ES    "birth-place": "lugar de nacimiento", "event-year": "año del evento", "from-country": "del país", "has-author": "tiene autor", "has-child": "tiene hijo", "has-edu": "tiene educación", "has-genre": "tiene género", "has-occupation": "tiene ocupación", "has-parent": "tiene padre", "has-population": "tiene población", "has-spouse": "tiene cónyuge", "has-type": "tiene tipo", "headquarters": "sede central", "is-member-of": "es miembro de", "is-where": "situado en", "loc-leader": "líder de ubicación", "movie-has-director": "película cuenta con el director", "no_relation": "sin relación", "org-has-founder": "organización cuenta con el fundador", "org-has-member": "organización tiene miembro", "won-award": "ganó el premio";

  • FR    "birth-place": "lieu de naissance", "event-year": "année de l’événement", "from-country": "du pays", "has-author": "a un auteur", "has-child": "a un enfant", "has-edu": "a une éducation", "has-genre": "a un genre", "has-occupation": "a une profession", "has-parent": "a un parent", "has-population": "a de la population", "has-spouse": "a un conjoint", "has-type": "a le type", "headquarters": "siège social", "is-member-of": "est membre de", "is-where": "situé à", "loc-leader": "guide d’emplacement", "movie-has-director": "le film a un réalisateur", "no_relation": "aucune relation", "org-has-founder": "l’organisation a un fondateur", "org-has-member": "l’organisation a un membre", "org-leader": "chef d’organisation", "won-award": "a remporté le prix";

  • IT    "birth-place": "luogo di nascita", "event-year": "anno dell’evento", "from-country": "dal paese", "has-author": "ha autore", "has-child": "ha un figlio", "has-edu": "ha un’educazione", "has-genre": "ha genere", "has-occupation": "ha occupazione", "has-parent": "ha un genitore", "has-population": "ha una popolazione", "has-spouse": "ha un coniuge", "has-type": "ha il tipo", "headquarters": "sede centrale", "is-member-of": "è membro di", "is-where": "situato in", "loc-leader": "leader della posizione", "movie-has-director": "il film ha direttore", "no_relation": "nessuna relazione", "org-has-founder": "l’organizzazione ha fondatore", "org-has-member": "l’organizzazione ha un membro", "org-leader": "leader dell’organizzazione", "won-award": "ha vinto un premio";

  • KO    "birth-place": "출생지", "event-year": "이벤트 연도", "first-product": "첫 번째 제품", "from-country": "나라에서", "has-author": "저자가 있다", "has-child": "아이가 있다", "has-edu": "교육이 있다", "has-genre": "장르가 있다", "has-highest-mountain": "가장 높은 산이 있다", "has-nationality": "국적이 있다", "has-occupation": "직업이 있다", "has-parent": "부모가 있다", "has-population": "인구가 있다", "has-sibling": "형제가 있다", "has-spouse": "배우자가 있다", "has-tourist-attraction": "관광명소가 있다", "has-type": "유형이 있습니다", "headquarters": "본부", "invented-by": "에 의해 발명", "invented-when": "언제 발명", "is-member-of": "의 회원입니다", "is-where": "어디에", "movie-has-director": "영화에 감독이 있다", "no_relation": "관계가 없다", "org-has-founder": "조직에는 설립자가 있습니다", "org-has-member": "조직에 구성원이 있습니다", "org-leader": "조직 리더", "won-award": "수상";

  • NL    "birth-place": "geboorteplaats", "event-year": "evenementenjaar", "from-country": "van het land", "has-author": "heeft auteur", "has-child": "heeft kind", "has-edu": "heeft onderwijs", "has-genre": "heeft genre", "has-occupation": "heeft beroep", "has-parent": "heeft ouder", "has-population": "heeft bevolking", "has-spouse": "heeft echtgenoot", "has-type": "heeft type", "headquarters": "hoofdkantoor", "is-member-of": "is lid van", "is-where": "gevestigd in", "loc-leader": "locatieleider", "movie-has-director": "film had regisseur", "no_relation": "geen relatie", "org-has-founder": "organisatie heeft oprichter", "org-has-member": "organisatie heeft lid", "org-leader": "organisatieleider", "won-award": "won prijs";

  • PL    "birth-place": "miejsce urodzenia", "event-year": "rok imprezy", "from-country": "z kraju", "has-author": "ma autor", "has-child": "ma dziecko", "has-edu": "ma wykształcenie", "has-genre": "ma gatunek", "has-occupation": "ma zawód", "has-parent": "ma rodzica", "has-population": "ma ludność", "has-spouse": "ma współmałżonka", "has-type": "ma typ", "headquarters": "siedziba główna", "is-member-of": "jest członkiem", "is-where": "mieszczący się w", "loc-leader": "lider lokalizacji", "movie-has-director": "film ma reżysera", "org-has-founder": "organizacja ma założyciela", "org-has-member": "organizacja ma członków", "org-leader": "lider organizacji", "won-award": "otrzymał nagrodę";

  • PT    "birth-place": "local de nascimento", "event-year": "ano do evento", "from-country": "do país", "has-author": "tem autor", "has-child": "tem filho", "has-edu": "tem educação", "has-genre": "tem género", "has-occupation": "tem ocupação", "has-parent": "tem pai", "has-population": "tem população", "has-spouse": "tem cônjuge", "has-type": "tem tipo", "headquarters": "sede", "is-member-of": "é membro de", "is-where": "localizado em", "loc-leader": "loc leader", "movie-has-director": "filme tem realizador", "no_relation": "sem relação", "org-has-founder": "organização tem fundador", "org-has-member": "organização tem membro", "org-leader": "líder da organização", "won-award": "ganhou prémio";

  • RU    "event-year": "год события", "has-edu": "имеет образование", "has-genre": "имеет жанр", "has-occupation": "имеет профессию", "has-population": "имеет население", "has-type": "имеет тип", "is-member-of": "является членом", "no_relation": "без связи";

  • SV    "birth-place": "födelseort", "event-year": "År för evenemanget", "from-country": "från ett land", "has-author": "har en författare", "has-child": "har chili", "has-edu": "har utbildning", "has-genre": "har en genre", "has-occupation": "har ockuperat", "has-parent": "har en förälder", "has-population": "har en befolkning", "has-spouse": "har make eller maka", "has-type": "har typ", "headquarters": "huvudkontor", "is-member-of": "är medlem i", "is-where": "som ligger i", "loc-leader": "platsansvarig", "movie-has-director": "filmen har regissör", "no_relation": "ingen relation", "org-has-founder": "organisationen har en grundare", "org-has-member": "organisationen har en medlem", "org-leader": "ledare för organisationen", "won-award": "vann ett pris";

  • UK    "event-year": "рік події", "has-edu": "має освіту", "has-genre": "має жанр", "has-occupation": "має заняття", "has-population": "має населення", "has-type": "має тип", "no_relation": "ніякого відношення".
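To illustrate how these verbalizers can be used, the following minimal sketch turns a relation triple into an entity1-relation-entity2 string. It is an illustration under stated assumptions, not the authors' implementation: the dictionary holds only a small subset of the entries listed above, the function names and the plain space-joined template are hypothetical, and the exact prompt format fed to mT5 is not reproduced here. Following the discussion in the paper, the in-language variant (IL) uses the label in the language of the text, while the code-switch variant (CS) uses the English label.

```python
# Small subset of the Appendix B verbalizers, copied from the list above.
VERBALIZERS = {
    "EN": {"birth-place": "birth place", "has-author": "has author",
           "no_relation": "no relation"},
    "DE": {"birth-place": "Geburtsort", "has-author": "hat Autor",
           "no_relation": "keine Beziehung"},
    "FR": {"birth-place": "lieu de naissance", "has-author": "a un auteur",
           "no_relation": "aucune relation"},
}


def verbalize(label: str, lang: str, variant: str = "IL") -> str:
    """Map a class label to its natural-language form (hypothetical helper)."""
    if variant == "IL":      # in-language: label in the language of the text
        return VERBALIZERS[lang][label]
    if variant == "CS":      # code-switch: English label, whatever the text language
        return VERBALIZERS["EN"][label]
    raise ValueError(f"unknown prompt variant: {variant}")


def triple_to_text(head: str, label: str, tail: str,
                   lang: str, variant: str = "IL") -> str:
    """Render entity1-relation-entity2 as a string (assumed template)."""
    return f"{head} {verbalize(label, lang, variant)} {tail}"


in_language = triple_to_text("Faust", "has-author", "Goethe", "DE", "IL")
code_switch = triple_to_text("Faust", "has-author", "Goethe", "DE", "CS")
print(in_language)   # Faust hat Autor Goethe
print(code_switch)   # Faust has author Goethe
```

Because only the label dictionary differs between languages, supporting a new language in this sketch amounts to adding one more entry to `VERBALIZERS`.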

Appendix C Detailed Few-shot Results

Shots  Method        AR         DE         EN         ES         FA         FR         IT
8      XLM-R_EM      58.8±20.2  49.2±7.2   31.8±11.3  12.8±6.4   7.3±4.6    30.6±4.0   52.3±5.0
       null prompts  17.2±10.6  28.1±16.6  37.4±10.1  10.4±7.9   25.8±10.1  14.6±10.4  28.0±22.3
       CS            19.6±10.2  11.1±17.2  42.2±17.5  26.2±21.5  45.0±12.0  36.3±17.4  42.3±4.9
       SP            14.2±5.5   29.1±18.6  45.4±7.9   32.5±12.1  18.8±12.0  20.1±11.0  26.8±19.1
       IL            33.4±25.4  39.0±19.3  42.2±17.5  37.9±15.3  46.0±28.5  39.1±15.9  35.1±19.0
16     XLM-R_EM      67.7±17.5  44.3±23.1  56.4±4.2   19.6±6.5   7.8±9.4    47.5±8.4   76.1±4.3
       null prompts  34.5±18.4  18.1±20.4  42.1±15.5  20.5±12.0  43.2±14.9  28.7±22.0  38.0±18.9
       CS            36.6±18.1  62.5±11.0  50.5±32.3  26.1±21.5  49.7±11.1  47.3±30.3  53.5±27.7
       SP            38.6±17.6  40.2±29.4  53.7±25.2  52.0±13.8  37.9±14.0  51.3±27.3  46.6±24.3
       IL            47.0±32.3  62.5±11.0  50.5±32.3  31.1±22.2  45.6±23.1  21.7±17.5  32.8±18.1
32     XLM-R_EM      81.6±9.4   59.9±29.8  73.2±4.4   21.4±3.1   12.7±6.3   58.8±10.1  81.0±2.6
       null prompts  45.4±20.0  26.0±24.3  56.0±13.4  14.3±15.1  67.4±6.3   48.6±16.8  42.8±21.0
       CS            62.0±26.7  72.1±15.0  80.9±4.3   40.6±30.3  61.0±28.2  51.4±22.1  50.4±37.9
       SP            50.3±19.7  35.5±33.8  61.2±29.3  60.8±26.7  59.0±27.1  74.2±12.5  34.7±36.2
       IL            65.5±22.0  61.9±29.5  80.9±4.3   53.1±28.5  76.4±9.9   62.0±29.1  71.7±26.6

Shots  Method        KO         NL         PL         PT         RU         SV         UK         Avg.
8      XLM-R_EM      16.5±7.9   38.1±10.3  46.6±10.0  53.8±3.8   60.7±7.2   1.3±0.8    12.0±7.0   33.7
       null prompts  36.9±14.1  44.4±8.5   29.7±15.2  26.1±17.0  39.2±12.3  47.4±14.5  27.4±15.6  29.5
       CS            18.7±17.4  28.0±15.6  26.7±16.6  27.0±14.1  47.5±15.4  48.9±16.6  28.0±17.5  32.0
       SP            20.6±19.4  31.8±14.7  26.3±16.2  26.0±15.2  29.6±16.6  36.4±25.3  30.8±28.8  27.4
       IL            35.5±19.9  52.5±4.2   31.8±12.9  32.8±19.3  55.9±14.7  34.1±26.9  52.8±14.2  40.6
16     XLM-R_EM      26.7±5.0   64.7±2.8   62.8±5.6   69.1±2.8   70.9±8.6   1.3±0.4    19.5±12.2  45.3
       null prompts  25.3±19.0  37.5±14.3  37.8±8.8   17.8±16.1  54.3±20.3  56.6±23.5  42.8±20.6  35.5
       CS            39.3±9.4   71.2±9.0   33.5±25.0  45.3±19.1  61.2±26.1  49.4±24.7  58.4±22.4  48.9
       SP            38.6±17.6  40.2±29.4  53.7±25.2  52.0±13.8  37.9±14.0  51.3±27.3  46.6±24.3  45.8
       IL            33.7±20.2  39.2±11.4  58.5±18.9  50.2±19.4  65.4±6.5   51.1±22.9  58.2±20.9  46.3
32     XLM-R_EM      38.8±3.3   74.5±2.8   77.7±1.6   63.2±26.2  62.5±12.8  1.3±1.3    11.7±5.9   51.3
       null prompts  30.2±29.8  54.9±25.6  40.7±21.7  15.1±16.9  48.4±33.7  49.7±30.7  58.1±27.4  42.7
       CS            72.2±6.9   71.4±23.5  39.0±30.0  73.3±8.0   57.7±20.3  67.6±12.0  51.3±23.7  60.8
       SP            29.6±34.7  42.7±33.5  67.4±12.3  47.4±28.0  65.1±19.0  69.2±20.1  57.0±32.1  53.9
       IL            50.8±24.9  71.3±12.5  65.2±25.8  59.5±28.5  63.8±27.4  63.6±26.1  71.1±17.4  65.5

Table 7: Few-shot results in micro-F1 (%) on the SMiLER dataset. We evaluate XLM-R_EM, null prompts, and our prompt variants. For each result, the mean and standard deviation of 5 runs are reported (mean±std). Avg.: macro average across 14 languages. The standard deviations are quite large, which indicates that results are seed-dependent and that multiple runs are needed. In-language prompting provides the most consistent results, with Polish 8-shot as its lowest score (31.8). All other methods have results below 15.0.
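As a worked check of the macro average reported in Table 7, the short sketch below recomputes it for two 8-shot rows from the per-language means given in the table. The helper name `macro_average` is an illustrative assumption; the numbers are copied from the table, with the seven values of the first half of each row followed by the seven values of the second half.

```python
from statistics import mean

# Per-language mean micro-F1 (%) for the 8-shot setting, copied from Table 7.
il_8shot = [33.4, 39.0, 42.2, 37.9, 46.0, 39.1, 35.1,
            35.5, 52.5, 31.8, 32.8, 55.9, 34.1, 52.8]
xlmr_em_8shot = [58.8, 49.2, 31.8, 12.8, 7.3, 30.6, 52.3,
                 16.5, 38.1, 46.6, 53.8, 60.7, 1.3, 12.0]


def macro_average(scores):
    """Unweighted mean over languages, rounded to one decimal as in the table."""
    return round(mean(scores), 1)


il_avg = macro_average(il_8shot)
xlmr_avg = macro_average(xlmr_em_8shot)
print(il_avg)              # 40.6
print(xlmr_avg)            # 33.7
print(min(il_8shot))       # 31.8
print(min(xlmr_em_8shot))  # 1.3
```

Both averages match the values reported in the table's last column, and the per-row minima show why in-language prompting is described as the most consistent variant in the 8-shot setting.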