Do Multi-Lingual Pre-trained Language Models Reveal Consistent Token Attributions in Different Languages?

12/23/2021
by   Junxiang Wang, et al.
Emory University

During the past several years, a surge of multi-lingual Pre-trained Language Models (PLMs) has been proposed to achieve state-of-the-art performance in many cross-lingual downstream tasks. However, the understanding of why multi-lingual PLMs perform well is still an open question. For example, it is unclear whether multi-lingual PLMs reveal consistent token attributions in different languages. To address this, in this paper we propose a Cross-lingual Consistency of Token Attributions (CCTA) evaluation framework. Extensive experiments in three downstream tasks demonstrate that multi-lingual PLMs assign significantly different attributions to multi-lingual synonyms. Moreover, we make the following observations: 1) Spanish yields the most consistent token attributions across languages when it is used for training PLMs; 2) the consistency of token attributions strongly correlates with performance in downstream tasks.


1 Introduction

Cross-lingual zero-shot transfer is a fundamental NLP task that aims to overcome language barriers by transferring model knowledge trained on source/high-resource languages (e.g. English) to target/low-resource languages (e.g. Hindi) in the absence of explicit supervision. Multi-lingual Pre-trained Language Models (PLMs) such as multi-lingual BERT (mBERT) Pires et al. (2019), XLM Conneau and Lample (2019), and XLM-Roberta (XLM-R) Conneau et al. (2020) have demonstrated superior performance on many cross-lingual zero-shot downstream tasks such as natural language inference and question answering.
However, why multi-lingual PLMs perform surprisingly well remains an open question. Previous works have investigated them extensively from various angles, for example their linguistic properties Chi et al. (2020); Edmiston (2020); Pires et al. (2019); Rama et al. (2020); Kulmizev et al. (2020), language neutrality Libovickỳ et al. (2019, 2020), layer representations de Vries et al. (2020); Singh et al. (2019); Tenney et al. (2019); Karthikeyan et al. (2019); Wu and Dredze (2019), and language generation Rönnqvist et al. (2019). Another line of related work studies multi-lingual model representations on parallel corpora Kudugunta et al. (2019), including probing techniques that investigate linguistic properties such as typological features Vulić et al. (2020a); Ravishankar et al. (2019b, a); Bjerva and Augenstein (2021, 2018); Choenni and Shutova (2020); Oncevay et al. (2020), and isomorphism measures Liu et al. (2019); Patra et al. (2019); Søgaard et al. (2018); Vulić et al. (2020b).
Although existing literature has made much progress on the interpretation of multi-lingual PLMs, to the best of our knowledge, the attribution (i.e. importance) of multi-lingual tokens to the predictions of PLMs in downstream tasks has not yet been investigated. Such an investigation would help us understand how multi-lingual PLMs, trained in source languages, distinguish important tokens from others, and whether this understanding transfers to target languages. In this paper, we explore the following question in downstream tasks that require parallel texts (i.e. texts placed alongside their translations): do multi-lingual PLMs reveal consistent token attributions in different languages? To address this, we propose a Cross-lingual Consistency of Token Attributions (CCTA) evaluation framework. It differs from the isomorphism frameworks of previous works, which focus on the representation of tokens (i.e. token embeddings), whereas we focus on the importance of tokens (i.e. token attributions). Extensive experiments on three benchmark datasets (i.e. three downstream tasks) demonstrate that multi-lingual PLMs attach significantly different attributions to multi-lingual synonyms. Moreover, we make the following observations: 1) Spanish yields the most consistent token attributions across languages when it is used for training PLMs; 2) the consistency of token attributions strongly correlates with performance in downstream tasks.

Figure 1: Overall architecture of our proposed CCTA evaluation framework.

2 CCTA Framework

This section introduces our proposed CCTA framework to evaluate the consistency of token attributions of multi-lingual PLMs in downstream tasks that require parallel texts. Each parallel text consists of a text and its translation from a source language and a target language, respectively. The source language is used to train multi-lingual PLMs, while the target language is used to evaluate them. Figure 1 shows the overall architecture. Firstly, we use the state-of-the-art Layer-based Integrated Gradients (LIG) method to trace token attributions. Next, we align multi-lingual token embeddings into a common comparable semantic space. Finally, the well-known earth mover's similarity is utilized to measure the consistency of token attributions by optimizing a linear programming problem. All steps are detailed in the following sections.

2.1 Token Attribution Quantification

Given any parallel text, the state-of-the-art Layer-based Integrated Gradients (LIG) method Sundararajan et al. (2017) is applied to quantify token attributions. In contrast with previous attribution methods, LIG follows the axioms of attribution methods and teases apart errors of the attribution method from misbehaviors of multi-lingual PLMs. It measures how the gradient changes along a path from a reference (i.e. baseline) to the input. Given an input vector $x$, its baseline $x'$ (i.e. the starting point of the path), and a prediction function $F$, the accumulated gradient along the path is:

$$\mathrm{IG}_i(x) = (x_i - x'_i)\int_{0}^{1}\frac{\partial F\big(x' + \alpha(x - x')\big)}{\partial x_i}\,d\alpha \qquad (1)$$

where $x_i$ is the $i$-th dimension of $x$. As $\alpha$ increases from $0$ to $1$, the path moves from $x'$ to $x$, and LIG integrates the gradient along this path. Equation (1) requires the differentiability of $F$. Unfortunately, the input of a multi-lingual PLM is a sequence of non-differentiable token IDs. To address this issue, the embedding layer of a multi-lingual PLM is chosen as the input, and all embedding attributions are aggregated. The baseline in Equation (1) is constructed as follows: we keep separation tokens and replace all other tokens with padding tokens in each sentence. Let $d$ be the dimensionality of the embedding layer. Given a parallel text $(s, t)$, where $s_i$ and $t_j$ are the $i$-th and $j$-th tokens of sentences $s$ and $t$, respectively, attributions are aggregated as:

$$a_{s_i} = \sum_{k=1}^{d}\mathrm{IG}_k(s_i), \qquad a_{t_j} = \sum_{k=1}^{d}\mathrm{IG}_k(t_j)$$

where $a_{s_i}$ and $a_{t_j}$ are the attributions of $s_i$ and $t_j$, respectively. Namely, the attribution of a token is the sum of its embedding attributions along the dimensionality of the embedding layer.
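To make these two steps concrete, the following sketch computes per-token attributions with Captum's LayerIntegratedGradients applied to the embedding layer of a multi-lingual PLM, using the padding-token baseline described above. The checkpoint name, the helper function, and the use of Captum are illustrative assumptions rather than the authors' exact implementation.

    import torch
    from captum.attr import LayerIntegratedGradients
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Illustrative checkpoint; any multi-lingual PLM with an embedding layer works.
    name = "bert-base-multilingual-cased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    model.eval()

    def forward_func(input_ids, attention_mask):
        # Return class logits; `target` below selects which logit to attribute.
        return model(input_ids=input_ids, attention_mask=attention_mask).logits

    def token_attributions(sentence, target_class=0):
        enc = tokenizer(sentence, return_tensors="pt")
        input_ids, attention_mask = enc["input_ids"], enc["attention_mask"]
        # Baseline of Equation (1): keep special/separation tokens and replace
        # all other tokens with padding tokens.
        special = torch.tensor(tokenizer.get_special_tokens_mask(
            input_ids[0].tolist(), already_has_special_tokens=True)).bool()
        baseline_ids = input_ids.clone()
        baseline_ids[0, ~special] = tokenizer.pad_token_id
        lig = LayerIntegratedGradients(forward_func, model.get_input_embeddings())
        attrs = lig.attribute(inputs=input_ids, baselines=baseline_ids,
                              additional_forward_args=(attention_mask,),
                              target=target_class)
        # Sum over the d embedding dimensions to get one attribution per token.
        return attrs.sum(dim=-1).squeeze(0)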

2.2 Multi-Lingual Aligned Token Embeddings

Multi-lingual PLMs usually provide contextual embeddings, which are mapped into different semantic spaces Peters et al. (2018). In order to bridge the semantic gap, token embeddings are aligned to a shared context-free semantic space. Let $e_{s_i}$ and $e_{t_j}$ denote the embeddings of $s_i$ and $t_j$ in this shared semantic space, respectively; the semantic similarity between them is measured by the cosine similarity:

$$\mathrm{sim}(s_i, t_j) = \frac{e_{s_i}\cdot e_{t_j}}{\lVert e_{s_i}\rVert\,\lVert e_{t_j}\rVert}$$
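As a small illustration of this step, the sketch below computes the pairwise cosine similarities between aligned source-token and target-token embeddings; the random vectors are placeholders standing in for the aligned context-free embeddings.

    import numpy as np

    def cosine_similarity(u, v):
        # Cosine similarity between two aligned token embeddings.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def similarity_matrix(src_embs, tgt_embs):
        # sim[i][j] compares the i-th source token with the j-th target token.
        return np.array([[cosine_similarity(u, v) for v in tgt_embs] for u in src_embs])

    # Placeholder aligned embeddings for a 4-token source sentence and a
    # 5-token target sentence (e.g. 300-dimensional aligned vectors).
    rng = np.random.default_rng(0)
    src_embs, tgt_embs = rng.normal(size=(4, 300)), rng.normal(size=(5, 300))
    print(similarity_matrix(src_embs, tgt_embs).shape)  # (4, 5)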

2.3 Consistency of Token Attributions

Finally, the well-known earth mover's similarity is used to measure the consistency of token attributions between a source language and a target language Hitchcock (1941). It is obtained by optimizing the following linear programming problem:

$$C = \max_{W \ge 0}\; \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}\,\mathrm{sim}(s_i, t_j) \quad \text{s.t.} \quad \sum_{j=1}^{n} W_{ij} = \hat{a}_{s_i},\;\; \sum_{i=1}^{n} W_{ij} = \hat{a}_{t_j} \qquad (2)$$

where $C$ is the consistency of token attributions, and $n$ is the maximal length of sentences $s$ and $t$. $\hat{a}_{s_i}$ and $\hat{a}_{t_j}$ denote the normalized values of $a_{s_i}$ and $a_{t_j}$, respectively, i.e. $\hat{a}_{s_i} = a_{s_i}/\sum_{k} a_{s_k}$ and $\hat{a}_{t_j} = a_{t_j}/\sum_{k} a_{t_k}$. The weight $W_{ij}$ quantifies the consistency of token attributions from $s_i$ to $t_j$. The larger $C$ is, the more likely multi-lingual PLMs attach equal importance to multi-lingual synonyms. Equation (2) can be efficiently optimized by an existing linear programming solver Mitchell et al. (2011).
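Because this linear program is solved with PuLP Mitchell et al. (2011), a minimal sketch is given below. It assumes the maximization form of Equation (2) as reconstructed above, with non-negative attributions normalized into distributions; all function and variable names are illustrative.

    import pulp

    def earth_movers_similarity(attr_s, attr_t, sim):
        # attr_s, attr_t: non-negative token attributions of the source/target sentence.
        # sim[i][j]: cosine similarity between aligned embeddings of tokens s_i and t_j.
        a = [x / sum(attr_s) for x in attr_s]   # normalized source attributions
        b = [y / sum(attr_t) for y in attr_t]   # normalized target attributions
        n, m = len(a), len(b)

        prob = pulp.LpProblem("earth_movers_similarity", pulp.LpMaximize)
        w = [[pulp.LpVariable(f"w_{i}_{j}", lowBound=0) for j in range(m)]
             for i in range(n)]
        # Maximize the total similarity carried by the transport plan W.
        prob += pulp.lpSum(w[i][j] * sim[i][j] for i in range(n) for j in range(m))
        # Each source token sends exactly its normalized attribution ...
        for i in range(n):
            prob += pulp.lpSum(w[i][j] for j in range(m)) == a[i]
        # ... and each target token receives exactly its normalized attribution.
        for j in range(m):
            prob += pulp.lpSum(w[i][j] for i in range(n)) == b[j]

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return pulp.value(prob.objective)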

3 Experiments

In this section, we evaluate our proposed CCTA framework on three datasets. All experiments were conducted on a Linux server with an Nvidia Quadro RTX 6000 GPU and 24GB of memory. The code and the empirical study are available in the supplementary materials.

3.1 Datasets and Models

The CCTA evaluation framework was tested on three datasets (i.e. three downstream tasks): XNLI, PAWS-X, and XQuAD. XNLI is a benchmark dataset used for the cross-lingual natural language inference task, and 13 languages were studied in the XNLI: English (en), Arabic (ar), Bulgarian (bg), German (de), Greek (el), Spanish (es), French (fr), Hindi (hi), Russian (ru), Thai (th), Turkish (tr), Vietnamese (vi), and Chinese (zh) Hu et al. (2020). PAWS-X is a benchmark dataset used for the cross-lingual paraphrase identification task, and six languages were studied in the PAWS-X: English (en), German (de), Spanish (es), French (fr), Korean (ko), and Chinese (zh) Hu et al. (2020). XQuAD is a benchmark dataset used for the cross-lingual question answering task, and 11 languages were explored in the XQuAD: English (en), Arabic (ar), German (de), Greek (el), Spanish (es), Hindi (hi), Russian (ru), Thai (th), Turkish (tr), Vietnamese (vi) and Chinese (zh) Hu et al. (2020).

Three state-of-the-art multi-lingual PLMs are utilized for comparison: multi-lingual BERT (mBERT), the Cross-lingual Language Model (XLM), and XLM-Roberta (XLM-R). The XLM and the XLM-R both have a base model and a large model. We fine-tuned all models on English for 10 epochs and selected the best models based on performance on the built-in dev sets of all languages in the three datasets. Finally, performance was evaluated on the test sets of all languages in the three datasets. The Adam optimizer Kingma and Ba (2015) was used with a learning rate of 1e-5 and no weight decay. The batch sizes for the base and large models of the XLM and the XLM-R were set to 128 and 32, respectively, and the batch size for mBERT was set to 32. All models were pretrained using Masked Language Modeling (MLM).
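For reference, the following sketch mirrors the fine-tuning setup described above (English training, 10 epochs, Adam with learning rate 1e-5 and no weight decay) using the Hugging Face Trainer; the checkpoint, dataset identifiers, and batch size shown are illustrative assumptions and may differ from the authors' exact configuration.

    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    name = "xlm-roberta-base"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=3)

    # English XNLI for training; other languages are only used for evaluation.
    xnli_en = load_dataset("xnli", "en")

    def encode(batch):
        return tokenizer(batch["premise"], batch["hypothesis"],
                         truncation=True, max_length=128, padding="max_length")

    xnli_en = xnli_en.map(encode, batched=True)

    args = TrainingArguments(output_dir="xnli-en",
                             num_train_epochs=10,
                             learning_rate=1e-5,
                             weight_decay=0.0,
                             per_device_train_batch_size=32,
                             per_device_eval_batch_size=32,
                             evaluation_strategy="epoch",
                             save_strategy="epoch",
                             load_best_model_at_end=True)

    trainer = Trainer(model=model, args=args,
                      train_dataset=xnli_en["train"],
                      eval_dataset=xnli_en["validation"])
    trainer.train()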

Table 1: All consistency scores of token attributions in the XNLI dataset.
Model en ar bg de el es fr hi ru th tr vi zh Overall
mBERT 1.000 0.194 0.240 0.270 0.245 0.331 0.290 0.217 0.229 0.207 0.245 0.232 0.336 0.310
XLM-Base 1.000 0.206 0.256 0.291 0.257 0.352 0.310 0.219 0.249 0.214 0.262 0.243 0.352 0.324
XLM-Large 1.000 0.198 0.245 0.273 0.250 0.334 0.294 0.222 0.232 0.211 0.248 0.235 0.340 0.314
XLM-R-Base 1.000 0.195 0.242 0.270 0.246 0.330 0.290 0.219 0.230 0.210 0.244 0.233 0.335 0.311
XLM-R-Large 1.000 0.198 0.245 0.275 0.248 0.334 0.294 0.222 0.234 0.212 0.248 0.234 0.342 0.314

Table 2: All consistency scores of token attributions in the PAWS-X dataset.
Model en de es fr ko zh Overall
mBERT 1.000 0.411 0.459 0.423 0.266 0.337 0.483
XLM-Base 1.000 0.416 0.465 0.429 0.272 0.339 0.487
XLM-R-Base 1.000 0.406 0.455 0.418 0.262 0.333 0.479

Table 3: All consistency scores of token attributions in the XQuAD dataset.
Start Position of Answer
Model en ar de el es hi ru th tr vi zh Overall
mBERT 1.000 0.234 0.349 0.299 0.414 0.276 0.285 0.237 0.321 0.279 0.363 0.369
XLM-R-Base 1.000 0.233 0.343 0.295 0.409 0.275 0.282 0.235 0.319 0.279 0.356 0.366
End Position of Answer
Model en ar de el es hi ru th tr vi zh Overall
mBERT 1.000 0.235 0.350 0.299 0.415 0.278 0.285 0.237 0.320 0.281 0.364 0.369
XLM-R-Base 1.000 0.235 0.345 0.297 0.411 0.277 0.283 0.237 0.319 0.281 0.357 0.367

3.2 Inconsistent Token Attributions

Tables 1, 2 and 3 show the consistency of token attributions between English and all other languages in the XNLI, the PAWS-X, and the XQuAD, respectively. Most of the scores are clearly below 0.5. For example, the best scores aside from English are around 0.35, 0.46, and 0.36 in the XNLI, the PAWS-X, and the XQuAD, respectively. This indicates that multi-lingual PLMs assign different attributions to multi-lingual synonyms. Moreover, while there is little distinction among multi-lingual PLMs, there are notable gaps among languages. For example, the scores of all multi-lingual PLMs in zh (Chinese) are about 0.14 higher than those in ar (Arabic) in the XNLI.

3.3 Most Consistent Token Attributions Trained in Spanish

Table 4: The test accuracy in the PAWS-X dataset for different source languages using the XLM-R-Base model.
Language en de es fr ko zh Overall
en 0.944 0.868 0.887 0.882 0.732 0.799 0.852
de 0.936 0.878 0.888 0.893 0.758 0.808 0.860
es 0.935 0.875 0.901 0.899 0.774 0.808 0.865
fr 0.933 0.875 0.897 0.901 0.753 0.814 0.862
ko 0.907 0.854 0.859 0.862 0.807 0.828 0.853
zh 0.922 0.856 0.870 0.861 0.790 0.837 0.860

Table 5: All consistency scores of token attributions in the PAWS-X dataset for different source languages using the XLM-R-Base model.
Language en de es fr ko zh Overall
en 1 0.406 0.455 0.418 0.262 0.333 0.479
de 0.410 1 0.513 0.504 0.392 0.294 0.519
es 0.458 0.512 1 0.604 0.424 0.341 0.557
fr 0.419 0.502 0.602 1 0.410 0.315 0.541
ko 0.263 0.390 0.425 0.411 1 0.302 0.465
zh 0.331 0.291 0.340 0.315 0.302 1 0.430

Next, Tables 4 and 5 show the test accuracy and the consistency scores of token attributions in the PAWS-X dataset for different source languages, respectively. Each row and each column represent a source language and a target language, respectively. Spanish (es) achieves the most consistent token attributions: it not only performs well on close languages such as English (en), German (de), and French (fr), but also reaches a fair score on a distant language such as Korean (ko). The scores for French (fr) and German (de) are also better than the score for English.

3.4 Strong Correlations Between Performance and Consistency Scores

Figure 2: The correlation coefficients between performance and consistency scores in three datasets (The correlation coefficient of the XLM-Base model in the XQuAD dataset is unavailable).

Finally, the correlation coefficients between performance and consistency scores for the three datasets and the different multi-lingual PLMs are shown in Figure 2. All multi-lingual PLMs demonstrate strong positive correlations. Moreover, the correlation strength coincides with the difficulty of the dataset: the simpler a dataset is, the stronger the correlation.
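As an illustration of how such a correlation can be computed, the sketch below applies Pearson's correlation to per-language pairs of consistency score and task performance; the numbers are placeholder values, not results from the paper.

    from scipy.stats import pearsonr

    # Placeholder per-language values for one PLM on one dataset.
    consistency = [1.00, 0.41, 0.46, 0.42, 0.27, 0.34]
    accuracy    = [0.94, 0.87, 0.90, 0.90, 0.81, 0.84]

    r, p_value = pearsonr(consistency, accuracy)
    print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")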

4 Conclusion

In this work, we propose the CCTA framework to assess the consistency of token attributions of multi-lingual PLMs. Specifically, given parallel texts, token attributions (i.e. importance) are quantified by the state-of-the-art Layer-based Integrated Gradients (LIG) method. Then all tokens are aligned into a common comparable embedding space. Finally, the well-known earth mover's similarity is utilized to measure the consistency scores. Experimental results in three downstream tasks show that multi-lingual PLMs yield inconsistent token attributions across languages.

References

  • J. Bjerva and I. Augenstein (2018) From phonology to syntax: unsupervised linguistic typology at different levels with language embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 907–916. Cited by: §1.
  • J. Bjerva and I. Augenstein (2021) Does typological blinding impede cross-lingual sharing?. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 480–486. Cited by: §1.
  • E. A. Chi, J. Hewitt, and C. D. Manning (2020) Finding universal grammatical relations in multilingual bert. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5564–5577. Cited by: §1.
  • R. Choenni and E. Shutova (2020) What does it mean to be language-agnostic? probing multilingual sentence encoders for typological properties. arXiv preprint arXiv:2009.12862. Cited by: §1.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020) Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451. Cited by: §1.
  • A. Conneau and G. Lample (2019) Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §1.
  • W. de Vries, A. van Cranenburgh, and M. Nissim (2020) What’s so special about bert’s layers? a closer look at the nlp pipeline in monolingual and multilingual models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4339–4350. Cited by: §1.
  • D. Edmiston (2020) A systematic analysis of morphological content in bert models for multiple languages. arXiv preprint arXiv:2004.03032. Cited by: §1.
  • F. L. Hitchcock (1941) The distribution of a product from several sources to numerous localities. Journal of mathematics and physics 20 (1-4), pp. 224–230. Cited by: §2.3.
  • J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp. 4411–4421. Cited by: §3.1.
  • K. Karthikeyan, Z. Wang, S. Mayhew, and D. Roth (2019) Cross-lingual ability of multilingual bert: an empirical study. In International Conference on Learning Representations, Cited by: §1.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR (Poster), Cited by: §3.1.
  • S. Kudugunta, A. Bapna, I. Caswell, and O. Firat (2019) Investigating multilingual nmt representations at scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1565–1575. Cited by: §1.
  • A. Kulmizev, V. Ravishankar, M. Abdou, and J. Nivre (2020) Do neural language models show preferences for syntactic formalisms?. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4077–4091. Cited by: §1.
  • J. Libovickỳ, R. Rosa, and A. Fraser (2019) How language-neutral is multilingual bert?. arXiv preprint arXiv:1911.03310. Cited by: §1.
  • J. Libovickỳ, R. Rosa, and A. Fraser (2020) On the language neutrality of pre-trained multilingual representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1663–1674. Cited by: §1.
  • N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019) Linguistic knowledge and transferability of contextual representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1073–1094. Cited by: §1.
  • S. Mitchell, M. O'Sullivan, and I. Dunning (2011) PuLP: a linear programming toolkit for python. The University of Auckland, Auckland, New Zealand, pp. 65. Cited by: §2.3.
  • A. Oncevay, B. Haddow, and A. Birch (2020) Bridging linguistic typology and multilingual machine translation with multi-view language representations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2391–2406. Cited by: §1.
  • B. Patra, J. R. A. Moniz, S. Garg, M. R. Gormley, and G. Neubig (2019) Bilingual lexicon induction with semi-supervision in non-isometric embedding spaces. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 184–193. Cited by: §1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Cited by: §2.2.
  • T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual bert?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4996–5001. Cited by: §1.
  • T. Rama, L. Beinborn, and S. Eger (2020) Probing multilingual bert for genetic and typological signals. In Proceedings of the 28th International Conference on Computational Linguistics, pp. 1214–1228. Cited by: §1.
  • V. Ravishankar, M. Gökırmak, L. Øvrelid, and E. Velldal (2019a) Multilingual probing of deep pre-trained contextual encoders. In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pp. 37–47. Cited by: §1.
  • V. Ravishankar, L. Øvrelid, and E. Velldal (2019b) Probing multilingual sentence representations with x-probe. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 156–168. Cited by: §1.
  • S. Rönnqvist, J. Kanerva, T. Salakoski, and F. Ginter (2019) Is multilingual bert fluent in language generation?. In Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing, pp. 29–36. Cited by: §1.
  • J. Singh, B. McCann, R. Socher, and C. Xiong (2019) BERT is not an interlingua and the bias of tokenization. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pp. 47–55. Cited by: §1.
  • A. Søgaard, S. Ruder, and I. Vulić (2018) On the limitations of unsupervised bilingual dictionary induction. arXiv preprint arXiv:1805.03620. Cited by: §1.
  • M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328. Cited by: §2.1.
  • I. Tenney, D. Das, and E. Pavlick (2019) BERT rediscovers the classical nlp pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4593–4601. Cited by: §1.
  • I. Vulić, E. M. Ponti, R. Litschko, G. Glavaš, and A. Korhonen (2020a) Probing pretrained language models for lexical semantics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 7222–7240. Cited by: §1.
  • I. Vulić, S. Ruder, and A. Søgaard (2020b) Are all good word vector spaces isomorphic?. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3178–3192. Cited by: §1.
  • S. Wu and M. Dredze (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of bert. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 833–844. Cited by: §1.