DeepAI AI Chat
Log In Sign Up

Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model

by   Mingqi Li, et al.
Clemson University
University at Buffalo

Pre-trained multilingual language models play an important role in cross-lingual natural language understanding tasks. However, existing methods did not focus on learning the semantic structure of representation, and thus could not optimize their performance. In this paper, we propose Multi-level Multilingual Knowledge Distillation (MMKD), a novel method for improving multilingual language models. Specifically, we employ a teacher-student framework to adopt rich semantic representation knowledge in English BERT. We propose token-, word-, sentence-, and structure-level alignment objectives to encourage multiple levels of consistency between source-target pairs and correlation similarity between teacher and student models. We conduct experiments on cross-lingual evaluation benchmarks including XNLI, PAWS-X, and XQuAD. Experimental results show that MMKD outperforms other baseline models of similar size on XNLI and XQuAD and obtains comparable performance on PAWS-X. Especially, MMKD obtains significant performance gains on low-resource languages.


page 1

page 2

page 3

page 4


Improving Cross-lingual Information Retrieval on Low-Resource Languages via Optimal Transport Distillation

Benefiting from transformer-based pre-trained language models, neural ra...

You Can Have Your Data and Balance It Too: Towards Balanced and Efficient Multilingual Models

Multilingual models have been widely used for cross-lingual transfer to ...

MergeDistill: Merging Pre-trained Language Models using Distillation

Pre-trained multilingual language models (LMs) have achieved state-of-th...

SKDBERT: Compressing BERT via Stochastic Knowledge Distillation

In this paper, we propose Stochastic Knowledge Distillation (SKD) to obt...

Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text

Self-supervised representation learning has proved to be a valuable comp...

Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning

In this paper, we elaborate upon recipes for building multilingual repre...

Improving Neural Cross-Lingual Summarization via Employing Optimal Transport Distance for Knowledge Distillation

Current state-of-the-art cross-lingual summarization models employ multi...