An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models

04/19/2023
by   Varun Gumma, et al.

Knowledge distillation (KD) is a well-known method for compressing neural models. However, works focusing on distilling knowledge from large multilingual neural machine translation (MNMT) models into smaller ones are practically nonexistent, despite the popularity and superiority of MNMT. This paper bridges this gap by presenting an empirical investigation of knowledge distillation for compressing MNMT models. We take Indic-to-English translation as a case study and demonstrate that commonly used language-agnostic and language-aware KD approaches yield models that are 4-5x smaller but suffer performance drops of up to 3.5 BLEU. To mitigate this, we then experiment with design considerations such as shallower versus deeper models, heavy parameter sharing, multi-stage training, and adapters. We observe that deeper compact models tend to be as good as shallower non-compact ones, and that fine-tuning a distilled model on a high-quality subset slightly boosts translation quality. Overall, we conclude that compressing MNMT models via KD is challenging, indicating immense scope for further research.
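For readers unfamiliar with the distillation setup the abstract refers to, the sketch below illustrates the standard word-level KD objective for NMT: a weighted mix of cross-entropy on the reference tokens and a KL term that pulls the student's per-token distribution toward the teacher's. This is a minimal illustration, not the paper's exact recipe; the function name word_level_kd_loss and the alpha/temperature hyperparameters are assumptions for the example, and both models are assumed to emit per-token logits over a shared vocabulary.

# Minimal sketch of word-level knowledge distillation for NMT (illustrative only,
# not the exact recipe from this paper). Assumes teacher and student decoders
# both produce per-token logits of shape (batch, seq_len, vocab).
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, gold_ids, pad_id,
                       alpha=0.5, temperature=1.0):
    """Blend cross-entropy on the reference tokens with a KL term that pulls
    the student's per-token distribution toward the teacher's."""
    vocab = student_logits.size(-1)

    # Standard translation loss against the reference, ignoring padding.
    ce = F.cross_entropy(student_logits.view(-1, vocab),
                         gold_ids.view(-1), ignore_index=pad_id)

    # KL(teacher || student) over the vocabulary at every target position.
    t = temperature
    kl = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                  F.softmax(teacher_logits / t, dim=-1),
                  reduction="none").sum(-1)

    # Average the KL term over non-padding positions only.
    mask = gold_ids.ne(pad_id).float()
    kl = (kl * mask).sum() / mask.sum()

    return alpha * ce + (1.0 - alpha) * (t * t) * kl

Sequence-level KD, which the MNMT distillation literature also uses, instead trains the student directly on the teacher's decoded outputs rather than on soft per-token distributions.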


