Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model

12/19/2022
by Yeskendir Koishekenov, et al.

Compared to conventional bilingual translation systems, massively multilingual machine translation is appealing because a single model can translate into multiple languages and benefit from knowledge transfer for low-resource languages. On the other hand, massively multilingual models suffer from the curse of multilinguality unless their size is scaled up massively, which increases their training and inference costs. Sparse Mixture-of-Experts models are a way to drastically increase model capacity without a proportional increase in computation. The recently released NLLB-200 is an example of such a model: it covers 202 languages but requires at least four 32GB GPUs just for inference. In this work, we propose a pruning method that removes up to 80% of experts with a negligible loss in translation quality, which makes it feasible to run the model on a single 32GB GPU. Further analysis suggests that our pruning metrics make it possible to identify language-specific experts and to prune experts that are not relevant for a given language pair.
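
The abstract does not spell out the pruning metric, but one natural reading is utilization-based pruning: accumulate the gate's routing probability per expert on data from a single language pair, then keep only the most-used experts. The sketch below illustrates that idea on a toy gated MoE layer; ToyMoELayer, expert_importance, and prune_experts are illustrative names and assumptions, not NLLB-200's actual architecture or the paper's code.

```python
# Minimal sketch of utilization-based expert pruning for one language pair.
# Assumption: the importance of an expert is approximated by the total gate
# probability mass it receives on that language pair's tokens.
import torch
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """Gated mixture-of-experts layer (illustrative stand-in, not NLLB-200)."""

    def __init__(self, d_model: int = 64, n_experts: int = 128):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def routing_probs(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -> (tokens, n_experts) softmax gate scores
        return torch.softmax(self.gate(x), dim=-1)


@torch.no_grad()
def expert_importance(layer: ToyMoELayer, token_batches) -> torch.Tensor:
    """Accumulate gate probability mass per expert over one language pair's data."""
    scores = torch.zeros(len(layer.experts))
    for x in token_batches:
        scores += layer.routing_probs(x).sum(dim=0)
    return scores


def prune_experts(layer: ToyMoELayer, scores: torch.Tensor, keep_ratio: float = 0.2):
    """Keep only the highest-scoring experts (keep_ratio=0.2 prunes 80%)."""
    n_keep = max(1, int(keep_ratio * len(layer.experts)))
    keep = torch.topk(scores, n_keep).indices.sort().values
    layer.experts = nn.ModuleList(layer.experts[i] for i in keep.tolist())
    # The gate must be sliced to match the surviving experts.
    layer.gate.weight.data = layer.gate.weight.data[keep]
    layer.gate.out_features = n_keep
    return keep


if __name__ == "__main__":
    layer = ToyMoELayer()
    # Stand-in for tokens from one language pair's calibration sentences.
    batches = [torch.randn(32, 64) for _ in range(10)]
    kept = prune_experts(layer, expert_importance(layer, batches), keep_ratio=0.2)
    print(f"kept {len(kept)} of 128 experts")
```

In this sketch the pruned layer only stores and runs the retained experts, which is what would reduce the memory footprint; how NLLB-200's routing and checkpoints are actually sliced is not described in the abstract.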

Related research

- A Framework for Hierarchical Multilingual Machine Translation (05/12/2020)
  Multilingual machine translation has recently been in vogue given its po...

- Scalable and Efficient MoE Training for Multitask Multilingual Models (09/22/2021)
  The Mixture of Experts (MoE) models are an emerging class of sparsely ac...

- Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation (12/15/2022)
  Sparsely gated Mixture of Experts (MoE) models have been shown to be a c...

- Beyond English-Centric Multilingual Machine Translation (10/21/2020)
  Existing work in translation demonstrated the potential of massively mul...

- Can Multilinguality benefit Non-autoregressive Machine Translation? (12/16/2021)
  Non-autoregressive (NAR) machine translation has recently achieved signi...

- Towards Being Parameter-Efficient: A Stratified Sparsely Activated Transformer with Dynamic Capacity (05/03/2023)
  Mixture-of-experts (MoE) models that employ sparse activation have demon...

- Condensing Multilingual Knowledge with Lightweight Language-Specific Modules (05/23/2023)
  Incorporating language-specific (LS) modules is a proven method to boost...
