One-stop Training of Multiple Capacity Models for Multilingual Machine Translation

05/23/2023
by   Lan Jiang, et al.

Training models with varying capacities can be advantageous for deploying them in different scenarios. While high-capacity models offer better performance, low-capacity models require fewer computing resources for training and inference. In this work, we propose a novel one-stop training framework consisting of two composite model architectures and a joint training algorithm called Two-Stage Joint-Training (TSJT). Unlike knowledge distillation, where models of different capacities are trained separately from scratch, our approach integrates supervision signals from flexible-capacity models simultaneously, leading to faster and more efficient convergence. Extensive experiments on the WMT10 benchmark show that our method outperforms the low-capacity baseline models and achieves performance comparable to or better than the high-capacity models. Notably, our analysis demonstrates that our method significantly influences the initial training process, leading to more efficient convergence and superior solutions.
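The abstract only outlines the approach, so the sketch below is a rough illustration of the core idea rather than the paper's TSJT implementation: a low-capacity and a high-capacity variant share parameters within one composite model, and both receive supervision from the same batch in a single joint update. All names (CompositeModel, joint_step, alpha) and architectural choices here are hypothetical assumptions, written in PyTorch.

import torch
import torch.nn as nn

class CompositeModel(nn.Module):
    """Toy composite model: the low-capacity path reuses the first few
    layers of the high-capacity path (weight sharing)."""
    def __init__(self, vocab=1000, d_model=256, n_layers_high=6, n_layers_low=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(n_layers_high)
        ])
        self.n_layers_low = n_layers_low
        self.proj = nn.Linear(d_model, vocab)

    def forward(self, x, capacity="high"):
        h = self.embed(x)
        # Low-capacity variant runs only the first n_layers_low layers.
        n = len(self.layers) if capacity == "high" else self.n_layers_low
        for layer in self.layers[:n]:
            h = layer(h)
        return self.proj(h)  # (batch, seq_len, vocab)

def joint_step(model, optimizer, src, tgt, alpha=0.5):
    """One joint update: both capacity variants are supervised on the
    same batch, and their losses are combined with weight alpha."""
    criterion = nn.CrossEntropyLoss()
    loss_high = criterion(model(src, "high").transpose(1, 2), tgt)
    loss_low = criterion(model(src, "low").transpose(1, 2), tgt)
    loss = alpha * loss_high + (1 - alpha) * loss_low
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random toy data (shapes are illustrative only).
model = CompositeModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
src = torch.randint(0, 1000, (8, 16))   # (batch, seq_len) token ids
tgt = torch.randint(0, 1000, (8, 16))
print(joint_step(model, optimizer, src, tgt))

Because the two paths share weights, a single backward pass propagates gradients from both losses into the shared layers; how TSJT actually stages and weights these supervision signals is detailed in the full paper, not here.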


