Gradient Vaccine: Investigating and Improving Multi-task Optimization in Massively Multilingual Models

10/12/2020
by   Zirui Wang, et al.

Massively multilingual models subsuming tens or even hundreds of languages pose great challenges to multi-task optimization. While it is common practice to apply a language-agnostic procedure optimizing a joint multilingual task objective, how to properly characterize and take advantage of its underlying problem structure for improving optimization efficiency remains under-explored. In this paper, we attempt to peek into the black box of multilingual optimization through the lens of loss function geometry. We find that gradient similarity measured along the optimization trajectory is an important signal, which correlates well not only with language proximity but also with overall model performance. This observation helps us identify a critical limitation of existing gradient-based multi-task learning methods, and we therefore derive a simple and scalable optimization procedure, named Gradient Vaccine, which encourages more geometrically aligned parameter updates for close tasks. Empirically, our method obtains significant performance gains on multilingual machine translation and XTREME benchmark tasks for multilingual language models. Our work reveals the importance of properly measuring and utilizing language proximity in multilingual optimization, and has broader implications for multi-task learning beyond multilingual modeling.
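
For a concrete picture of the kind of update the abstract describes, the NumPy sketch below illustrates a similarity-targeted pairwise gradient adjustment: an exponential moving average of each task pair's gradient cosine similarity serves as an alignment target, and whenever the current similarity falls below that target, one task's gradient is mixed with a scaled copy of the other's so that the target similarity is restored. The function names, the EMA rate beta, and the flattened-gradient interface are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def gradvac_adjust(g_i, g_j, phi_hat, eps=1e-12):
    """Nudge task i's gradient toward task j's when their cosine similarity
    falls below the running alignment target phi_hat for this task pair."""
    norm_i = np.linalg.norm(g_i)
    norm_j = np.linalg.norm(g_j)
    # Current cosine similarity between the two flattened task gradients.
    phi = float(np.dot(g_i, g_j)) / (norm_i * norm_j + eps)
    if phi >= phi_hat:
        # Already at least as aligned as the target: leave the gradient alone.
        return g_i
    # Mixing coefficient chosen so that cos(g_i + a * g_j, g_j) equals phi_hat.
    a = (norm_i * (phi_hat * np.sqrt(1.0 - phi ** 2)
                   - phi * np.sqrt(1.0 - phi_hat ** 2))
         / (norm_j * np.sqrt(1.0 - phi_hat ** 2) + eps))
    return g_i + a * g_j

def update_target(phi_hat, phi, beta=0.01):
    """Exponential moving average of the observed similarity, used as the
    per-task-pair alignment target on the next optimization step."""
    return (1.0 - beta) * phi_hat + beta * phi
```

In a multilingual setting, phi_hat would be maintained separately for every language pair, so closely related languages (whose gradients tend to stay similar) end up with higher alignment targets than distant ones.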

Related research

05/12/2022 · Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models
Massively Multilingual Transformer based Language Models have been obser...

01/19/2020 · Gradient Surgery for Multi-Task Learning
While deep learning and deep reinforcement learning (RL) systems have de...

04/09/2020 · Learning to Scale Multilingual Representations for Vision-Language Tasks
Current multilingual vision-language models either require a large numbe...

09/27/2016 · Multi-task Recurrent Model for True Multilingual Speech Recognition
Research on multilingual speech recognition remains attractive yet chall...

10/15/2021 · Tricks for Training Sparse Translation Models
Multi-task learning with an unbalanced data distribution skews model lea...

07/04/2017 · Multilingual Hierarchical Attention Networks for Document Classification
Hierarchical attention networks have recently achieved remarkable perfor...

06/20/2023 · GIO: Gradient Information Optimization for Training Dataset Selection
It is often advantageous to train models on a subset of the available tr...
