GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

06/30/2020
by Dmitry Lepikhin, et al.

Neural network scaling has been critical for improving model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path, such as computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to existing model code. GShard enabled us to scale up a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts layers beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can be trained efficiently on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.
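To make the Sparsely-Gated Mixture-of-Experts idea in the abstract concrete, the sketch below shows top-2 gating in JAX: a gating network scores all experts per token, keeps the two highest-scoring experts, and mixes their outputs with renormalized gate weights. This is an illustrative toy, not GShard's implementation; the function name top2_moe, the tensor shapes, and the dense all-experts computation are assumptions made for clarity, and the per-device dispatch that GShard's annotations and XLA extension automate is omitted.

```python
import jax
import jax.numpy as jnp

def top2_moe(tokens, gate_w, expert_w):
    """Toy top-2 sparsely gated MoE layer (dense emulation, no sharding).

    tokens:   [T, D]     token representations
    gate_w:   [D, E]     gating projection
    expert_w: [E, D, F]  one feed-forward weight matrix per expert
    returns:  [T, F]
    """
    num_experts = expert_w.shape[0]
    probs = jax.nn.softmax(tokens @ gate_w, axis=-1)           # [T, E]
    top2_vals, top2_idx = jax.lax.top_k(probs, k=2)            # [T, 2]
    # Renormalize the two selected gate values so they sum to 1 per token.
    gates = top2_vals / jnp.sum(top2_vals, axis=-1, keepdims=True)
    # Combine weights: zero everywhere except the two chosen experts.
    combine = jnp.sum(gates[:, :, None]
                      * jax.nn.one_hot(top2_idx, num_experts), axis=1)  # [T, E]
    # Dense emulation: compute every expert's output, then mix. A real MoE
    # layer dispatches each token to only its chosen experts (with a
    # per-expert capacity limit) so compute stays sub-linear in E.
    all_out = jnp.einsum('td,edf->tef', tokens, expert_w)      # [T, E, F]
    return jnp.einsum('te,tef->tf', combine, all_out)          # [T, F]

# Tiny usage example with random weights (shapes are illustrative).
k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
T, D, E, F = 8, 16, 4, 32
out = top2_moe(jax.random.normal(k1, (T, D)),
               jax.random.normal(k2, (D, E)),
               jax.random.normal(k3, (E, D, F)))
print(out.shape)  # (8, 32)
```

The point of the sparse gating is that each token only needs the output of its two selected experts, so a production implementation routes tokens across devices holding different experts instead of computing every expert densely as this toy version does; expressing and parallelizing that routing is what GShard's annotation APIs and XLA extension are for.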

Related research

Impact of Domain-Adapted Multilingual Neural Machine Translation in the Medical Domain (12/05/2022)
Multilingual Neural Machine Translation (MNMT) models leverage many lang...

Towards one-shot learning for rare-word translation with external experts (09/10/2018)
Neural machine translation (NMT) has significantly improved the quality ...

Constructing Multilingual Code Search Dataset Using Neural Machine Translation (06/27/2023)
Code search is a task to find programming codes that semantically match ...

Complete Multilingual Neural Machine Translation (10/20/2020)
Multilingual Neural Machine Translation (MNMT) models are commonly train...

Scaling Laws for Multilingual Neural Machine Translation (02/19/2023)
In this work, we provide a large-scale empirical study of the scaling pr...

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (01/23/2017)
The capacity of a neural network to absorb information is limited by its...

Scalable and Efficient MoE Training for Multitask Multilingual Models (09/22/2021)
The Mixture of Experts (MoE) models are an emerging class of sparsely ac...
