DeepAI AI Chat
Log In Sign Up

GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

06/30/2020
by   Dmitry Lepikhin, et al.
Google
7

Neural network scaling has been critical for improving the model quality in many real-world machine learning applications with vast amounts of training data and compute. Although this trend of scaling is affirmed to be a sure-fire approach for better model quality, there are challenges on the path such as the computation cost, ease of programming, and efficient implementation on parallel devices. GShard is a module composed of a set of lightweight annotation APIs and an extension to the XLA compiler. It provides an elegant way to express a wide range of parallel computation patterns with minimal changes to the existing model code. GShard enabled us to scale up multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts beyond 600 billion parameters using automatic sharding. We demonstrate that such a giant model can efficiently be trained on 2048 TPU v3 accelerators in 4 days to achieve far superior quality for translation from 100 languages to English compared to the prior art.

READ FULL TEXT

page 1

page 2

page 3

page 4

12/05/2022

Impact of Domain-Adapted Multilingual Neural Machine Translation in the Medical Domain

Multilingual Neural Machine Translation (MNMT) models leverage many lang...
09/10/2018

Towards one-shot learning for rare-word translation with external experts

Neural machine translation (NMT) has significantly improved the quality ...
06/27/2023

Constructing Multilingual Code Search Dataset Using Neural Machine Translation

Code search is a task to find programming codes that semantically match ...
10/20/2020

Complete Multilingual Neural Machine Translation

Multilingual Neural Machine Translation (MNMT) models are commonly train...
02/19/2023

Scaling Laws for Multilingual Neural Machine Translation

In this work, we provide a large-scale empirical study of the scaling pr...
01/23/2017

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

The capacity of a neural network to absorb information is limited by its...
09/22/2021

Scalable and Efficient MoE Training for Multitask Multilingual Models

The Mixture of Experts (MoE) models are an emerging class of sparsely ac...

Code Repositories

mixture-of-experts

A Pytorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models


view repo