Multi-node BERT Pretraining: Cost-efficient Approach

08/01/2020
by Jiahuang Lin, et al.

Recently, large-scale Transformer-based language models such as BERT, GPT-2, and XLNet have brought about exciting leaps in state-of-the-art results for many Natural Language Processing (NLP) tasks. A common trend in these recent models is a significant increase in model complexity, which introduces both more weights and more computation. Moreover, with the advent of large-scale unsupervised datasets, training time is further extended by the increased number of data samples within a single training epoch. As a result, to train these models within a reasonable time, machine learning (ML) practitioners often require advanced hardware setups such as premium GPU-enabled NVIDIA DGX workstations or specialized accelerators such as Google's TPU Pods. Our work addresses this limitation and demonstrates that BERT can be pre-trained within 2 weeks on an academic-size cluster of widely available GPUs through careful algorithmic and software optimizations. In this paper, we present these optimizations: how to improve single-device training throughput, distribute the training workload over multiple nodes and GPUs, and overcome the communication bottleneck introduced by the large data exchanges over the network. We show that we are able to pre-train BERT within a reasonable time budget (12 days) in an academic setting, with much less expensive and less aggressive hardware requirements than the previously demonstrated industrial settings based on NVIDIA DGX machines or Google's TPU Pods.
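The abstract names three levers: per-device throughput, data-parallel scaling across nodes and GPUs, and the cost of gradient exchange over the network. As a rough illustration only, not the paper's implementation, the sketch below combines PyTorch DistributedDataParallel with gradient accumulation, a common way to run one gradient all-reduce per optimizer step instead of per micro-batch; TinyEncoder, the random token batches, and every hyperparameter here are placeholder assumptions.

```python
# A minimal sketch, not the authors' released code: multi-node data-parallel
# training in PyTorch with gradient accumulation, one standard way to raise
# per-device throughput and reduce how often gradients cross the network.
# Assumptions: the script is launched with torchrun (one process per GPU),
# and TinyEncoder plus the random token batches stand in for BERT and a real
# masked-language-modeling pipeline.
import contextlib
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class TinyEncoder(torch.nn.Module):
    """Toy Transformer encoder used as a placeholder for BERT."""

    def __init__(self, vocab_size=30522, hidden=256):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, hidden)
        layer = torch.nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(hidden, vocab_size)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))


def main():
    # One process per GPU; torchrun provides RANK/WORLD_SIZE/LOCAL_RANK and the
    # rendezvous address, and NCCL performs the inter-node gradient all-reduces.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(TinyEncoder().cuda(), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    accum_steps = 8  # gradients are exchanged once per 8 micro-batches

    for step in range(100):
        optimizer.zero_grad(set_to_none=True)
        for micro in range(accum_steps):
            # Placeholder batch; a real run would stream tokenized corpus shards.
            tokens = torch.randint(0, 30522, (8, 128), device="cuda")
            # Suppress the gradient all-reduce on all but the last micro-batch
            # so communication happens only once per optimizer step.
            ctx = model.no_sync() if micro < accum_steps - 1 else contextlib.nullcontext()
            with ctx:
                logits = model(tokens)
                loss = loss_fn(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
                (loss / accum_steps).backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Under these assumptions, the script would be started on every node with something like `torchrun --nnodes=<N> --nproc_per_node=<GPUs per node> --rdzv_endpoint=<master-host:port> pretrain_sketch.py`; the accumulation factor trades activation memory for fewer synchronizations per step, which is the basic idea behind easing the network bottleneck described in the abstract.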


Related research

04/15/2021  How to Train BERT with an Academic Budget
10/14/2019  Q8BERT: Quantized 8Bit BERT
09/17/2019  Megatron-LM: Training Multi-Billion Parameter Language Models Using GPU Model Parallelism
04/08/2020  Poor Man's BERT: Smaller and Faster Transformer Models
08/17/2022  Boosting Distributed Training Performance of the Unpadded BERT Model
09/21/2019  Scale MLPerf-0.6 models on Google TPU-v3 Pods
11/07/2020  Exploring the limits of Concurrency in ML Training on Google TPUs
