Boosting Distributed Training Performance of the Unpadded BERT Model

08/17/2022
by Jinle Zeng, et al.

Pre-training models are an important tool in Natural Language Processing (NLP), and BERT is a classic pre-training model whose structure has been widely adopted by its successors; it was even chosen as the reference model for the MLPerf training benchmark. Optimizing the distributed training performance of BERT models therefore plays an important role in accelerating solutions to most NLP tasks. The BERT model typically takes padded tensors as inputs, which leads to excessive redundant computation, so removing these redundant computations is essential for improving distributed training performance. This paper presents a new approach to efficiently train BERT models with variable-length inputs. First, we propose a general structure for variable-length BERT models and accelerate the encoder layer via our grouped multi-stream FMHA (Fused Multi-Head Attention) method. Second, we address the unbalanced workload problem caused by variable-length inputs through data exchange, which largely overlaps with the training process. Finally, we optimize the overall performance of the BERT model through techniques such as kernel fusion and operator optimization. Our experimental results show that our highly optimized BERT model achieves state-of-the-art throughput and ranks first in MLPerf Training v2.0 within the same GPU configuration. The optimizations in this paper can be applied to more BERT-like models in future work.
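The paper's own kernels are not reproduced here, but the core "unpadded" idea in the abstract is to pack a padded batch into a flat stream of real tokens plus per-sequence offsets, so attention and other kernels never touch pad positions. Below is a minimal PyTorch-style sketch of that packing step under assumed names (`unpad_batch`, pad token id 0); it is an illustration of the general variable-length input format, not the authors' implementation.

```python
import torch


def unpad_batch(input_ids, attention_mask):
    """Pack a padded batch [B, S] into a flat token stream plus cumulative
    sequence lengths, so downstream variable-length kernels skip pad tokens."""
    seqlens = attention_mask.sum(dim=1)                            # [B] true lengths
    indices = attention_mask.flatten().nonzero(as_tuple=True)[0]   # positions of real tokens
    tokens = input_ids.flatten()[indices]                          # [total_tokens] packed tokens
    cu_seqlens = torch.nn.functional.pad(
        torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))   # [B+1] sequence offsets
    return tokens, indices, cu_seqlens, int(seqlens.max())


# Example: batch of 3 sequences padded to length 6 (pad id assumed to be 0)
input_ids = torch.tensor([[5, 8, 2, 0, 0, 0],
                          [7, 1, 4, 9, 3, 0],
                          [6, 2, 0, 0, 0, 0]])
attention_mask = (input_ids != 0).int()

tokens, indices, cu_seqlens, max_len = unpad_batch(input_ids, attention_mask)
print(tokens)       # only the 10 real tokens, no padding
print(cu_seqlens)   # tensor([ 0,  3,  8, 10], dtype=torch.int32)
```

The `(tokens, cu_seqlens)` pair is the typical input layout consumed by fused variable-length attention kernels, and `indices` can be used to scatter the results back into padded shape where a padded layout is still required.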
