Training Large Language Models Efficiently with Sparsity and Dataflow

04/11/2023
by Venkat Srinivasan, et al.

Large foundation language models have shown their versatility in being adaptable to a wide variety of downstream tasks, such as text generation, sentiment analysis, and semantic search. However, training such large foundation models is a non-trivial exercise that requires a significant amount of compute as well as machine learning and systems expertise. As models get larger, these demands only increase. Sparsity is a promising technique for relieving the compute requirements of training. However, sparsity introduces new challenges in training the sparse model to the same quality as its dense counterpart. Furthermore, sparsity lowers the operational intensity and introduces irregular memory access patterns that make it challenging to utilize compute resources efficiently. This paper demonstrates an end-to-end training flow on a large language model - a 13-billion-parameter GPT - using sparsity and dataflow. The dataflow execution model and architecture enable efficient on-chip irregular memory accesses as well as native kernel fusion and pipelined parallelism, which help recover device utilization. We show that we can successfully train GPT 13B to the same quality as the dense GPT 13B model, while achieving an end-to-end speedup of 4.5x over the dense A100 baseline.
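To make the weight-sparsity idea above concrete, the sketch below shows a generic magnitude-based sparse linear layer in PyTorch, where a fixed binary mask keeps pruned weights at zero throughout training. This is only an illustrative assumption of how sparse training can be set up in software; it is not the paper's specific sparsity pattern, its training recipe, or the dataflow execution model, which are tied to the authors' hardware. The class name, sparsity level, and layer sizes are hypothetical.

```python
# Minimal sketch of sparse weight training via magnitude-based masking.
# Illustrative only: SparseLinear, the 75% sparsity level, and the layer
# sizes are assumptions, not the paper's actual scheme.
import torch
import torch.nn as nn


class SparseLinear(nn.Module):
    """Linear layer whose weights are kept sparse by a fixed binary mask."""

    def __init__(self, in_features, out_features, sparsity=0.75):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # Magnitude-based mask: keep the largest (1 - sparsity) fraction of weights.
        with torch.no_grad():
            w = self.linear.weight.abs().flatten()
            k = max(1, int((1.0 - sparsity) * w.numel()))
            threshold = torch.topk(w, k, largest=True).values.min()
            mask = (self.linear.weight.abs() >= threshold).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Re-apply the mask on every forward pass so pruned weights stay zero
        # even though the dense weight tensor still receives gradient updates.
        return nn.functional.linear(
            x, self.linear.weight * self.mask, self.linear.bias
        )


# Toy usage: one training step of a tiny sparse MLP on random data.
model = nn.Sequential(SparseLinear(512, 2048), nn.GELU(), SparseLinear(2048, 512))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
opt.step()
```

Note that on a GPU this masked formulation still executes dense matrix multiplies; realizing the compute savings that the paper targets requires hardware and a compiler stack that can exploit the zeros, which is where the dataflow architecture discussed in the abstract comes in.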


