Data Movement Is All You Need: A Case Study on Optimizing Transformers

06/30/2020
by   Andrei Ivanov, et al.
45

Transformers have become widely used for language modeling and sequence learning tasks, and are one of the most important machine learning workloads today. Training one is a very compute-intensive task, often taking days or weeks, and significant attention has been given to optimizing transformers. Despite this, existing implementations do not efficiently utilize GPUs. We find that data movement is the key bottleneck when training. Due to Amdahl's Law and massive improvements in compute performance, training has now become memory-bound. Further, existing frameworks use suboptimal data layouts. Using these insights, we present a recipe for globally optimizing data movement in transformers. We reduce data movement by up to 22.91 1.30x performance improvement over state-of-the-art frameworks when training BERT. Our approach is applicable more broadly to optimizing deep neural networks, and offers insight into how to tackle emerging performance bottlenecks.

READ FULL TEXT

page 1

page 3

page 8

page 9

research
10/05/2021

Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers

Recent studies have demonstrated that the performance of transformers on...
research
11/23/2022

TorchScale: Transformers at Scale

Large Transformers have achieved state-of-the-art performance across man...
research
05/17/2021

Pay Attention to MLPs

Transformers have become one of the most important architectural innovat...
research
04/18/2022

Communication Bounds for Convolutional Neural Networks

Convolutional neural networks (CNNs) are important in a wide variety of ...
research
01/21/2021

Clairvoyant Prefetching for Distributed Machine Learning I/O

I/O is emerging as a major bottleneck for machine learning training, esp...
research
06/21/2023

Iterated Piecewise Affine (IPA) Approximation for Language Modeling

In this work, we demonstrate the application of a simple first-order Tay...
research
06/01/2022

Transformer with Fourier Integral Attentions

Multi-head attention empowers the recent success of transformers, the st...

Please sign up or login with your details

Forgot password? Click here to reset