DFX: A Low-latency Multi-FPGA Appliance for Accelerating Transformer-based Text Generation

09/22/2022
by Seongmin Hong, et al.

The Transformer is a deep learning language model widely used for natural language processing (NLP) services in datacenters. Among transformer models, the Generative Pre-trained Transformer (GPT) has achieved remarkable performance in text generation, or natural language generation (NLG), which requires processing a large input context in the summarization stage, followed by a generation stage that produces a single word (token) at a time. Conventional platforms such as GPUs are specialized for the parallel processing of large inputs in the summarization stage, but their performance degrades significantly in the generation stage, which is inherently sequential. An efficient hardware platform is therefore required to address the high latency caused by the sequential nature of text generation. In this paper, we present DFX, a multi-FPGA acceleration appliance that executes GPT-2 model inference end-to-end with low latency and high throughput in both the summarization and generation stages. DFX uses model parallelism and an optimized, model-and-hardware-aware dataflow for fast, simultaneous workload execution across devices. Its compute cores operate on custom instructions and support GPT-2 operations end-to-end. We implement the proposed hardware architecture on four Xilinx Alveo U280 FPGAs and utilize all channels of the high-bandwidth memory (HBM) and the maximum number of compute resources for high hardware efficiency. DFX achieves a 5.58x speedup and 3.99x higher energy efficiency over four NVIDIA V100 GPUs on the modern GPT-2 model. DFX is also 8.21x more cost-effective than the GPU appliance, suggesting that it is a promising solution for text-generation workloads in cloud datacenters.
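The difference between the two stages is easiest to see in code. Below is a minimal, illustrative sketch (not DFX's implementation, and not the paper's notation) of autoregressive inference with a key/value cache: the summarization (prefill) stage can in principle process the whole input context in parallel, while the generation stage must emit one token at a time because each step depends on the previous output. The toy single-head attention layer, the dimensions, and names such as `toy_attention` are assumptions made for illustration.

```python
# Minimal sketch of the two GPT inference stages described in the abstract.
# All names and sizes here are illustrative assumptions, not DFX internals.
import numpy as np

D = 64                                  # toy hidden dimension
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))

def toy_attention(x, k_cache, v_cache):
    """One attention step for a single token; extends the key/value cache."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = (K @ q) / np.sqrt(D)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V

# Summarization (prefill) stage: every input token is already known, so this
# loop could be executed with wide parallelism on a GPU-like platform.
context = rng.standard_normal((128, D))  # 128 input-context tokens
k_cache, v_cache = [], []
for token in context:
    out = toy_attention(token, k_cache, v_cache)

# Generation stage: strictly sequential. Each step consumes the previous
# step's output (standing in for the full LM head + sampling + embedding of a
# real model), so steps cannot be batched across time; this is the
# latency-bound portion that DFX targets.
for _ in range(32):                      # generate 32 tokens
    out = toy_attention(out, k_cache, v_cache)
```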

Related research

- FTRANS: Energy-Efficient Acceleration of Transformers using FPGA (07/16/2020)
  In natural language processing (NLP), the "Transformer" architecture was...
- Big Little Transformer Decoder (02/15/2023)
  The recent emergence of Large Language Models based on the Transformer a...
- NPE: An FPGA-based Overlay Processor for Natural Language Processing (04/13/2021)
  In recent years, transformer-based models have shown state-of-the-art re...
- MERGE: Fast Private Text Generation (05/25/2023)
  Recent years have seen increasing concerns about the private inference o...
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs (05/03/2023)
  Large language models (LLMs) power many state-of-the-art systems in natu...
- MELOPPR: Software/Hardware Co-design for Memory-efficient Low-latency Personalized PageRank (04/13/2021)
  Personalized PageRank (PPR) is a graph algorithm that evaluates the impo...
- Paranoid Transformer: Reading Narrative of Madness as Computational Approach to Creativity (07/13/2020)
  This paper revisits the receptive theory in the context of computational cr...
