FTRANS: Energy-Efficient Acceleration of Transformers using FPGA

07/16/2020
by Bingbing Li, et al.

In natural language processing (NLP), the Transformer architecture was proposed as the first transduction model relying entirely on self-attention mechanisms, without using sequence-aligned recurrent neural networks (RNNs) or convolution, and it achieved significant improvements on sequence-to-sequence tasks. However, the intensive computation and storage demanded by these pre-trained language representations have impeded their adoption on computation- and memory-constrained devices. The field-programmable gate array (FPGA) is widely used to accelerate deep learning algorithms for its high parallelism and low latency, yet the trained models are still too large to fit on an FPGA fabric. In this paper, we propose Ftrans, an efficient acceleration framework for transformer-based large-scale language representations. Our framework includes an enhanced block-circulant matrix (BCM)-based weight representation that enables model compression of large-scale language representations at the algorithm level with little accuracy degradation, together with an acceleration design at the architecture level. Experimental results show that our proposed framework reduces the model size of NLP models by up to 16x. Our FPGA design achieves 27.07x and 81x improvements in performance and energy efficiency, respectively, compared to a CPU, and up to 8.80x improvement in energy efficiency compared to a GPU.
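The idea behind BCM-based compression is that each b x b block of a weight matrix is constrained to be circulant, so it is stored as a single length-b index vector and each block-level matrix-vector product reduces to a circular convolution computable with FFTs. The following is a minimal NumPy sketch of that idea, assuming a (p, q, b) array of block index vectors; the function names (circulant, bcm_matvec) and shapes are illustrative assumptions, not the paper's FPGA implementation.

import numpy as np

def circulant(c):
    # Dense b x b circulant matrix defined by its first column c (reference only).
    n = len(c)
    return np.array([[c[(i - j) % n] for j in range(n)] for i in range(n)])

def bcm_matvec(index_vecs, x):
    # index_vecs: (p, q, b) array holding the defining vector of each b x b
    # circulant block, so the full (p*b) x (q*b) matrix needs only p*q*b values
    # (a b-fold reduction in storage). x: input vector of length q*b.
    p, q, b = index_vecs.shape
    x_f = np.fft.fft(x.reshape(q, b), axis=1)   # FFT of each input segment
    w_f = np.fft.fft(index_vecs, axis=2)        # FFT of each block's defining vector
    # Circulant block times segment = circular convolution = IFFT(FFT(w) * FFT(x)),
    # summed over the q column blocks.
    y_f = (w_f * x_f[np.newaxis, :, :]).sum(axis=1)
    return np.fft.ifft(y_f, axis=1).real.reshape(p * b)

# Quick check of the compressed product against a dense reference matrix.
rng = np.random.default_rng(0)
p, q, b = 2, 3, 4
idx = rng.standard_normal((p, q, b))
x = rng.standard_normal(q * b)
dense = np.block([[circulant(idx[i, j]) for j in range(q)] for i in range(p)])
assert np.allclose(dense @ x, bcm_matvec(idx, x))

In hardware, this structure lets each block multiply be realized as a length-b FFT, an element-wise product, and an IFFT, rather than a full b x b multiply-accumulate array, which is what makes the representation attractive for an FPGA accelerator.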


