X-Former: In-Memory Acceleration of Transformers

03/13/2023
by Shrihari Sridharan, et al.

Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism, which assigns an importance score to every word relative to the other words in a sequence. However, these models are very large, often reaching hundreds of billions of parameters, and therefore require a large number of DRAM accesses. Hence, traditional deep neural network (DNN) accelerators such as GPUs and TPUs face limitations in processing Transformers efficiently. In-memory accelerators based on non-volatile memory (NVM) promise to be an effective solution to this challenge, since they provide high storage density while performing massively parallel matrix-vector multiplications (MVMs) within the memory arrays. However, attention score computations, which occur frequently in Transformers (unlike in CNNs and RNNs), require MVMs in which both operands change dynamically for each input. As a result, conventional NVM-based accelerators incur high write latency and write energy when used for Transformers, and further suffer from the low endurance of most NVM technologies. To address these challenges, we present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements to execute Transformer workloads efficiently. To improve the hardware utilization of X-Former, we also propose a sequence-blocking dataflow, which overlaps the computations of the two processing elements and reduces execution time. Across several benchmarks, we show that X-Former achieves up to 85x and 7.5x improvements in latency and energy over an NVIDIA GeForce GTX 1060 GPU, and up to 10.7x and 4.6x improvements in latency and energy over a state-of-the-art in-memory NVM accelerator.
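To illustrate why attention is problematic for weight-stationary NVM crossbars, the following minimal NumPy sketch (not from the paper; toy dimensions and variable names are assumptions) contrasts the two kinds of MVMs in self-attention: the Q/K/V projections use static weights that could stay resident in NVM, whereas the score (Q K^T) and weighting (P V) products have both operands derived from the current input, which would force crossbar rewrites for every sequence.

```python
# Minimal sketch of single-head self-attention, highlighting which MVMs
# have static weights and which have input-dependent operands.
import numpy as np

seq_len, d_model = 8, 16          # assumed toy dimensions
rng = np.random.default_rng(0)

# Static projection weights: these could stay resident in NVM crossbars.
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

def attention(x):
    """Single-head self-attention over one input sequence x."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v      # weight-static MVMs
    scores = Q @ K.T / np.sqrt(d_model)      # both operands input-dependent
    P = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)       # softmax over keys
    return P @ V                             # again, both operands dynamic

x = rng.standard_normal((seq_len, d_model))
print(attention(x).shape)  # (8, 16)
```

The split visible in this sketch motivates the hybrid design: the weight-static projections map naturally to NVM processing elements, while the input-dependent score computations are better served by CMOS processing elements.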


Related research:

TRON: Transformer Neural Network Acceleration with Non-Coherent Silicon Photonics (03/22/2023)
DOTA: A Dynamically-Operated Photonic Tensor Core for Energy-Efficient Transformer Accelerator (05/31/2023)
A^3: Accelerating Attention Mechanisms in Neural Networks with Approximation (02/22/2020)
AccelTran: A Sparsity-Aware Accelerator for Dynamic Inference with Transformers (02/28/2023)
Row-wise Accelerator for Vision Transformer (05/09/2022)
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design (10/18/2022)
Hardware-aware training for large-scale and diverse deep learning inference workloads using in-memory computing-based accelerators (02/16/2023)
