Answer Fast: Accelerating BERT on the Tensor Streaming Processor

06/22/2022
by   Ibrahim Ahmed, et al.

Transformers have become a predominant machine learning workload: they are not only the de facto standard for natural language processing tasks, but they are also being deployed in other domains such as vision and speech recognition. Many transformer-based applications are real-time systems, such as machine translation and web search, and these often come with strict end-to-end inference latency requirements. Unfortunately, while the majority of transformer computation comes from matrix multiplications, transformers also include several nonlinear components that tend to become the bottleneck during inference. In this work, we accelerate the inference of BERT models on the Tensor Streaming Processor. By carefully fusing all the nonlinear components with the matrix multiplication components, we are able to efficiently utilize the on-chip matrix multiplication units, resulting in a deterministic tail latency of 130 μs for a batch-1 inference through BERT-base, which is 6x faster than the current state of the art.
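The fusion idea described in the abstract can be illustrated with a small sketch. This is not Groq's implementation (the paper targets the Tensor Streaming Processor's on-chip matmul units); it is a hypothetical NumPy example of a BERT feed-forward layer, showing the difference between running the GELU nonlinearity as its own full pass over a materialized matmul result versus applying it tile-by-tile in the epilogue of the first matmul, so the activation never becomes a separate bottleneck step. All function and parameter names here are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, the nonlinearity in BERT's feed-forward layers
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn_unfused(x, w1, b1, w2, b2):
    # Two separate stages: the first matmul result is fully materialized,
    # then the nonlinearity runs as its own pass (the kind of standalone
    # nonlinear step the paper identifies as a latency bottleneck).
    h = x @ w1 + b1
    h = gelu(h)
    return h @ w2 + b2

def ffn_fused(x, w1, b1, w2, b2, tile=64):
    # Illustrative fusion: GELU is applied to each output tile immediately
    # after that tile of the matmul is produced, so the activation is
    # absorbed into the matmul's epilogue rather than a separate pass.
    out_h = np.empty((x.shape[0], w1.shape[1]))
    for j in range(0, w1.shape[1], tile):
        out_h[:, j:j + tile] = gelu(x @ w1[:, j:j + tile] + b1[j:j + tile])
    return out_h @ w2 + b2
```

Both variants compute the same result; the fused form simply restructures where the nonlinearity runs, which is the kind of scheduling freedom a deterministic dataflow architecture can exploit.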


