Optimizing Inference Performance of Transformers on CPUs

02/12/2021
by Dave Dice, et al.

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformer-based models (e.g., BERT) power many important Web services, such as search, translation, and question answering. While enormous research attention is paid to the training of these models, relatively little effort has been made to improve their inference performance. This paper addresses that gap by presenting an empirical analysis of the scalability and performance of inference with a Transformer-based model on CPUs. Focusing on the highly popular BERT model, we identify key components of the Transformer architecture where the bulk of the computation happens, and propose three optimizations to speed them up. The optimizations are evaluated using the inference benchmark from HuggingFace, and are shown to achieve a speedup of up to 2.37x. The considered optimizations require no changes to the implementation of the models and do not affect their accuracy.
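
The paper's own benchmark harness is not reproduced here, but the following minimal sketch shows how BERT inference latency on a CPU is typically measured with the HuggingFace transformers and PyTorch packages, which the abstract names. The model choice (bert-base-uncased), batch size, sequence length, and thread count below are illustrative assumptions, not the paper's exact configuration.

    # Minimal sketch: measuring BERT inference latency on CPU.
    # Assumes the `transformers` and `torch` packages are installed.
    # Model, batch size, sequence length, and thread count are
    # illustrative, not the paper's benchmark configuration.
    import time

    import torch
    from transformers import BertModel, BertTokenizer

    # Thread count is a key knob when studying CPU inference scalability.
    torch.set_num_threads(4)

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()

    # A fixed-shape batch: 8 sequences padded to length 128.
    inputs = tokenizer(
        ["a sample input sentence"] * 8,
        padding="max_length",
        max_length=128,
        truncation=True,
        return_tensors="pt",
    )

    with torch.no_grad():
        for _ in range(10):       # warm-up iterations
            model(**inputs)
        start = time.perf_counter()
        for _ in range(100):      # timed iterations
            model(**inputs)
        elapsed = (time.perf_counter() - start) / 100

    print(f"mean latency per batch: {elapsed * 1000:.1f} ms")

The explicit thread-count setting is included because the paper's analysis centers on how inference performance scales on CPUs; varying it (and the batch and sequence sizes) is the usual way to reproduce such scalability measurements.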
