EcoFormer: Energy-Saving Attention with Linear Complexity

09/19/2022
by Jing Liu, et al.

Transformer is a transformative framework that models sequential data and has achieved remarkable performance on a wide range of tasks, but with high computational and energy cost. To improve its efficiency, a popular choice is to compress the models via binarization, which constrains the floating-point values to binary ones to significantly reduce resource consumption owing to cheap bitwise operations. However, existing binarization methods only aim at minimizing the information loss for the input distribution statistically, while ignoring the pairwise similarity modeling at the core of the attention. To this end, we propose a new binarization paradigm customized to high-dimensional softmax attention via kernelized hashing, called EcoFormer, to map the original queries and keys into low-dimensional binary codes in Hamming space. The kernelized hash functions are learned to match the ground-truth similarity relations extracted from the attention map in a self-supervised way. Based on the equivalence between the inner product of binary codes and the Hamming distance, as well as the associative property of matrix multiplication, we can approximate the attention in linear complexity by expressing it as a dot-product of binary codes. Moreover, the compact binary representations of queries and keys enable us to replace most of the expensive multiply-accumulate operations in attention with simple accumulations, saving a considerable on-chip energy footprint on edge devices. Extensive experiments on both vision and language tasks show that EcoFormer consistently achieves performance comparable to standard attention while consuming much fewer resources. For example, based on PVTv2-B0 and ImageNet-1K, EcoFormer achieves a 73% energy footprint reduction with only a 0.33% performance drop compared with the standard attention. Code is available at https://github.com/ziplab/EcoFormer.
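To make the linear-complexity idea concrete, below is a minimal sketch of hashing-based attention under stated assumptions: it uses random-projection sign hashing as a stand-in for EcoFormer's learned kernelized hash functions, and the function names (`binary_hash`, `eco_style_linear_attention`) and the projection matrix are hypothetical, not taken from the official repository. It illustrates how the inner product of b-bit codes in {-1, +1}^b relates to the Hamming distance (h_q . h_k = b - 2D) and how the associative property of matrix multiplication avoids ever forming the n x n attention map.

```python
import torch

def binary_hash(x, projections):
    # Stand-in for the learned kernelized hash functions: random
    # projections followed by sign(), giving b-bit codes in {-1, +1}^b.
    return torch.sign(x @ projections)  # (n, b)

def eco_style_linear_attention(q, k, v, num_bits=16):
    """Sketch of hashing-based linear attention (not the official EcoFormer code).

    q, k: (n, d) queries and keys; v: (n, d_v) values.
    The code inner product h_q . h_k = b - 2 * D (D = Hamming distance)
    is rescaled to a non-negative affinity (h_q . h_k + b) / (2b) = 1 - D / b.
    """
    n, d = q.shape
    # Hypothetical shared random projection matrix; the paper instead learns
    # the hash functions in a self-supervised way from the attention map.
    proj = torch.randn(d, num_bits)
    hq = binary_hash(q, proj)                  # (n, b)
    hk = binary_hash(k, proj)                  # (n, b)

    # By associativity, A @ v with A_ij = (hq_i . hk_j + b) / (2b) becomes
    #   (hq @ (hk^T @ v) + b * sum_j v_j) / (2b),  which is O(n * b * d_v).
    kv = hk.t() @ v                            # (b, d_v)
    num = (hq @ kv + num_bits * v.sum(dim=0)) / (2 * num_bits)   # (n, d_v)

    # Matching normaliser: row sums of the affinity matrix, also in O(n * b).
    k_sum = hk.sum(dim=0)                      # (b,)
    den = (hq @ k_sum + num_bits * n) / (2 * num_bits)           # (n,)
    return num / den.unsqueeze(-1)
```

In a binarized deployment, the products against hq and hk reduce to sign flips and additions, which is where the claimed savings over full multiply-accumulate attention come from; this float sketch only shows the complexity argument.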


