Fast Attention Requires Bounded Entries

02/26/2023
by Josh Alman, et al.

In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3, and ChatGPT. Formally, in this problem, one is given as input three matrices Q, K, V ∈ [-B, B]^{n × d}, and the goal is to construct the matrix

Att(Q, K, V) := diag(A 1_n)^{-1} A V ∈ ℝ^{n × d},

where A = exp(QK^⊤/d) is the 'attention matrix' and exp is applied entry-wise. Straightforward methods for this problem explicitly compute the n × n attention matrix A, and hence require Ω(n^2) time even when d = n^{o(1)} is small. In this paper, we investigate whether faster algorithms are possible by making use of the matrix A only implicitly. We present two results, showing that there is a sharp transition at B = Θ(√(log n)):

∙ If d = O(log n) and B = o(√(log n)), there is an n^{1 + o(1)}-time algorithm to approximate Att(Q, K, V) up to 1/poly(n) additive error.

∙ If d = O(log n) and B = Θ(√(log n)), then, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate Att(Q, K, V) up to 1/poly(n) additive error in truly subquadratic time n^{2 - Ω(1)}.

This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
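To make the object concrete, here is a minimal NumPy sketch of the straightforward quadratic-time baseline. The function name attention_exact and the toy setup are illustrative choices, not from the paper; the code simply materializes the n × n matrix A, which is exactly the Ω(n^2)-time step the lower bound concerns.

```python
import numpy as np

def attention_exact(Q, K, V):
    """Naive computation of Att(Q, K, V) = diag(A 1_n)^{-1} A V, where
    A = exp(Q K^T / d) is applied entry-wise.  Forming the n x n matrix A
    makes this Omega(n^2) time even when d is small."""
    d = Q.shape[1]
    A = np.exp(Q @ K.T / d)            # n x n attention matrix
    row_sums = A.sum(axis=1)           # A 1_n
    return (A / row_sums[:, None]) @ V # row-normalize, then multiply by V
```

The subquadratic algorithm in the small-B regime uses A only implicitly. The abstract does not spell out the construction, but one standard way to realize the idea, sketched below purely as an illustration and not as the paper's exact method, is to replace exp with a truncated Taylor polynomial: each entry exp(⟨q_i, k_j⟩/d) becomes a low-degree polynomial in the inner product, which factors through explicit feature maps, so A is approximated by a low-rank product Φ_Q Φ_K^⊤, and both A V and A 1_n can be formed in time linear in n (for fixed d and degree). The helper names below are hypothetical.

```python
import math
import numpy as np

def taylor_features(X, degree):
    """Row-wise feature map phi such that <phi(q), phi(k)> equals the
    degree-`degree` Taylor truncation of exp(<q, k> / d)."""
    n, d = X.shape
    feats = [np.ones((n, 1))]                      # k = 0 term of the series
    power = np.ones((n, 1))
    for k in range(1, degree + 1):
        # Row-wise Kronecker product: power[i] becomes X[i]^{tensor k}.
        power = (power[:, :, None] * X[:, None, :]).reshape(n, -1)
        feats.append(power / math.sqrt(math.factorial(k) * d ** k))
    return np.hstack(feats)

def attention_lowrank(Q, K, V, degree=4):
    """Approximate Att(Q, K, V) without materializing A: since A is close
    to Phi_Q @ Phi_K.T, both A @ V and A @ 1_n reduce to products with the
    thin factors, costing time linear in n for fixed d and degree."""
    PQ, PK = taylor_features(Q, degree), taylor_features(K, degree)
    numerator = PQ @ (PK.T @ V)        # approximates A V
    denominator = PQ @ PK.sum(axis=0)  # approximates A 1_n
    return numerator / denominator[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, B = 512, 8, 0.5              # small entries: the "easy" regime
    Q, K, V = (rng.uniform(-B, B, size=(n, d)) for _ in range(3))
    A = np.exp(Q @ K.T / d)            # dense reference for comparison
    exact = (A / A.sum(axis=1)[:, None]) @ V
    err = np.abs(exact - attention_lowrank(Q, K, V)).max()
    print(f"max additive error: {err:.2e}")
```

Note how the feature dimension grows like d^degree: with small entries a modest degree already achieves tiny additive error, while larger B forces the degree (and hence the running time) up, in line with the sharp transition at B = Θ(√(log n)).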


