Fast Attention Requires Bounded Entries

by Josh Alman, et al.

In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices Q, K, V ∈ [-B,B]^{n × d}, and the goal is to construct the matrix Att(Q,K,V) := diag(A 1_n)^{-1} A V ∈ ℝ^{n × d}, where A = exp(QK^⊤/d) is the 'attention matrix', and exp is applied entry-wise. Straightforward methods for this problem explicitly compute the n × n attention matrix A, and hence require time Ω(n^2) even when d = n^{o(1)} is small.

In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix A. We present two results, showing that there is a sharp transition at B = Θ(√(log n)).

∙ If d = O(log n) and B = o(√(log n)), there is an n^{1+o(1)} time algorithm to approximate Att(Q,K,V) up to 1/poly(n) additive error.
∙ If d = O(log n) and B = Θ(√(log n)), assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate Att(Q,K,V) up to 1/poly(n) additive error in truly subquadratic time n^{2 - Ω(1)}.

This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
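To make the definition concrete, here is a minimal sketch of the straightforward quadratic-time method the abstract refers to. It is not the paper's fast algorithm. The function name `naive_attention` and the list-of-lists matrix representation are assumptions made for illustration.

```python
import math


def naive_attention(Q, K, V):
    """Return diag(A 1_n)^{-1} A V, where A = exp(Q K^T / d) entry-wise.

    Q, K, V are n x d matrices given as lists of rows (an illustrative
    choice). Every entry of the n x n matrix A is computed explicitly,
    so the running time is quadratic in n.
    """
    n, d = len(Q), len(Q[0])
    cols = len(V[0])
    out = []
    for i in range(n):
        # Row i of the attention matrix: A[i][j] = exp(<Q_i, K_j> / d).
        row = [
            math.exp(sum(Q[i][t] * K[j][t] for t in range(d)) / d)
            for j in range(n)
        ]
        row_sum = sum(row)  # i-th entry of the vector A 1_n
        # Row i of the output: the normalized row of A times V.
        out.append(
            [sum(row[j] * V[j][c] for j in range(n)) / row_sum for c in range(cols)]
        )
    return out
```

Each output row is a convex combination of the rows of V, with weights given by the normalized row of A. In particular, when Q is all zeros, every entry of A equals 1 and each output row is the column-wise average of V.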




