A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

by Yeqi Gao, et al.

Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily existence. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function

$$L(X,Y) = \sum_{j_0=1}^{n} \sum_{i_0=1}^{d} \left( \left\langle \langle \exp(\mathsf{A}_{j_0} x), \mathbf{1}_n \rangle^{-1} \exp(\mathsf{A}_{j_0} x),\ A_3 Y_{*,i_0} \right\rangle - b_{j_0,i_0} \right)^2 .$$

Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn. $B \in \mathbb{R}^{n \times d}$, and $b_{j_0,i_0} \in \mathbb{R}$ is the entry in the $j_0$-th row and $i_0$-th column of $B$; $Y_{*,i_0} \in \mathbb{R}^{d}$ is the $i_0$-th column of $Y$, and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer, and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ can be viewed as the input of a layer. The matrix version of $x$ can be viewed as $QK^\top$, and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to $\epsilon$ and runs in $O\big( (\mathcal{T}_{\mathrm{mat}}(n,n,d) + \mathcal{T}_{\mathrm{mat}}(n,d,d) + d^{2\omega}) \log(1/\epsilon) \big)$ time. Here $\mathcal{T}_{\mathrm{mat}}(a,b,c)$ denotes the time to multiply an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
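To make the objective concrete, the following is a minimal sketch that evaluates L(X,Y) in two ways: directly from the Kronecker-product definition, and through the equivalent matrix form in which 𝖠_j_0 x is the j_0-th row of A_1 X A_2^⊤. It only evaluates the loss; it is not the training algorithm from the paper. The function names (`loss_kron`, `loss_matrix`, and the helpers) and the toy data are hypothetical, and the sketch assumes a row-major vectorization of X so that the two forms line up.

```python
import math

def matmul(P, Q):
    """Multiply two dense matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(col) for col in zip(*P)]

def kron(P, Q):
    """Kronecker product: entry ((i,k),(j,l)) equals P[i][j] * Q[k][l]."""
    return [[p * q for p in prow for q in qrow] for prow in P for qrow in Q]

def softmax(u):
    """<exp(u), 1>^{-1} exp(u); subtracting max(u) leaves the value unchanged."""
    m = max(u)
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    return [v / s for v in e]

def loss_kron(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) from the definition, using blocks of A = A1 (x) A2."""
    n, d = len(A1), len(A1[0])
    A = kron(A1, A2)                     # n^2 x d^2
    x = [v for row in X for v in row]    # row-major vec(X) (an assumption)
    A3Y = matmul(A3, Y)                  # n x d, column i0 is A3 Y_{*,i0}
    total = 0.0
    for j0 in range(n):
        block = A[j0 * n:(j0 + 1) * n]   # j0-th block of A, n x d^2
        u = [sum(a * b for a, b in zip(r, x)) for r in block]
        f = softmax(u)
        for i0 in range(d):
            pred = sum(f[k] * A3Y[k][i0] for k in range(n))
            total += (pred - B[j0][i0]) ** 2
    return total

def loss_matrix(A1, A2, A3, B, X, Y):
    """Same loss as a squared Frobenius norm: row-softmax(A1 X A2^T) A3 Y - B."""
    S = matmul(matmul(A1, X), transpose(A2))   # n x n attention scores
    F = [softmax(row) for row in S]
    P = matmul(F, matmul(A3, Y))
    return sum((P[j][i] - B[j][i]) ** 2
               for j in range(len(B)) for i in range(len(B[0])))

# Toy instance with n = 3, d = 2 and A1 = A2 = A3, as in the layer-input view.
A_in = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
X0 = [[1.0, 0.5], [-0.5, 2.0]]
Y0 = [[1.0, 0.0], [0.0, 1.0]]
B0 = [[0.0, 0.0] for _ in range(3)]
print(abs(loss_kron(A_in, A_in, A_in, B0, X0, Y0)
          - loss_matrix(A_in, A_in, A_in, B0, X0, Y0)))
```

The Kronecker form builds an n^2 × d^2 matrix and is only practical for tiny inputs; the matrix form needs just an n × n × d and an n × d × d style product, which is the kind of cost that appears in the stated running time.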


