A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

09/14/2023
by Yeqi Gao, et al.

Large language models (LLMs) have played a pivotal role in revolutionizing many facets of our daily lives. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function $L(X,Y) = \sum_{j_0=1}^{n} \sum_{i_0=1}^{d} \big( \langle \langle \exp( \mathsf{A}_{j_0} x ), \mathbf{1}_n \rangle^{-1} \exp( \mathsf{A}_{j_0} x ), A_3 Y_{*,i_0} \rangle - b_{j_0,i_0} \big)^2$. Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn; $B \in \mathbb{R}^{n \times d}$, and $b_{j_0,i_0} \in \mathbb{R}$ is the entry of $B$ in the $j_0$-th row and $i_0$-th column; $Y_{*,i_0} \in \mathbb{R}^{d}$ is the $i_0$-th column of $Y$; and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ as the input of a layer; the matrix version of $x$ can be viewed as $QK^\top$ and $Y$ can be viewed as $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to $\epsilon$ accuracy and runs in $O\big( ( \mathcal{T}_{\mathrm{mat}}(n,n,d) + \mathcal{T}_{\mathrm{mat}}(n,d,d) + d^{2\omega} ) \log(1/\epsilon) \big)$ time. Here $\mathcal{T}_{\mathrm{mat}}(a,b,c)$ denotes the time of multiplying an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.

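To make the objective concrete, below is a minimal NumPy sketch that evaluates $L(X,Y)$ via the equivalent matrix form $\| \mathrm{softmax}_{\mathrm{rows}}(A_1 X A_2^\top) A_3 Y - B \|_F^2$, using the fact that $\mathsf{A}_{j_0} x$ is the $j_0$-th row of $A_1 X A_2^\top$ under the vectorization of $X$. The function name, the random toy data, and the max-shift used for numerical stability are illustrative assumptions, not part of the paper's algorithm.

```python
import numpy as np

def attention_loss(A1, A2, A3, B, X, Y):
    """Evaluate L(X, Y) = || softmax_rows(A1 @ X @ A2.T) @ A3 @ Y - B ||_F^2.

    Row j0 of A1 @ X @ A2.T plays the role of A_{j0} x, where
    A = A1 (Kronecker) A2 and x = vec(X), so this matrix form matches
    the entrywise objective stated in the abstract.
    """
    S = A1 @ X @ A2.T                              # n x n attention logits
    E = np.exp(S - S.max(axis=1, keepdims=True))   # shift for stability; softmax unchanged
    P = E / E.sum(axis=1, keepdims=True)           # row-wise softmax, n x n
    pred = P @ A3 @ Y                              # n x d prediction
    return np.sum((pred - B) ** 2)                 # sum of squared residuals

# Toy usage with random data (illustrative only): A1 = A2 = A3 = A,
# matching the LLM view where the layer input is shared.
n, d = 8, 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, d))
X = rng.standard_normal((d, d))
Y = rng.standard_normal((d, d))
print(attention_loss(A, A, A, B, X, Y))
```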