A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time

09/14/2023
by Yeqi Gao, et al.

Large language models (LLMs) have played a pivotal role in revolutionizing various facets of our daily lives. Solving attention regression is a fundamental task in optimizing LLMs. In this work, we focus on giving a provable guarantee for the one-layer attention network objective function

$L(X,Y) = \sum_{j_0=1}^{n} \sum_{i_0=1}^{d} \big( \langle \langle \exp(\mathsf{A}_{j_0} x), \mathbf{1}_n \rangle^{-1} \exp(\mathsf{A}_{j_0} x),\; A_3 Y_{*,i_0} \rangle - b_{j_0,i_0} \big)^2.$

Here $\mathsf{A} \in \mathbb{R}^{n^2 \times d^2}$ is the Kronecker product of $A_1 \in \mathbb{R}^{n \times d}$ and $A_2 \in \mathbb{R}^{n \times d}$, $A_3$ is a matrix in $\mathbb{R}^{n \times d}$, and $\mathsf{A}_{j_0} \in \mathbb{R}^{n \times d^2}$ is the $j_0$-th block of $\mathsf{A}$. The matrices $X, Y \in \mathbb{R}^{d \times d}$ are the variables we want to learn; $B \in \mathbb{R}^{n \times d}$, with $b_{j_0,i_0} \in \mathbb{R}$ denoting its entry in the $j_0$-th row and $i_0$-th column; $Y_{*,i_0} \in \mathbb{R}^{d}$ is the $i_0$-th column of $Y$; and $x \in \mathbb{R}^{d^2}$ is the vectorization of $X$. In a multi-layer LLM network, the matrix $B \in \mathbb{R}^{n \times d}$ can be viewed as the output of a layer and $A_1 = A_2 = A_3 \in \mathbb{R}^{n \times d}$ as the input of that layer; the matrix version of $x$ plays the role of $QK^\top$ and $Y$ plays the role of $V$. We provide an iterative greedy algorithm that trains the loss function $L(X,Y)$ up to error $\epsilon$ and runs in $O\big( ( \mathcal{T}_{\mathrm{mat}}(n,n,d) + \mathcal{T}_{\mathrm{mat}}(n,d,d) + d^{2\omega} ) \log(1/\epsilon) \big)$ time. Here $\mathcal{T}_{\mathrm{mat}}(a,b,c)$ denotes the time to multiply an $a \times b$ matrix by a $b \times c$ matrix, and $\omega \approx 2.37$ denotes the exponent of matrix multiplication.
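For intuition, the objective can be read as a softmax-style attention regression: each term normalizes $\exp(\mathsf{A}_{j_0} x)$ by its sum, multiplies by $A_3 Y$, and compares against $B$. The following minimal numpy sketch (not from the paper; the function name `attention_loss` and the row-major vectorization convention are illustrative assumptions) evaluates $L(X,Y)$ without ever forming the full $n^2 \times d^2$ Kronecker product, and checks the tensor-trick block identity for one index $j_0$.

```python
import numpy as np

def attention_loss(A1, A2, A3, B, X, Y):
    """Sketch of the one-layer attention regression loss L(X, Y).

    Under a row-major vectorization x = vec(X), the vector A_{j0} x equals
    the j0-th row of A1 @ X @ A2.T, so the objective collapses to the
    matrix form  || D(X)^{-1} exp(A1 X A2^T) A3 Y - B ||_F^2  with
    D(X) = diag( exp(A1 X A2^T) 1_n ).
    """
    S = np.exp(A1 @ X @ A2.T)               # n x n unnormalized attention scores
    P = S / S.sum(axis=1, keepdims=True)     # row-wise softmax, i.e. D(X)^{-1} S
    R = P @ A3 @ Y - B                       # n x d residual against the target B
    return float(np.sum(R ** 2))

# Tiny usage example with random data (illustrative sizes only).
rng = np.random.default_rng(0)
n, d = 8, 4
A1 = rng.standard_normal((n, d))
A2 = rng.standard_normal((n, d))
A3 = rng.standard_normal((n, d))
B  = rng.standard_normal((n, d))
X  = rng.standard_normal((d, d))
Y  = rng.standard_normal((d, d))
print(attention_loss(A1, A2, A3, B, X, Y))

# Optional check of the tensor-trick identity for one block j0:
# the j0-th n x d^2 block of kron(A1, A2) applied to vec(X) matches
# the j0-th row of A1 @ X @ A2.T (row-major flattening assumed).
j0 = 0
block = np.kron(A1[j0:j0 + 1], A2)           # n x d^2 block of the Kronecker product
assert np.allclose(block @ X.reshape(-1), (A1 @ X @ A2.T)[j0])
```

This only illustrates how the loss is evaluated; the paper's contribution is the iterative greedy (Newton-style) solver whose per-iteration cost matches the stated matrix multiplication times, which the sketch does not attempt to reproduce.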
