Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention

04/22/2022
by Tong Yu, et al.

Self-attention is a widely used building block in neural models for mixing long-range data elements. Most self-attention networks use pairwise dot products to compute the attention coefficients, which requires O(N^2) computation for a sequence of length N. Although several approximation methods have been introduced to relieve this quadratic cost, the performance of the dot-product approach remains bottlenecked by the low-rank constraint in the attention-matrix factorization. In this paper, we propose Paramixer, a novel, scalable, and effective mixing building block. Our method factorizes the interaction matrix into several sparse matrices and parameterizes their non-zero entries with MLPs that take the data elements as input. The overall computing cost of the new building block is as low as O(N log N). Moreover, all factorizing matrices in Paramixer are full rank, so it does not suffer from the low-rank bottleneck. We tested the new method on synthetic and various real-world long sequential datasets and compared it with several state-of-the-art attention networks. The experimental results show that Paramixer performs better on most learning tasks.
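
To make the factorized-mixing idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: the class name SparseFactorMixer, the power-of-two (chord-style) link pattern, and the two-weights-per-row MLP are illustrative assumptions; the actual Paramixer uses the sparsity protocols and parameterization described in the paper. The sketch only shows how a product of sparse, data-parameterized factors can mix all positions in O(N log N).

```python
import torch
import torch.nn as nn


class SparseFactorMixer(nn.Module):
    """Toy sketch of mixing via a product of sparse factors.

    Each factor links position i to itself and to (i + 2^k) mod N, so every
    factor has O(N) non-zeros and there are O(log N) factors, giving an
    overall O(N log N) mixing cost. The link weights are not fixed constants:
    a small MLP predicts them from the data element at position i.
    """

    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.n_factors = max(1, (max_len - 1).bit_length())  # roughly log2(N)
        # One tiny MLP per factor; it maps an element's features to the two
        # link weights (self link and chord link) of that element's row.
        self.link_mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 2))
            for _ in range(self.n_factors)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim)
        n = x.size(1)
        idx = torch.arange(n, device=x.device)
        for k, mlp in enumerate(self.link_mlps):
            partner = (idx + 2 ** k) % n  # chord-style neighbour at distance 2^k
            w = mlp(x)                    # (batch, N, 2) data-dependent link weights
            x = w[..., :1] * x + w[..., 1:] * x[:, partner, :]
        return x


# Usage: mix a batch of two sequences of length 1024 with 64 features each.
mixer = SparseFactorMixer(dim=64, max_len=1024)
out = mixer(torch.randn(2, 1024, 64))  # shape (2, 1024, 64)
```

Because the MLPs regenerate the non-zero entries for every input, each sparse factor stays full rank in principle, which is the property the abstract contrasts with low-rank dot-product approximations.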

Related research

09/16/2021 - Sparse Factorization of Large Square Matrices
Square matrices appear in many machine learning problems and models. Opt...

07/03/2023 - Butterfly factorization by algorithmic identification of rank-one blocks
Many matrices associated with fast transforms possess a certain low-rank ...

09/09/2021 - Is Attention Better Than Matrix Decomposition?
As an essential ingredient of modern deep learning, attention mechanism,...

05/03/2015 - Structured Block Basis Factorization for Scalable Kernel Matrix Evaluation
Kernel matrices are popular in machine learning and scientific computing...

06/07/2021 - On the Expressive Power of Self-Attention Matrices
Transformer networks are able to capture patterns in data coming from ma...

06/12/2022 - ChordMixer: A Scalable Neural Attention Model for Sequences with Different Lengths
Sequential data naturally have different lengths in many domains, with s...

05/04/2015 - Self-Expressive Decompositions for Matrix Approximation and Clustering
Data-aware methods for dimensionality reduction and matrix decomposition...
