A Tensorized Transformer for Language Modeling

06/24/2019
by   Xindian Ma, et al.

Recent developments in neural models have connected the encoder and decoder through a self-attention mechanism. In particular, the Transformer, which relies solely on self-attention, has led to breakthroughs in Natural Language Processing (NLP) tasks. However, the multi-head attention mechanism, a key component of the Transformer, limits effective deployment of the model in resource-constrained settings. In this paper, building on the ideas of tensor decomposition and parameter sharing, we propose a novel self-attention model (namely, Multi-linear attention) with Block-Term Tensor Decomposition (BTD). We test and verify the proposed attention method on three language modeling tasks (i.e., PTB, WikiText-103, and One-Billion Word) and a neural machine translation task (i.e., WMT-2016 English-German). Multi-linear attention not only largely compresses the model parameters but also yields performance improvements over a number of language modeling approaches, such as Transformer, Transformer-XL, and Transformer with tensor-train decomposition.
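The abstract describes replacing the multi-head attention projections with a Block-Term Tensor Decomposition that combines Q, K, and V through shared factor matrices and small core tensors. As a rough illustration of that idea only, the following NumPy sketch builds a third-order attention tensor from shared Q, K, V factors and a few small cores, then averages the blocks; the shapes, function names, softmax placement, and the way the tensor is collapsed back to a matrix are assumptions for illustration, not the paper's exact equations.

```python
# A minimal NumPy sketch of the core idea: combine Q, K, V through a small core
# tensor (Tucker / block-term style) and reuse the same Q, K, V factors across
# several cores ("blocks"), averaging the results. Shapes, names, and the way
# the third-order tensor is collapsed back to a matrix are illustrative
# assumptions, not the paper's exact formulation.
import numpy as np

def single_block_attention(Q, K, V, G):
    """Q, K, V: (n, d) factor matrices; G: (d, d, d) core tensor.
    Builds T[i, j, k] = sum_{p,q,r} G[p, q, r] * Q[i, p] * K[j, q] * V[k, r]."""
    return np.einsum('pqr,ip,jq,kr->ijk', G, Q, K, V)

def multi_linear_attention(Q, K, V, cores):
    """Average several single-block tensors that share Q, K, V (the parameter-
    sharing step), then collapse two modes to get an (n, d) output."""
    T = np.mean([single_block_attention(Q, K, V, G) for G in cores], axis=0)
    # Illustrative collapse: softmax over the key mode, average over the value mode.
    A = np.exp(T - T.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)      # normalize over the key axis
    return A.mean(axis=2) @ V                 # (n, d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, h = 6, 8, 3                         # sequence length, factor dim, number of cores
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    # In practice the cores would be kept small (e.g., diagonal) -- that is where
    # the parameter savings over full multi-head projections would come from.
    cores = [rng.standard_normal((d, d, d)) * 0.1 for _ in range(h)]
    print(multi_linear_attention(Q, K, V, cores).shape)   # (6, 8)
```

Because the Q, K, V factors are shared across all cores and each core can be kept small, the parameter count grows much more slowly with the number of blocks than stacking independent attention heads would, which is the compression argument the abstract makes.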

