Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

by Yiping Lu, et al.
Peking University

The Transformer architecture is widely used in natural language processing. Despite its success, the design principle of the Transformer remains elusive. In this paper, we provide a novel perspective on the architecture: we show that the Transformer can be mathematically interpreted as a numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system. In particular, the way words in a sentence are abstracted into contexts as they pass through the layers of the Transformer can be interpreted as approximating the movement of multiple particles in space using the Lie-Trotter splitting scheme and Euler's method. Given this ODE perspective, the rich literature of numerical analysis can guide the design of effective structures beyond the Transformer. As an example, we propose to replace the Lie-Trotter splitting scheme with the Strang-Marchuk splitting scheme, which is more commonly used and has a much lower local truncation error. The Strang-Marchuk splitting scheme suggests that the self-attention and position-wise feed-forward network (FFN) sub-layers should not be treated equally. Instead, each layer should use two position-wise FFN sub-layers with the self-attention sub-layer placed in between. This leads to a brand new architecture. Such an FFN-attention-FFN layer is "Macaron-like", and thus we call the network with this new architecture the Macaron Net. Through extensive experiments, we show that the Macaron Net is superior to the Transformer on both supervised and unsupervised learning tasks. Code for reproducing the results and pretrained models can be found at https://github.com/zhuohan123/macaron-net
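
To make the difference between the two splitting schemes concrete, here is a minimal sketch on a toy linear ODE dx/dt = (F + G)x. It is not the paper's model: the 2x2 matrices `F` and `G` are hypothetical stand-ins for the two sub-layer dynamics, and all function names are chosen for illustration. Lie-Trotter applies a full `F` step followed by a full `G` step, while Strang-Marchuk applies half a `G` step, a full `F` step, then another half `G` step, mirroring the FFN-attention-FFN ordering described above.

```python
import numpy as np

def expm_taylor(M, terms=30):
    """Matrix exponential via a truncated Taylor series (adequate for small matrices)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, terms):
        term = term @ M / k
        result = result + term
    return result

# Two non-commuting toy operators (assumptions, not taken from the paper).
F = np.array([[0.0, 1.0], [0.0, 0.0]])  # stand-in for the "attention-like" term
G = np.array([[0.0, 0.0], [1.0, 0.0]])  # stand-in for the "FFN-like" term

def lie_trotter_step(x, h):
    # Full F step, then full G step.
    return expm_taylor(h * G) @ (expm_taylor(h * F) @ x)

def strang_marchuk_step(x, h):
    # Half G step, full F step, half G step.
    half_g = expm_taylor(0.5 * h * G)
    return half_g @ (expm_taylor(h * F) @ (half_g @ x))

def integrate(step, x0, t_end, n_steps):
    """Apply a one-step splitting scheme n_steps times with step size t_end / n_steps."""
    h = t_end / n_steps
    x = x0.copy()
    for _ in range(n_steps):
        x = step(x, h)
    return x

x0 = np.array([1.0, 0.0])
t_end = 1.0
exact = expm_taylor(t_end * (F + G)) @ x0  # reference solution of the unsplit ODE

def errors(n_steps):
    """Return (Lie-Trotter error, Strang-Marchuk error) at t_end for a given step count."""
    lie = np.linalg.norm(integrate(lie_trotter_step, x0, t_end, n_steps) - exact)
    strang = np.linalg.norm(integrate(strang_marchuk_step, x0, t_end, n_steps) - exact)
    return lie, strang
```

Comparing `errors(10)` with `errors(20)` on this toy problem should show the Lie-Trotter error roughly halving and the Strang-Marchuk error roughly quartering when the step size is halved, which is the first-order versus second-order behavior that motivates the symmetric ordering.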



