MUSE: Parallel Multi-Scale Attention for Sequence to Sequence Learning

11/17/2019
by Guangxiang Zhao, et al.

In sequence to sequence learning, self-attention has proven highly effective and has achieved significant improvements in many tasks. However, it is not without flaws. Although self-attention can model extremely long dependencies, attention in deep layers tends to over-concentrate on a single token, leading to insufficient use of local information and difficulty in representing long sequences. In this work, we explore parallel multi-scale representation learning on sequence data, aiming to capture both long-range and short-range language structures. To this end, we propose Parallel MUlti-Scale attEntion (MUSE) and MUSE-simple. MUSE-simple embodies the basic idea of parallel multi-scale sequence representation learning: it encodes the sequence in parallel at different scales using self-attention and pointwise transformation. MUSE builds on MUSE-simple and combines convolution with self-attention to learn sequence representations at a wider range of scales. We focus on machine translation, where the proposed approach achieves substantial improvements over the Transformer, especially on long sequences. More importantly, we find that although conceptually simple, its success in practice requires careful design, and the multi-scale attention must be built on a unified semantic space. Under the standard setting, the proposed model achieves strong performance and outperforms all previous models on three main machine translation tasks. In addition, MUSE has potential for accelerating inference due to its parallelism. Code will be available at https://github.com/lancopku/MUSE.
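
As a rough illustration of the idea described in the abstract (not the authors' implementation, which will be released at https://github.com/lancopku/MUSE), the sketch below shows how a MUSE-style block might fuse a global self-attention branch, a local depthwise-convolution branch, and a pointwise feed-forward branch in parallel. All three branches read from one shared projection, reflecting the point that multi-scale attention must be built on a unified semantic space. Module names, parameter names, and default sizes here are illustrative assumptions.

# Minimal PyTorch sketch of a MUSE-style parallel multi-scale block.
# This is an assumption-laden illustration, not the paper's reference code.
import torch
import torch.nn as nn


class ParallelMultiScaleBlock(nn.Module):
    """Fuses self-attention (global scale), depthwise convolution (local scale),
    and a pointwise feed-forward transform (token scale) in parallel.
    Every branch consumes the same shared projection, so the scales are
    combined in a unified semantic space."""

    def __init__(self, d_model=512, n_heads=8, kernel_size=3, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Shared projection into the unified semantic space.
        self.shared_proj = nn.Linear(d_model, d_model)
        # Global branch: standard multi-head self-attention.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               dropout=dropout, batch_first=True)
        # Local branch: depthwise convolution over the time dimension.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)
        # Pointwise branch: position-wise feed-forward network.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # x: (batch, seq_len, d_model)
        h = self.shared_proj(self.norm(x))
        attn_out, _ = self.self_attn(h, h, h, attn_mask=attn_mask)
        conv_out = self.conv(h.transpose(1, 2)).transpose(1, 2)
        ffn_out = self.ffn(h)
        # Parallel fusion of the three scales, added residually to the input.
        return x + self.dropout(attn_out + conv_out + ffn_out)


# Example usage with placeholder shapes.
block = ParallelMultiScaleBlock()
x = torch.randn(2, 16, 512)          # (batch, seq_len, d_model)
y = block(x)                          # same shape as x

Summing the branch outputs keeps the computation across scales parallel, which is also what gives this kind of block its potential for faster inference.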


Related research

11/18/2021 · You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling
Transformer-based models are widely used in natural language processing ...

09/30/2020 · Learning Hard Retrieval Cross Attention for Transformer
The Transformer translation model that is based on the multi-head attention...

11/11/2019 · BP-Transformer: Modelling Long-Range Context via Binary Partitioning
The Transformer model is widely successful on many natural language proc...

06/28/2020 · Self-Attention Networks for Intent Detection
Self-attention networks (SAN) have shown promising performance in variou...

06/08/2022 · UHD Image Deblurring via Multi-scale Cubic-Mixer
Currently, transformer-based algorithms are making a splash in the domai...

05/31/2023 · Recasting Self-Attention with Holographic Reduced Representations
In recent years, self-attention has become the dominant paradigm for seq...

07/18/2019 · Neural Shuffle-Exchange Networks - Sequence Processing in O(n log n) Time
A key requirement in sequence to sequence processing is the modeling of ...
