Learning Multiscale Transformer Models for Sequence Generation

06/19/2022
by Bei Li, et al.

Multiscale feature hierarchies have proven successful in computer vision. This has motivated researchers to design multiscale Transformers for natural language processing, mostly based on the self-attention mechanism, for example by restricting the receptive field across heads or extracting local fine-grained features via convolutions. However, most existing works directly model local features while ignoring word-boundary information, which results in redundant and ambiguous attention distributions and a lack of interpretability. In this work, we define the scales in terms of different linguistic units, including sub-words, words and phrases. We build a multiscale Transformer model by establishing relationships among the scales based on word-boundary information and phrase-level prior knowledge. The proposed Universal MultiScale Transformer, namely Umst, was evaluated on two sequence generation tasks. Notably, it yields consistent performance gains over the strong baseline on several test sets without sacrificing efficiency.
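The abstract does not spell out how the coarser scales are built from sub-word states; as a rough illustration only, the sketch below shows one common way to derive word-scale representations from sub-word representations using a word-boundary index, assuming a PyTorch setting. The helper name pool_subwords_to_words and the mean-pooling choice are assumptions for illustration, not the paper's actual mechanism.

import torch

def pool_subwords_to_words(subword_states, word_ids):
    # Mean-pool sub-word hidden states into word-scale representations.
    # subword_states: (seq_len, hidden) tensor of sub-word representations.
    # word_ids:       (seq_len,) long tensor mapping each sub-word to its word
    #                 index, derivable from tokenizer word-boundary information.
    # Returns a (num_words, hidden) tensor of word-scale states.
    num_words = int(word_ids.max().item()) + 1
    hidden = subword_states.size(-1)
    word_sums = torch.zeros(num_words, hidden, dtype=subword_states.dtype)
    word_sums.index_add_(0, word_ids, subword_states)
    counts = torch.zeros(num_words, dtype=subword_states.dtype)
    counts.index_add_(0, word_ids, torch.ones_like(word_ids, dtype=subword_states.dtype))
    return word_sums / counts.unsqueeze(-1)

# Example: sub-words "trans", "##former", "models" belong to words 0, 0, 1.
states = torch.randn(3, 8)
word_ids = torch.tensor([0, 0, 1])
word_states = pool_subwords_to_words(states, word_ids)  # shape (2, 8)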
