Multi-Scale Self-Attention for Text Classification

12/02/2019
by Qipeng Guo, et al.

In this paper, we introduce prior knowledge in the form of multi-scale structure into self-attention modules. We propose a Multi-Scale Transformer, which uses multi-scale multi-head self-attention to capture features at different scales. Based on a linguistic perspective and an analysis of a Transformer (BERT) pre-trained on a large corpus, we further design a strategy to control the scale distribution of each layer. Results on three kinds of tasks (21 datasets) show that our Multi-Scale Transformer consistently and significantly outperforms the standard Transformer on small and moderate-size datasets.
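To make the mechanism concrete, below is a minimal PyTorch sketch of multi-scale multi-head self-attention in the spirit of the abstract: each head attends only within a local window whose width (its "scale") is fixed per head, while heads given a negative scale attend globally. The class and parameter names (MultiScaleSelfAttention, head_scales) are illustrative assumptions, not the authors' released implementation or their exact scale-control strategy.

```python
# Minimal sketch (assumed names, not the paper's code): multi-head self-attention
# where each head is restricted to a local window of a given scale.
import math
import torch
import torch.nn as nn


class MultiScaleSelfAttention(nn.Module):
    def __init__(self, d_model, head_scales):
        super().__init__()
        self.n_heads = len(head_scales)
        assert d_model % self.n_heads == 0
        self.d_head = d_model // self.n_heads
        self.scales = head_scales              # e.g. [3, 3, 7, 7, -1, -1]; -1 = global head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):                          # -> (batch, heads, seq_len, d_head)
            return t.reshape(b, n, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)

        # Per-head band mask: head h only sees tokens with |i - j| <= scale[h].
        pos = torch.arange(n, device=x.device)
        dist = (pos[None, :] - pos[:, None]).abs()          # (seq_len, seq_len)
        masks = []
        for w in self.scales:
            if w < 0:                                        # global head: nothing masked
                masks.append(torch.zeros(n, n, dtype=torch.bool, device=x.device))
            else:                                            # local head: mask outside window
                masks.append(dist > w)
        mask = torch.stack(masks)                            # (heads, seq_len, seq_len)
        scores = scores.masked_fill(mask[None], float("-inf"))

        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, -1)
        return self.out(out)


# Usage: mix small-scale (local) and large-scale (global) heads in one layer.
layer = MultiScaleSelfAttention(d_model=512, head_scales=[3, 3, 7, 7, -1, -1, -1, -1])
y = layer(torch.randn(2, 32, 512))                           # -> (2, 32, 512)
```

The scale list per layer is where the paper's scale-distribution strategy would plug in; the fixed values above are only for illustration.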


Related research

Shunted Self-Attention via Multi-Scale Token Aggregation (11/30/2021)
Recent Vision Transformer (ViT) models have demonstrated encouraging res...

Aggregated Text Transformer for Scene Text Detection (11/25/2022)
This paper explores the multi-scale aggregation strategy for scene text ...

Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms (04/21/2020)
Because attention modules are core components of Transformer-based model...

DEEPCHORUS: A Hybrid Model of Multi-scale Convolution and Self-attention for Chorus Detection (02/13/2022)
Chorus detection is a challenging problem in musical signal processing a...

UHD Image Deblurring via Multi-scale Cubic-Mixer (06/08/2022)
Currently, transformer-based algorithms are making a splash in the domai...

Beyond Fixation: Dynamic Window Visual Transformer (03/24/2022)
Recently, a surge of interest in visual transformers is to reduce the co...