Multiresolution Transformer Networks: Recurrence is Not Essential for Modeling Hierarchical Structure

08/27/2019
by Vikas K. Garg, et al.

The Transformer architecture is based entirely on self-attention and has been shown to outperform models that employ recurrence on sequence transduction tasks such as machine translation. Its superior performance has been attributed to propagating signals over shorter distances between input and output positions than recurrent architectures do. We establish connections between the dynamics of the Transformer and of recurrent networks to argue that several factors, including gradient flow along an ensemble of multiple weakly dependent paths, play a paramount role in the Transformer's success. We then leverage these dynamics to introduce Multiresolution Transformer Networks as the first architecture that exploits hierarchical structure in data via self-attention. Our models significantly outperform state-of-the-art recurrent and hierarchical recurrent models on two real-world query suggestion datasets. In particular, on the AOL data, our model registers at least a 20% improvement on every precision score and over a 25% improvement in BLEU score relative to the best-performing recurrent model. We thus provide strong evidence that recurrence is not essential for modeling hierarchical structure.
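One plausible reading of "hierarchical structure via self-attention" for query suggestion is a two-level encoder: self-attention over the words within each query, followed by self-attention over the resulting query vectors across a session, in place of the hierarchical recurrence used by prior session-aware models. The sketch below illustrates only that general idea; it is not the authors' implementation, and the class name, mean pooling, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of a two-level ("multiresolution") self-attention encoder.
# Low level: a Transformer encodes the words of each query.
# High level: a second Transformer attends over the query vectors of a session.
# All names and hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn

class MultiresolutionEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        word_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(word_layer, num_layers)    # within each query
        query_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.query_encoder = nn.TransformerEncoder(query_layer, num_layers)  # across the session

    def forward(self, session_tokens):
        # session_tokens: (num_queries, query_len) token ids for one session
        word_states = self.word_encoder(self.embed(session_tokens))  # (Q, L, d_model)
        query_vecs = word_states.mean(dim=1)                         # pool each query to one vector
        # Treat the session as a sequence of query vectors and self-attend over it
        return self.query_encoder(query_vecs.unsqueeze(0))           # (1, Q, d_model)

# Usage: encode a toy session of 3 queries with 5 tokens each
enc = MultiresolutionEncoder(vocab_size=1000)
session = torch.randint(0, 1000, (3, 5))
print(enc(session).shape)  # torch.Size([1, 3, 256])
```

A session-level representation produced this way can condition a decoder that generates the next query, which is how the abstract's precision and BLEU metrics for query suggestion would typically be computed.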

