Single Headed Attention RNN: Stop Thinking With Your Head

11/26/2019
by Stephen Merity

The leading approaches in language modeling are all obsessed with TV shows of my youth - namely Transformers and Sesame Street. Transformers this, Transformers that, and over here a bonfire worth of GPU-TPU-neuromorphic wafer scale silicon. We opt for the lazy path of old and proven techniques with a fancy crypto-inspired acronym: the Single Headed Attention RNN (SHA-RNN). The author's lone goal is to show that the entire field might have evolved in a different direction if we had instead been obsessed with a slightly different acronym and slightly different result. We take a previously strong language model based only on boring LSTMs and get it to within a stone's throw of a stone's throw of state-of-the-art byte-level language model results on enwik8. We also achieve state-of-the-art on WikiText-103 - or do we? This work has undergone no intensive hyperparameter optimization and lived entirely on a commodity desktop machine that made the author's small studio apartment far too warm in the midst of a San Franciscan summer. The final results are achievable in plus or minus 24 hours on a single GPU as the author is impatient. The attention mechanism is also readily extended to large contexts and requires minimal computation. Take that, Sesame Street.
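For readers curious what "single headed attention over an RNN" might look like in practice, below is a minimal PyTorch sketch, not the paper's code: a plain scaled dot-product head layered on top of an LSTM with a residual connection. The class and parameter names (SingleHeadAttention, SHARNNLayer) are illustrative, and details of the paper's actual block (its gating and "Boom" feed-forward layer, among others) are omitted.

# Hypothetical sketch, not the paper's implementation: one attention head
# over an LSTM's hidden states, in the spirit of the SHA-RNN described above.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadAttention(nn.Module):
    """A single attention head: scaled dot-product over a (possibly long) memory."""

    def __init__(self, dim):
        super().__init__()
        # Only the query is projected; keys and values reuse the memory directly,
        # keeping the per-token cost of attention low.
        self.query = nn.Linear(dim, dim)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, hidden, memory):
        # hidden: (batch, seq, dim) current hidden states
        # memory: (batch, mem_len, dim) cached past hidden states
        q = self.query(hidden)
        scores = torch.bmm(q, memory.transpose(1, 2)) * self.scale
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights, memory)


class SHARNNLayer(nn.Module):
    """LSTM followed by a single attention head and a residual connection."""

    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.attn = SingleHeadAttention(dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, memory):
        h, _ = self.rnn(x)
        # Attend over concatenated past and present hidden states.
        mem = torch.cat([memory, h], dim=1)
        return self.norm(h + self.attn(h, mem)), mem


# Example usage (illustrative shapes):
# layer = SHARNNLayer(512)
# y, mem = layer(torch.randn(1, 16, 512), torch.zeros(1, 0, 512))

In this sketch the only learned attention parameter is the query projection, and cached hidden states serve directly as keys and values, which is one way a single head can stay cheap enough to attend over the large contexts the abstract mentions.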


