Character-Level Language Modeling with Deeper Self-Attention

08/09/2018
by Rami Al-Rfou et al.

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
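The auxiliary-loss idea from the abstract is easy to see in code: attach a small prediction head to every layer and sum the next-character losses from all layers and all sequence positions. The sketch below is a minimal PyTorch illustration, not the authors' implementation; the class name, hyperparameters, and layer choices are assumptions for illustration, and the loss weighting/annealing schedule described in the paper is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CharTransformerWithAuxLosses(nn.Module):
    """Character-level transformer that sums a next-character loss from
    every layer, not just the last one (intermediate-layer auxiliary losses)."""

    def __init__(self, vocab_size=256, d_model=512, n_heads=8, n_layers=12, context=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(context, d_model))
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads,
                                       dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(n_layers)
        ])
        # One prediction head per layer so intermediate layers are also
        # trained to predict the next character.
        self.heads = nn.ModuleList([
            nn.Linear(d_model, vocab_size) for _ in range(n_layers)
        ])

    def forward(self, chars, targets):
        # chars, targets: (batch, seq) int64 character ids, with targets
        # being chars shifted left by one position.
        seq = chars.size(1)
        causal = torch.triu(
            torch.full((seq, seq), float("-inf"), device=chars.device), diagonal=1)
        h = self.embed(chars) + self.pos[:seq]
        total_loss = 0.0
        for layer, head in zip(self.layers, self.heads):
            h = layer(h, src_mask=causal)
            logits = head(h)  # (batch, seq, vocab): a prediction at every position
            # Loss at every sequence position (intermediate positions),
            # summed over every layer (intermediate layers). The paper's
            # schedule for phasing out intermediate losses is not shown.
            total_loss = total_loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return total_loss
```

In this sketch every layer gets its own linear head; at inference time only the final layer's head would be used, so the extra heads act purely as training-time regularization/supervision, which is the role the abstract attributes to the auxiliary losses.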

Related research

07/02/2019 - Augmenting Self-attention with Persistent Memory
Transformer networks have led to important progress in language modelin...

11/17/2020 - Cascade RNN-Transducer: Syllable Based Streaming On-device Mandarin Speech Recognition with a Syllable-to-Character Converter
End-to-end models are favored in automatic speech recognition (ASR) beca...

10/26/2017 - Rotational Unit of Memory
The concepts of unitary evolution matrices and associative memory have b...

10/15/2018 - Trellis Networks for Sequence Modeling
We present trellis networks, a new architecture for sequence modeling. O...

04/08/2021 - Revisiting Simple Neural Probabilistic Language Models
Recent progress in language modeling has been driven not only by advance...

11/19/2015 - Alternative structures for character-level RNNs
Recurrent neural networks are convenient and efficient models for langua...

12/02/2019 - Neural Academic Paper Generation
In this work, we tackle the problem of structured text generation, speci...
