Multiscale sequence modeling with a learned dictionary

07/03/2017
by Bart van Merriënboer et al.

We propose a generalization of neural network sequence models. Instead of predicting one symbol at a time, our multi-scale model makes predictions over multiple, potentially overlapping multi-symbol tokens. A variation of the byte-pair encoding (BPE) compression algorithm is used to learn the dictionary of tokens on which the model is trained. When applied to language modeling, our model has the flexibility of character-level models while maintaining many of the performance benefits of word-level models. Our experiments show that this model outperforms a regular LSTM on language modeling tasks, especially at smaller model sizes.
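
As a rough illustration of the dictionary-learning step, the sketch below implements standard byte-pair encoding merges in Python. The function name `learn_bpe_dictionary` and the toy corpus are illustrative assumptions, not the authors' code, and the paper's variation (which yields potentially overlapping multi-symbol tokens) differs from this plain version.

```python
from collections import Counter

def learn_bpe_dictionary(corpus, num_merges):
    # Start from character-level tokens: each word becomes a tuple of symbols.
    words = Counter(tuple(word) for word in corpus.split())
    dictionary = {ch for word in words for ch in word}

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[a, b] += freq
        if not pairs:
            break

        # Merge the most frequent pair into a single multi-symbol token.
        best = max(pairs, key=pairs.get)
        merged = best[0] + best[1]
        dictionary.add(merged)

        # Rewrite every word using the new merged token.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words

    return dictionary

if __name__ == "__main__":
    toy_corpus = "low low lower lowest newer newest wide wider widest"
    print(sorted(learn_bpe_dictionary(toy_corpus, num_merges=8)))
```

Each merge promotes the most frequent adjacent pair to a single token, so the dictionary grows from single characters toward longer, frequent substrings; the model in the paper then predicts over such multi-symbol tokens rather than one character at a time.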

Related research

Neural Lattice Language Models (03/13/2018)
In this work, we propose a new language modeling paradigm that has the a...

"Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction (03/01/2022)
Whole word masking (WWM), which masks all subwords corresponding to a wo...

Models In a Spelling Bee: Language Models Implicitly Learn the Character Composition of Tokens (08/25/2021)
Standard pretrained language models operate on sequences of subword toke...

Language Modeling with Deep Transformers (05/10/2019)
We explore multi-layer autoregressive Transformer models in language mod...

Restoring Hebrew Diacritics Without a Dictionary (05/11/2021)
We demonstrate that it is feasible to diacritize Hebrew script without a...

Learning and analyzing vector encoding of symbolic representations (03/10/2018)
We present a formal language with expressions denoting general symbol st...

Autoregressive Modeling is Misspecified for Some Sequence Distributions (10/22/2020)
Should sequences be modeled autoregressively—one symbol at a time? How m...
