Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

04/23/2017
by Kazuya Kawakami, et al.

Fixed-vocabulary language models fail to account for one of the most characteristic statistical facts of natural language: the frequent creation and reuse of new word types. Although character-level language models offer a partial solution in that they can create word types not attested in the training corpus, they do not capture the "bursty" distribution of such words. In this paper, we augment a hierarchical LSTM language model that generates sequences of word tokens character by character with a caching mechanism that learns to reuse previously generated words. To validate our model we construct a new open-vocabulary language modeling corpus (the Multilingual Wikipedia Corpus, MWC) from comparable Wikipedia articles in 7 typologically diverse languages and demonstrate the effectiveness of our model across this range of languages.
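As a rough illustration of the copy-or-generate idea described in the abstract, the sketch below mixes a toy character-level spelling probability with a count-based cache distribution under a fixed interpolation gate. The function names, the count-based cache, and the hand-set gate value are illustrative assumptions for this sketch only; in the model described above, both distributions and the interpolation weight are produced by the hierarchical LSTM rather than fixed by hand.

```python
from collections import Counter

def char_lm_prob(word, char_probs):
    """Toy character-level probability of spelling out `word`:
    product of per-character probabilities plus an end-of-word symbol."""
    p = 1.0
    for ch in list(word) + ["</w>"]:
        p *= char_probs.get(ch, 1e-4)
    return p

def cache_prob(word, cache):
    """Probability of copying `word` from the cache of recently
    generated words (here a simple count-based distribution)."""
    total = sum(cache.values())
    return cache[word] / total if total else 0.0

def word_prob(word, char_probs, cache, gate):
    """Mixture of (i) generating the word character by character and
    (ii) copying it from the cache, weighted by a gate in [0, 1]."""
    return gate * char_lm_prob(word, char_probs) + (1.0 - gate) * cache_prob(word, cache)

# Toy usage: once "Mankell" has been generated and cached, reusing it is far
# more likely than respelling it character by character, which is the kind of
# "bursty" reuse behaviour the caching mechanism is meant to capture.
char_probs = {ch: 0.03 for ch in "abcdefghijklmnopqrstuvwxyzM"}
char_probs["</w>"] = 0.05
cache = Counter({"Mankell": 1, "the": 3})
gate = 0.5  # fixed here for illustration; learned/predicted in the actual model

print(word_prob("Mankell", char_probs, cache, gate))   # boosted by the cache
print(word_prob("Mankells", char_probs, cache, gate))  # unseen form: generation path only
```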

Related research:

04/10/2017 · Character-Word LSTM Language Models
We present a Character-Word Long Short-Term Memory Language Model which ...

06/04/2021 · Modeling the Unigram Distribution
The unigram distribution is the non-contextual probability of finding a ...

11/27/2019 · SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling
With language modeling becoming the popular base task for unsupervised r...

09/02/2017 · Patterns versus Characters in Subword-aware Neural Language Modeling
Words in some natural languages can have a composite structure. Elements...

12/01/2022 · Extensible Prompts for Language Models
We propose eXtensible Prompt (X-Prompt) for prompting a large language m...

12/02/2022 · Nonparametric Masked Language Modeling
Existing language models (LMs) predict tokens with a softmax over a fini...

06/22/2020 · Clinical Predictive Keyboard using Statistical and Neural Language Modeling
A language model can be used to predict the next word during authoring, ...
