Unsupervised Word Discovery with Segmental Neural Language Models

11/23/2018
by Kazuya Kawakami, et al.

We propose a segmental neural language model that combines the representational power of neural networks and the structure learning mechanism of Bayesian nonparametrics, and show that it learns to discover semantically meaningful units (e.g., morphemes and words) from unsegmented character sequences. The model generates text as a sequence of segments, where each segment is generated either character-by-character from a sequence model or as a single draw from a lexical memory that stores multi-character units. Its parameters are fit to maximize the marginal likelihood of the training data, summing over all segmentations of the input, and its hyperparameters are likewise set to optimize held-out marginal likelihood. To prevent the model from overusing the lexical memory, which leads to poor generalization and bad segmentation, we introduce a differentiable regularizer that penalizes based on the expected length of each segment. To our knowledge, this is the first demonstration of neural networks that have predictive distributions better than LSTM language models and also infer a segmentation into word-like units that are competitive with the best existing word discovery models.
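The abstract mentions fitting the parameters to the marginal likelihood of the training data, summed over all segmentations. This sum can be computed with a forward (dynamic-programming) recursion over segment boundaries. The sketch below, a minimal illustration rather than the paper's implementation, shows that recursion with a stand-in `seg_log_prob` in place of the model's mixture of the character-level generator and the lexical memory; the `max_len` cap and the toy scoring function are assumptions for the example.

```python
import math

def log_sum_exp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def segment_marginal_log_likelihood(chars, seg_log_prob, max_len=5):
    """Forward recursion: alpha[j] is the log-probability of the
    prefix chars[:j], summed over all ways of segmenting it.
    Each candidate segment chars[i:j] is scored by seg_log_prob,
    a stand-in for the model's segment distribution."""
    n = len(chars)
    alpha = [float("-inf")] * (n + 1)
    alpha[0] = 0.0  # empty prefix has probability 1
    for j in range(1, n + 1):
        terms = [alpha[i] + seg_log_prob(chars[i:j])
                 for i in range(max(0, j - max_len), j)]
        alpha[j] = log_sum_exp(terms)
    return alpha[n]

# Toy segment model (an assumption for illustration): every
# character in a segment costs probability 0.1, so each full
# segmentation of an n-character string has probability 0.1**n.
toy = lambda seg: math.log(0.1) * len(seg)
ll = segment_marginal_log_likelihood("dogs", toy, max_len=4)
# 8 segmentations of a 4-character string, each 0.1**4,
# so exp(ll) == 8e-4
```

In the paper's model the same recursion is run with neural segment scores, and maximizing `alpha[n]` trains the network without ever observing gold word boundaries.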


