Modeling the Unigram Distribution

by Irene Nikkarinen, et al.

The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. While of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach is highly dependent on sample size and assigns zero probability to any out-of-vocabulary (OOV) word form. As a result, it produces negatively biased probabilities for OOV word forms and positively biased probabilities for in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution, claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show that it produces much better estimates across a diverse set of seven languages than the naïve use of neural character-level language models.
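To make the bias concrete, here is a minimal sketch of the sample-frequency estimator the abstract criticizes. The toy corpus and the helper name `sample_frequency_unigram` are illustrative assumptions and do not come from the paper; the paper's own model is not shown here.

```python
from collections import Counter


def sample_frequency_unigram(tokens):
    """Return a function mapping a word form to its relative frequency in `tokens`.

    Hypothetical helper for illustration: this is the naive estimator,
    not the model proposed in the paper.
    """
    counts = Counter(tokens)
    total = sum(counts.values())

    def prob(word):
        # Counter returns 0 for unseen keys, so OOV words get probability 0.
        return counts[word] / total

    return prob


# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat".split()
p = sample_frequency_unigram(corpus)

in_corpus_mass = sum(p(w) for w in set(corpus))
oov_prob = p("dog")
```

Because the in-corpus word forms already absorb all of the probability mass, every unseen word form such as "dog" receives exactly zero, which is the source of the two biases described above.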




Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

Fixed-vocabulary language models fail to account for one of the most cha...

Similarity-Based Models of Word Cooccurrence Probabilities

In many applications of natural language processing (NLP) it is necessar...

Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation

We introduce a model for constructing vector representations of words by...

Integrating Approaches to Word Representation

The problem of representing the atomic elements of language in modern ne...

SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling

With language modeling becoming the popular base task for unsupervised r...

Neutral evolution and turnover over centuries of English word popularity

Here we test Neutral models against the evolution of English word freque...

Neural Based Statement Classification for Biased Language

Biased language commonly occurs around topics which are of controversial...