Modeling the Unigram Distribution

06/04/2021
by Irene Nikkarinen, et al.

The unigram distribution is the non-contextual probability of finding a specific word form in a corpus. Although it is of central importance to the study of language, it is commonly approximated by each word's sample frequency in the corpus. This approach is highly dependent on sample size and assigns zero probability to any out-of-vocabulary (OOV) word form. As a result, it produces negatively biased probabilities for OOV word forms and positively biased probabilities for in-corpus words. In this work, we argue in favor of properly modeling the unigram distribution, claiming it should be a central task in natural language processing. With this in mind, we present a novel model for estimating it in a language (a neuralization of Goldwater et al.'s (2011) model) and show that it produces much better estimates across a diverse set of 7 languages than the naïve use of neural character-level language models.
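
The bias of the sample-frequency estimate can be seen in a minimal Python sketch. The toy corpus and the helper name `unigram_mle` are assumptions made for illustration only. They are not taken from the paper, and this is the naive baseline the abstract criticizes, not the proposed model.

```python
from collections import Counter


def unigram_mle(tokens):
    """Sample-frequency (maximum-likelihood) estimate of the unigram distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus, for illustration only.
corpus = "the cat sat on the mat".split()
probs = unigram_mle(corpus)

# An in-corpus word receives its relative frequency (2 of 6 tokens).
p_the = probs.get("the", 0.0)

# An out-of-vocabulary word form receives exactly zero probability.
p_oov = probs.get("dog", 0.0)

# All probability mass is spent on in-corpus words, leaving none for OOV forms.
in_corpus_mass = sum(probs.values())

print(p_the, p_oov, in_corpus_mass)
```

Because the in-corpus estimates already sum to one, any true probability that belongs to unseen word forms has been redistributed to observed words, which is the positive bias the abstract describes.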


research
04/23/2017

Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

Fixed-vocabulary language models fail to account for one of the most cha...
research
09/27/1998

Similarity-Based Models of Word Cooccurrence Probabilities

In many applications of natural language processing (NLP) it is necessar...
research
02/07/2023

What do Language Models know about word senses? Zero-Shot WSD with Language Models and Domain Inventories

Language Models are the core for almost any Natural Language Processing ...
research
09/10/2021

Integrating Approaches to Word Representation

The problem of representing the atomic elements of language in modern ne...
research
03/30/2017

Neutral evolution and turnover over centuries of English word popularity

Here we test Neutral models against the evolution of English word freque...
research
11/27/2019

SimpleBooks: Long-term dependency book dataset with simplified English vocabulary for word-level language modeling

With language modeling becoming the popular base task for unsupervised r...
research
11/14/2018

Neural Based Statement Classification for Biased Language

Biased language commonly occurs around topics which are of controversial...
