A Measure-Theoretic Characterization of Tight Language Models

12/20/2022
by Li Du, et al.

Language modeling, a central task in natural language processing, involves estimating a probability distribution over strings. In most cases, the estimated distribution sums to 1 over all finite strings. However, in some pathological cases, probability mass can “leak” onto the set of infinite sequences. In order to characterize the notion of leakage more precisely, this paper offers a measure-theoretic treatment of language modeling. We prove that many popular language model families are in fact tight, meaning that they will not leak in this sense. We also generalize characterizations of tightness proposed in previous works.
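The notion of leakage can be illustrated numerically. The sketch below (not from the paper; the model setup is a hypothetical assumption for illustration) compares two toy language models that differ only in their per-step end-of-string (EOS) probability: with a constant EOS probability the mass over finite strings sums to 1 (a tight model), while with a rapidly decaying EOS probability the product of the survival probabilities stays bounded away from 0, so mass "leaks" onto infinite sequences.

```python
# Illustration (assumed toy setup, not the paper's construction):
# a model is "tight" when the probability of all finite strings sums to 1.
# eos_prob(t) gives the probability of emitting EOS at step t,
# conditioned on not having stopped earlier.

def finite_string_mass(eos_prob, max_len=10_000):
    """Total probability of strings ending before step max_len."""
    mass = 0.0      # accumulated probability of finite strings
    survive = 1.0   # probability of not having emitted EOS yet
    for t in range(max_len):
        p = eos_prob(t)
        mass += survive * p
        survive *= 1.0 - p
    return mass

# Constant EOS probability 0.1: mass converges to 1 -> tight.
tight_mass = finite_string_mass(lambda t: 0.1)

# EOS probability 2^-(t+2): sum of EOS probabilities is finite,
# so prod(1 - p_t) > 0 and mass on finite strings stays below 1 -> not tight.
leaky_mass = finite_string_mass(lambda t: 2.0 ** -(t + 2))

print(f"tight model: {tight_mass:.6f}")   # close to 1
print(f"leaky model: {leaky_mass:.6f}")   # well below 1
```

In the leaky model, the limit of the surviving mass corresponds exactly to the probability assigned to the set of infinite sequences, which is the quantity the paper's measure-theoretic treatment makes precise.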


