A Cognitive Regularizer for Language Modeling

05/15/2021
by Jason Wei, et al.

The uniform information density (UID) hypothesis, which posits that speakers prefer utterances that distribute information uniformly across the signal, has gained substantial traction in psycholinguistics as an explanation for certain syntactic, morphological, and prosodic choices. Could we operationalize uniform information density as an inductive bias for statistical language modeling? In this paper, we augment the canonical MLE objective for training language models by encoding UID as regularization. In experiments on ten languages spanning five language families, we find that using UID regularization consistently improves perplexity in language models, having a larger effect when training data is limited. Moreover, via analysis of generated sequences, we find that UID-regularized language models are higher-entropy and produce text that is longer and more lexically diverse. Our results not only suggest that UID is a reasonable inductive bias for language modeling, but also provide an alternative validation of the UID hypothesis using modern-day NLP tools.
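To make "encoding UID as regularization" concrete, here is a minimal sketch of one way the MLE objective could be augmented. It assumes the penalty is the variance of per-token surprisal within a sequence, which is zero when information is spread perfectly evenly. The abstract does not state the exact form of the regularizer, so the variance penalty, the function name `uid_regularized_loss`, and the weight `beta` are illustrative assumptions rather than the authors' implementation.

```python
import math


def uid_regularized_loss(token_probs, beta=0.01):
    """Return (loss, nll, variance) for one sequence.

    token_probs: probabilities the model assigned to each observed token.
    beta: hypothetical regularization strength (an assumption, not from the paper).
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    # Surprisal (in nats) of each observed token: -log p(token | context).
    surprisals = [-math.log(p) for p in token_probs]
    # Standard MLE term: mean negative log-likelihood of the sequence.
    nll = sum(surprisals) / len(surprisals)
    # Assumed UID penalty: population variance of surprisal across the sequence.
    variance = sum((s - nll) ** 2 for s in surprisals) / len(surprisals)
    return nll + beta * variance, nll, variance


# A sequence with evenly spread information incurs no penalty,
# while an uneven one with a similar number of tokens does.
even_loss, _, even_var = uid_regularized_loss([0.25, 0.25, 0.25, 0.25], beta=0.5)
uneven_loss, _, uneven_var = uid_regularized_loss([0.9, 0.1, 0.9, 0.1], beta=0.5)
```

Under this assumed form, setting `beta` to zero recovers the plain MLE objective, so the regularizer acts purely as an added preference for uniform surprisal.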

research
05/07/2021

Understanding by Understanding Not: Modeling Negation in Language Models

Negation is a core construction in natural language. Despite being very ...
research
05/20/2023

Revisiting Entropy Rate Constancy in Text

The uniform information density (UID) hypothesis states that humans tend...
research
06/13/2023

Tokenization with Factorized Subword Encoding

In recent years, language models have become increasingly larger and mor...
research
12/11/2019

Just Add Functions: A Neural-Symbolic Language Model

Neural network language models (NNLMs) have achieved ever-improving accu...
research
09/23/2021

Revisiting the Uniform Information Density Hypothesis

The uniform information density (UID) hypothesis posits a preference amo...
research
03/24/2022

Evaluating Distributional Distortion in Neural Language Modeling

A fundamental characteristic of natural language is the high rate at whi...
research
04/29/2020

Evaluating Transformer-Based Multilingual Text Classification

As NLP tools become ubiquitous in today's technological landscape, they ...
