Autoregressive Modeling is Misspecified for Some Sequence Distributions

10/22/2020
by Chu-Cheng Lin, et al.

Should sequences be modeled autoregressively, one symbol at a time? How much computation is needed to predict the next symbol? Local normalization is cheap, but that low cost also limits its power. We point out that some probability distributions over discrete sequences cannot be well-approximated by any autoregressive model whose runtime and parameter size grow polynomially in the sequence length, even though their unnormalized sequence probabilities are efficient to compute exactly. Intuitively, the probability of the next symbol can be expensive to compute or approximate (even via randomized algorithms) when it marginalizes over exponentially many possible futures, which is in general NP-hard. Our result is conditional on the widely believed hypothesis that NP ⊈ P/poly (without which the polynomial hierarchy would collapse at the second level). This theoretical observation cautions against the view that pumping up parameter size is a straightforward way to improve autoregressive models (e.g., in language modeling). It also suggests that globally normalized (energy-based) models may sometimes outperform locally normalized (autoregressive) models, as we demonstrate experimentally for language modeling.
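To make the marginalization argument concrete, here is a minimal sketch, not taken from the paper, of how a next-symbol probability is obtained from a globally normalized model by brute force. The binary alphabet, the fixed length, and the names `unnormalized_score`, `prefix_mass`, and `next_symbol_probs` are all assumptions chosen for illustration.

```python
from itertools import product
import math

VOCAB = (0, 1)  # hypothetical binary alphabet
LENGTH = 4      # hypothetical fixed sequence length


def unnormalized_score(seq):
    """Toy weight that is cheap to evaluate: e^(number of adjacent equal symbols)."""
    agreements = sum(a == b for a, b in zip(seq, seq[1:]))
    return math.exp(agreements)


def prefix_mass(prefix):
    """Total weight of all full sequences that start with `prefix`.

    Enumerates len(VOCAB) ** (LENGTH - len(prefix)) completions, so the cost
    grows exponentially with the number of remaining positions.
    """
    remaining = LENGTH - len(prefix)
    return sum(
        unnormalized_score(tuple(prefix) + tail)
        for tail in product(VOCAB, repeat=remaining)
    )


def next_symbol_probs(prefix):
    """Locally normalized conditional p(next symbol | prefix), via marginalization."""
    masses = {s: prefix_mass(tuple(prefix) + (s,)) for s in VOCAB}
    total = sum(masses.values())
    return {s: m / total for s, m in masses.items()}


probs = next_symbol_probs((0,))
```

Scoring one complete sequence takes a single pass, while each conditional sums over every possible future. For this toy score a dynamic program would remove the exponential cost. The abstract's claim concerns distributions for which, assuming NP ⊈ P/poly, no polynomial-size autoregressive model can approximate these conditionals well.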

research
06/04/2018

Self-Normalization Properties of Language Modeling

Self-normalizing discriminative models approximate the normalized probab...
research
05/24/2019

Discrete Flows: Invertible Generative Models of Discrete Data

While normalizing flows have led to significant advances in modeling hig...
research
03/24/2022

Evaluating Distributional Distortion in Neural Language Modeling

A fundamental characteristic of natural language is the high rate at whi...
research
04/22/2020

Residual Energy-Based Models for Text Generation

Text generation is ubiquitous in many NLP tasks, from summarization, to ...
research
11/10/2020

Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration

Discrete structures play an important role in applications like program ...
research
07/03/2017

Multiscale sequence modeling with a learned dictionary

We propose a generalization of neural network sequence models. Instead o...
research
10/04/2022

HYPRO: A Hybridly Normalized Probabilistic Model for Long-Horizon Prediction of Event Sequences

In this paper, we tackle the important yet under-investigated problem of...
