
Self-Normalization Properties of Language Modeling
Self-normalizing discriminative models approximate the normalized probab...

Discrete Flows: Invertible Generative Models of Discrete Data
While normalizing flows have led to significant advances in modeling hig...

Glancing Transformer for Non-Autoregressive Neural Machine Translation
Non-autoregressive neural machine translation achieves remarkable infere...

Learning Discrete Energy-based Models via Auxiliary-variable Local Exploration
Discrete structures play an important role in applications like program ...

Residual Energy-Based Models for Text Generation
Text generation is ubiquitous in many NLP tasks, from summarization, to ...

Multiscale sequence modeling with a learned dictionary
We propose a generalization of neural network sequence models. Instead o...

Generalizing and Hybridizing Count-based and Neural Language Models
Language models (LMs) are statistical models that calculate probabilitie...
Autoregressive Modeling is Misspecified for Some Sequence Distributions
Should sequences be modeled autoregressively—one symbol at a time? How much computation is needed to predict the next symbol? While local normalization is cheap, this also limits its power. We point out that some probability distributions over discrete sequences cannot be well-approximated by any autoregressive model whose runtime and parameter size grow polynomially in the sequence length—even though their unnormalized sequence probabilities are efficient to compute exactly. Intuitively, the probability of the next symbol can be expensive to compute or approximate (even via randomized algorithms) when it marginalizes over exponentially many possible futures, which is in general NP-hard. Our result is conditional on the widely believed hypothesis that NP ⊈ P/poly (without which the polynomial hierarchy would collapse at the second level). This theoretical observation serves as a caution to the viewpoint that pumping up parameter size is a straightforward way to improve autoregressive models (e.g., in language modeling). It also suggests that globally normalized (energy-based) models may sometimes outperform locally normalized (autoregressive) models, as we demonstrate experimentally for language modeling.
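The asymmetry at the heart of the abstract—an unnormalized sequence score that is cheap to evaluate, versus a next-symbol conditional that marginalizes over all completions—can be illustrated with a toy sketch. The energy function, vocabulary, and sequence length below are invented for illustration and are not from the paper; brute-force marginalization here stands in for the computation that, in the worst case, no polynomial-size autoregressive model can carry out.

```python
# Toy sketch (hypothetical, not the paper's construction): a globally
# normalized model whose unnormalized score is O(n) to compute exactly,
# but whose locally normalized conditional P(next symbol | prefix)
# requires a sum over exponentially many suffixes.
import math
from itertools import product

VOCAB = "ab"
SEQ_LEN = 8  # fixed total length; conditionals marginalize over the suffix

def unnormalized_score(seq):
    """Cheap and exact: favor sequences whose second half copies the first."""
    half = len(seq) // 2
    matches = sum(a == b for a, b in zip(seq[:half], seq[half:]))
    return math.exp(matches)

def next_symbol_probs(prefix):
    """Brute-force local normalization: for each candidate symbol, sum the
    unnormalized score over all |VOCAB|**(SEQ_LEN - len(prefix) - 1)
    completions -- exponential in the remaining length."""
    rest = SEQ_LEN - len(prefix) - 1
    totals = {}
    for sym in VOCAB:
        totals[sym] = sum(
            unnormalized_score(prefix + sym + "".join(tail))
            for tail in product(VOCAB, repeat=rest)
        )
    z = sum(totals.values())
    return {sym: t / z for sym, t in totals.items()}

# Prefix "abab" fills the first half, so the next symbol is compared
# against position 0 of the prefix ("a") by the copy-favoring energy.
probs = next_symbol_probs("abab")
```

With only two symbols and length 8 the sum is tiny, but it doubles with every additional position of remaining suffix; the globally normalized score itself never gets more expensive than a single linear pass.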