
The EOS Decision and Length Extrapolation

by Benjamin Newman, et al.

Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of an often-overlooked modeling decision: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40% improvement over +EOS in the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position in a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.
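The oracle setting described above can be illustrated with a small sketch. This is one plausible reading, not the authors' implementation: it assumes greedy decoding, and it assumes that a +EOS model is held to the oracle length by masking out its EOS score at every step, while a -EOS model simply has no EOS item to mask. The names `oracle_length_decode`, `step_fn`, and `toy_step` are hypothetical and exist only for this example.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface: a "model" is any function that maps the tokens
# generated so far to a list of next-token scores, one per vocabulary item.
StepFn = Callable[[Sequence[int]], List[float]]


def oracle_length_decode(
    step_fn: StepFn,
    prefix: Sequence[int],
    target_len: int,
    eos_id: Optional[int] = None,
) -> List[int]:
    """Greedily generate exactly `target_len` tokens after `prefix`.

    `eos_id` is the EOS index for a +EOS model, or None for a -EOS model
    whose vocabulary has no EOS item. Masking EOS is an assumption of
    this sketch about how the oracle length could be enforced.
    """
    out = list(prefix)
    for _ in range(target_len):
        scores = list(step_fn(out))
        if eos_id is not None:
            # Never let the model stop before the oracle length.
            scores[eos_id] = float("-inf")
        # Greedy choice: index of the highest remaining score.
        next_tok = max(range(len(scores)), key=scores.__getitem__)
        out.append(next_tok)
    return out[len(prefix):]


def toy_step(tokens: Sequence[int]) -> List[float]:
    """Toy +EOS 'model' over the vocabulary {0, 1, 2 = EOS}.

    It prefers EOS once three tokens exist, mimicking a model that wants
    to stop at the lengths it saw during training.
    """
    if len(tokens) >= 3:
        return [0.1, 0.2, 0.9]
    return [0.5, 0.3, 0.1]


if __name__ == "__main__":
    # Force five tokens even though the toy model would stop after three.
    print(oracle_length_decode(toy_step, [], 5, eos_id=2))
```

In this toy run the model would emit EOS after three tokens if left alone; with the mask it continues to the requested length, which is the condition under which the abstract compares +EOS and -EOS networks.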

