
Dynamic Prediction Length for Time Series with Sequence to Sequence Networks
Recurrent neural networks and sequence to sequence models require a pred...

Constructing sparse Davenport-Schinzel sequences by hypergraph edge coloring
A sequence is called r-sparse if every contiguous subsequence of length ...

State-Reification Networks: Improving Generalization by Modeling the Distribution of Hidden Representations
Machine learning promises methods that generalize well from finite label...

Make A Long Image Short: Adaptive Token Length for Vision Transformers
The vision transformer splits each image into a sequence of tokens with ...

How Many Pages? Paper Length Prediction from the Metadata
Being able to predict the length of a scientific paper may be helpful in...

The Gaussian equivalence of generative models for learning with two-layer neural networks
Understanding the impact of data structure on learning in neural network...

Globe-hopping
We consider versions of the grasshopper problem (Goulko and Kent, 2017) ...
The EOS Decision and Length Extrapolation
Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting, forcing models to generate to the correct sequence length at test time, to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40% improvement over +EOS on the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position in a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.
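The oracle setting described in the abstract can be illustrated with a minimal decoding sketch: the model is forced to generate exactly the correct number of tokens by masking out the EOS logit until that length is reached. The function names, toy model, and token indices below are illustrative assumptions, not the paper's implementation.

```python
import math

EOS = 0  # assumed vocabulary index of the end-of-sequence token


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def oracle_decode(step_logits_fn, target_len):
    """Greedy decoding that never emits EOS before target_len tokens.

    step_logits_fn maps the tokens generated so far to next-token logits.
    Masking the EOS logit to -inf forces generation to continue, which is
    the oracle-length setting the abstract describes.
    """
    tokens = []
    while len(tokens) < target_len:
        logits = list(step_logits_fn(tokens))
        logits[EOS] = float("-inf")  # suppress EOS until the target length
        probs = softmax(logits)
        tokens.append(max(range(len(probs)), key=probs.__getitem__))
    return tokens


# Toy +EOS-style model: after 3 tokens it strongly prefers EOS (loosely,
# a "length attractor"); the oracle mask overrides that preference.
def toy_logits(tokens):
    if len(tokens) >= 3:
        return [5.0, 1.0, 0.5]  # EOS dominates
    return [-5.0, 2.0, 1.0]


print(oracle_decode(toy_logits, 6))  # six tokens, EOS never among them
```

Without the mask, greedy decoding from `toy_logits` would stop at length 3; with it, the model is evaluated purely on what it generates when length is given, which is how the paper isolates length-extrapolative behavior from the EOS decision.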