Sequence Length is a Domain: Length-based Overfitting in Transformer Models

09/15/2021
by Dušan Variš et al.

Transformer-based sequence-to-sequence architectures, while achieving state-of-the-art results on a large number of NLP tasks, can still suffer from overfitting during training. In practice, this is usually countered either by applying regularization methods (e.g. dropout, L2 regularization) or by providing huge amounts of training data. Additionally, Transformer and other architectures are known to struggle when generating very long sequences. For example, in machine translation, neural systems perform worse on very long sequences than the preceding phrase-based translation approaches (Koehn and Knowles, 2017). We present results suggesting that this issue may also stem from a mismatch between the length distributions of the training and validation data, combined with the aforementioned tendency of neural networks to overfit to the training data. We demonstrate on a simple string editing task and a machine translation task that Transformer performance drops significantly on sequences whose length diverges from the length distribution of the training data. We further show that the observed drop in performance is caused by the hypothesis length conforming to the lengths seen during training, rather than by the length of the input sequence.
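
To make the kind of length-stratified evaluation described above concrete, the sketch below (not the authors' code) buckets test examples by reference length and reports per-bucket exact-match accuracy, a plausible metric for the string editing task. The bucket size, the function names, and the toy "model" that truncates long outputs are illustrative assumptions only.

from collections import defaultdict

def length_bucket(length, bucket_size=10):
    """Map a sequence length to a coarse bucket, e.g. 0-9, 10-19, ..."""
    lo = (length // bucket_size) * bucket_size
    return (lo, lo + bucket_size - 1)

def accuracy_by_length(hypotheses, references, bucket_size=10):
    """Exact-match accuracy per reference-length bucket.

    hypotheses, references: parallel lists of token lists.
    Returns {bucket: (accuracy, count)}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for hyp, ref in zip(hypotheses, references):
        bucket = length_bucket(len(ref), bucket_size)
        total[bucket] += 1
        if hyp == ref:
            correct[bucket] += 1
    return {b: (correct[b] / total[b], total[b]) for b in sorted(total)}

if __name__ == "__main__":
    # Toy illustration: the "model" copies the input but truncates anything
    # longer than 12 tokens, mimicking a system that only saw short
    # sequences during training.
    refs = [["tok"] * n for n in (5, 8, 11, 15, 20, 25)]
    hyps = [r[:12] for r in refs]
    for bucket, (acc, n) in accuracy_by_length(hyps, refs).items():
        print(f"length {bucket[0]:>2}-{bucket[1]:<2}: acc={acc:.2f} (n={n})")

In this toy run, accuracy is perfect for buckets within the "training" length range and collapses for longer buckets, which is the qualitative pattern the paper attributes to length-based overfitting.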


