Smoothing and Shrinking the Sparse Seq2Seq Search Space

03/18/2021
by Ben Peters, et al.

Current sequence-to-sequence models are trained to minimize cross-entropy and use softmax to compute the locally normalized probabilities over target sequences. While this setup has led to strong results in a variety of tasks, one unsatisfying aspect is its length bias: models give high scores to short, inadequate hypotheses and often make the empty string the argmax, the so-called "cat got your tongue" problem. Recently proposed entmax-based sparse sequence-to-sequence models present a possible solution, since they can shrink the search space by assigning zero probability to bad hypotheses, but their ability to handle word-level tasks with transformers has never been tested. In this work, we show that entmax-based models effectively solve the "cat got your tongue" problem, removing a major source of model error for neural machine translation. In addition, we generalize label smoothing, a critical regularization technique, to the broader family of Fenchel-Young losses, which includes both cross-entropy and the entmax losses. Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion and deliver improvements and better calibration properties on cross-lingual morphological inflection and machine translation for 6 language pairs.
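As a rough illustration of how a sparse output distribution can shrink the search space, the sketch below implements sparsemax, the alpha = 2 member of the entmax family, alongside a conventional label-smoothing helper. This is not the paper's code: the function names (sparsemax, smoothed_target), the toy decoder scores, and the NumPy setting are illustrative assumptions, and the Fenchel-Young label smoothing proposed in the paper is more general than the uniform mixing shown here. The authors' group has released an entmax library with the actual alpha-entmax transforms and losses, which would be the natural choice in practice.

    import numpy as np

    def sparsemax(z):
        # Sparsemax (the alpha = 2 member of the entmax family): Euclidean
        # projection of the score vector z onto the probability simplex.
        # Unlike softmax, it can return exact zeros, pruning hypotheses.
        z = np.asarray(z, dtype=float)
        z_sorted = np.sort(z)[::-1]               # scores in descending order
        cumsum = np.cumsum(z_sorted)
        k = np.arange(1, z.size + 1)
        in_support = 1.0 + k * z_sorted > cumsum  # support condition
        k_star = k[in_support][-1]                # size of the support
        tau = (cumsum[in_support][-1] - 1.0) / k_star
        return np.maximum(z - tau, 0.0)

    def smoothed_target(vocab_size, gold_index, epsilon=0.1):
        # Conventional label smoothing: mix the one-hot gold label with the
        # uniform distribution. The paper's Fenchel-Young generalization
        # recovers this as a special case; only the target is built here.
        target = np.full(vocab_size, epsilon / vocab_size)
        target[gold_index] += 1.0 - epsilon
        return target

    scores = np.array([2.0, 1.2, 0.4, -0.8, -1.5])  # toy decoder scores
    print(sparsemax(scores))        # [0.9 0.1 0.  0.  0. ] -- exact zeros
    print(smoothed_target(5, gold_index=0))

Running the sketch shows exact zeros for the low-scoring tokens; this is the mechanism by which entmax-based decoders can assign zero probability to the empty string and other inadequate hypotheses instead of merely down-weighting them.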


