On Sparsifying Encoder Outputs in Sequence-to-Sequence Models

04/24/2020
by Biao Zhang, et al.

Sequence-to-sequence models usually transfer all encoder outputs to the decoder for generation. In this work, by contrast, we hypothesize that these encoder outputs can be compressed to shorten the sequence delivered for decoding. We take Transformer as the testbed and introduce a layer of stochastic gates in between the encoder and the decoder. The gates are regularized using the expected value of the sparsity-inducing L0 penalty, resulting in a subset of encoder outputs being completely masked out. In other words, via joint training, the L0Drop layer forces Transformer to route information through a subset of its encoder states. We investigate the effects of this sparsification on two machine translation and two summarization tasks. Experiments show that, depending on the task, around 40-70% of the encoder outputs can be pruned without significantly compromising quality. The reduced output length gives L0Drop the potential to improve decoding efficiency: it yields a speedup of up to 1.65x on document summarization tasks over the standard Transformer. We analyze the behaviour of L0Drop and observe that it exhibits systematic preferences for pruning certain word types; for example, function words and punctuation are pruned most. Inspired by these observations, we explore the feasibility of specifying rule-based patterns that mask out encoder outputs based on information such as part-of-speech tags, word frequency and word position.
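The abstract describes the gating mechanism only at a high level. As a concrete illustration, below is a minimal PyTorch sketch of an L0-regularized gate over encoder outputs, assuming the hard concrete parameterization of Louizos et al. (2018) for the stochastic gates; the module name L0DropGate, the linear gate predictor, and the hyperparameter values are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of an L0Drop-style gate layer (assumed design, not the
# authors' code). Each encoder output h_i receives a stochastic scalar gate
# g_i in [0, 1]; the expected L0 norm of the gates is returned so it can be
# added to the training loss as a sparsity penalty.
import math

import torch
import torch.nn as nn


class L0DropGate(nn.Module):
    def __init__(self, d_model, beta=0.5, gamma=-0.1, zeta=1.1):
        super().__init__()
        # Per-state gate location log_alpha, predicted from the encoder output.
        self.gate_predictor = nn.Linear(d_model, 1)
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, enc_out, enc_mask):
        # enc_out: (batch, src_len, d_model); enc_mask: (batch, src_len), 1.0 for real tokens.
        log_alpha = self.gate_predictor(enc_out).squeeze(-1)

        if self.training:
            # Sample from the hard concrete distribution (reparameterized).
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / self.beta)
        else:
            # Deterministic estimator at test time.
            s = torch.sigmoid(log_alpha)

        # Stretch to (gamma, zeta) and clip to [0, 1], so exact zeros are possible.
        gate = (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0) * enc_mask

        # Expected L0 norm: probability that each gate is non-zero.
        expected_l0 = torch.sigmoid(
            log_alpha - self.beta * math.log(-self.gamma / self.zeta)
        )
        l0_penalty = (expected_l0 * enc_mask).sum()

        return enc_out * gate.unsqueeze(-1), gate, l0_penalty
```

In this sketch, training would add the returned l0_penalty (scaled by a coefficient) to the usual cross-entropy loss, and at inference encoder states whose gate is exactly zero could be dropped from the sequence before cross-attention, which is where the reported decoding speedup would come from.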
