Optimal Completion Distillation for Sequence Learning

by Sara Sabour, et al.

We present Optimal Completion Distillation (OCD), a training procedure for optimizing sequence-to-sequence models based on edit distance. OCD is efficient, has no hyper-parameters of its own, and does not require pretraining or joint optimization with conditional log-likelihood. Given a partial sequence generated by the model, we first identify the set of optimal suffixes that minimize the total edit distance, using an efficient dynamic programming algorithm. Then, for each position of the generated sequence, we use a target distribution that puts equal probability on the first token of all the optimal suffixes. OCD achieves state-of-the-art performance on end-to-end speech recognition on both the Wall Street Journal and Librispeech datasets, with 9.3% WER and 4.5% WER, respectively.
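The target construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`ocd_targets`, `uniform_over_optimal`) and the `EOS` marker are hypothetical. The sketch assumes that the optimal next tokens are the reference tokens that follow every reference prefix with minimum edit distance to the generated prefix, and it keeps one row of the standard edit-distance table per generated token.

```python
EOS = "</s>"  # hypothetical end-of-sequence marker, not taken from the source


def uniform_over_optimal(dist, ref_with_eos):
    """Uniform distribution over tokens that start an optimal suffix.

    dist[i] is the edit distance between the generated prefix and the first
    i reference tokens. Assumption: continuing with the token that follows
    any reference prefix of minimum distance keeps the total edit distance
    minimal.
    """
    best = min(dist)
    tokens = {ref_with_eos[i] for i, d in enumerate(dist) if d == best}
    prob = 1.0 / len(tokens)
    return {tok: prob for tok in tokens}


def ocd_targets(generated, reference):
    """Return one target distribution per position of the generated prefix.

    targets[t] is the target for the token that follows generated[:t], so
    the list has len(generated) + 1 entries.
    """
    ref_with_eos = list(reference) + [EOS]
    n = len(reference)
    # Row for the empty generated prefix: distance to reference[:i] is i.
    dist = list(range(n + 1))
    targets = [uniform_over_optimal(dist, ref_with_eos)]
    for tok in generated:
        # Distance from the longer generated prefix to the empty reference.
        new = [dist[0] + 1]
        for i in range(1, n + 1):
            substitute = dist[i - 1] + (0 if reference[i - 1] == tok else 1)
            extra_generated = dist[i] + 1    # tok has no reference counterpart
            missing_reference = new[i - 1] + 1  # reference[i - 1] is skipped
            new.append(min(substitute, extra_generated, missing_reference))
        dist = new
        targets.append(uniform_over_optimal(dist, ref_with_eos))
    return targets


if __name__ == "__main__":
    for t, target in enumerate(ocd_targets(list("AX"), list("AB"))):
        print(t, target)
```

With reference `AB` and generated prefix `AX`, the last target splits its mass between `B` and the end marker: treating `X` as an extra token and continuing with `B`, or treating `X` as a substitution for `B` and stopping, both give a total edit distance of 1.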


Sequence-Level Knowledge Distillation for Model Compression of Attention-based Sequence-to-Sequence Speech Recognition

We investigate the feasibility of sequence-level knowledge distillation ...

An Improved Algorithm for The k-Dyck Edit Distance Problem

A Dyck sequence is a sequence of opening and closing parentheses (of var...

Promising Accurate Prefix Boosting for sequence-to-sequence ASR

In this paper, we present promising accurate prefix boosting (PAPB), a d...

Towards better decoding and language model integration in sequence to sequence models

The recently proposed Sequence-to-Sequence (seq2seq) framework advocates...

Latent Sequence Decompositions

We present the Latent Sequence Decompositions (LSD) framework. LSD decom...

Learning Online Alignments with Continuous Rewards Policy Gradient

Sequence-to-sequence models with soft attention had significant success ...

Dynamic Programming Approach to Template-based OCR

In this paper we propose a dynamic programming solution to the template-...

Code Repositories


Implementation of the Optimal Completion Distillation for Sequence Labeling
