Log In Sign Up

Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment

by   Ethan A. Chi, et al.

Non-autoregressive models greatly improve decoding speed over typical sequence-to-sequence models, but suffer from degraded performance. Infilling and iterative refinement models make up some of this gap by editing the outputs of a non-autoregressive model, but are constrained in the edits that they can make. We propose iterative realignment, where refinements occur over latent alignments rather than output sequence space. We demonstrate this in speech recognition with Align-Refine, an end-to-end Transformer-based model which refines connectionist temporal classification (CTC) alignments to allow length-changing insertions and deletions. Align-Refine outperforms Imputer and Mask-CTC, matching an autoregressive baseline on WSJ at 1/14th the real-time factor and attaining a LibriSpeech test-other WER of 9.0 model is strong even in one iteration with a shallower decoder.


page 1

page 2

page 3

page 4


Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition

Non-autoregressive transformer models have achieved extremely fast infer...

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

We present Mask CTC, a novel non-autoregressive end-to-end automatic spe...

Imputer: Sequence Modelling via Imputation and Dynamic Programming

This paper presents the Imputer, a neural sequence model that generates ...

Streaming Align-Refine for Non-autoregressive Deliberation

We propose a streaming non-autoregressive (non-AR) decoding algorithm to...

Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input

Non-autoregressive (NAR) transformer models have achieved significantly ...

Knowledge Transfer and Distillation from Autoregressive to Non-Autoregressive Speech Recognition

Modern non-autoregressive (NAR) speech recognition systems aim to accele...

Attentional Speech Recognition Models Misbehave on Out-of-domain Utterances

We discuss the problem of echographic transcription in autoregressive se...