Align With Purpose: Optimize Desired Properties in CTC Models with a General Plug-and-Play Framework

07/04/2023
by   Eliya Segev, et al.

Connectionist Temporal Classification (CTC) is a widely used criterion for training supervised sequence-to-sequence (seq2seq) models. It enables learning the relations between input and output sequences, termed alignments, by marginalizing over perfect alignments (those that yield the ground truth) at the expense of imperfect ones. This binary distinction between perfect and imperfect alignments falls short of capturing other essential alignment properties that matter in real-world applications. Here we propose Align With Purpose, a general Plug-and-Play framework for enhancing a desired property in models trained with the CTC criterion. We do so by complementing the CTC loss with an additional term that prioritizes alignments according to the desired property. Our method requires no intervention in the CTC loss function, enables easy optimization of a variety of properties, and allows differentiation among both perfect and imperfect alignments. We apply our framework in the domain of Automatic Speech Recognition (ASR) and show its generality in terms of property selection, architectural choice, and scale of the training dataset (up to 280,000 hours). To demonstrate its effectiveness, we apply it to two unrelated properties: emission time and word error rate (WER). For the former, we report a latency improvement of up to 570ms with a minor reduction in WER, and for the latter, we report a relative improvement of 4.5% in WER over the baseline models. To the best of our knowledge, these applications have never been demonstrated on a scale of data as large as ours. Notably, our method can be implemented with only a few lines of code and can be extended to other alignment-free loss functions and to domains other than ASR.
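To make the "additional loss term" idea concrete, the sketch below is an illustration only, not the authors' exact formulation. It assumes a hypothetical `property_fn` that maps a sampled frame-wise alignment to a counterpart that is better with respect to the desired property (for example, the same labels emitted a few frames earlier for lower latency), and adds a hinge term that asks the model to score the improved alignment at least as high as the sampled one. The names `align_with_purpose_loss`, `property_fn`, `n_samples`, and `aux_weight` are assumptions introduced for this sketch.

```python
import torch
import torch.nn.functional as F

def align_with_purpose_loss(log_probs, targets, input_lengths, target_lengths,
                            property_fn, n_samples=4, aux_weight=0.01, blank=0):
    """Illustrative sketch: standard CTC loss plus a property-prioritizing hinge term.

    log_probs:   (T, B, C) log-softmax outputs of the acoustic model.
    property_fn: hypothetical callable mapping a sampled alignment (B, T) to a
                 counterpart that is "better" w.r.t. the desired property.
    """
    # Standard CTC term, left untouched.
    ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                     blank=blank, zero_infinity=True)

    # Sample frame-wise alignments from the model's posterior, build
    # property-improved counterparts, and penalize the model whenever a
    # sampled alignment is scored higher than its improved version.
    lp = log_probs.transpose(0, 1)                                       # (B, T, C)
    aux = log_probs.new_zeros(())
    for _ in range(n_samples):
        sampled = torch.distributions.Categorical(logits=lp).sample()    # (B, T)
        improved = property_fn(sampled)                                   # (B, T)
        lp_sampled = lp.gather(2, sampled.unsqueeze(-1)).squeeze(-1).sum(dim=1)
        lp_improved = lp.gather(2, improved.unsqueeze(-1)).squeeze(-1).sum(dim=1)
        aux = aux + F.relu(lp_sampled - lp_improved).mean()

    return ctc + aux_weight * aux / n_samples
```

In this sketch `aux_weight` and `n_samples` are hypothetical hyperparameters; for a latency-oriented property, `property_fn` could, for instance, shift non-blank emissions to earlier frames, while for a WER-oriented property it could substitute an alignment that decodes to a lower-error hypothesis.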


