Improving Frame-level Classifier for Word Timings with Non-peaky CTC in End-to-End Automatic Speech Recognition

06/09/2023
by   Xianzhao Chen, et al.

End-to-end (E2E) systems have shown performance comparable to hybrid systems for automatic speech recognition (ASR). Word timings, a by-product of ASR, are essential in many applications, especially subtitling and computer-aided pronunciation training. In this paper, we improve the frame-level classifier for word timings in an E2E system by introducing label priors into the connectionist temporal classification (CTC) loss, an approach adopted from prior work, and by combining low-level Mel-scale filter banks with the high-level ASR encoder output as the input feature. On the internal Chinese corpus, the proposed method achieves 95.68 93.0 E2E approach with an absolute increase of 4.80 languages. In addition, we further improve word timing accuracy by delaying CTC peaks with frame-wise knowledge distillation, though this was evaluated only on LibriSpeech.
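
The two ingredients named in the abstract, label priors applied to frame-level posteriors and the combination of low-level and high-level features, can be illustrated with a minimal sketch. It is not the authors' implementation. Several details are assumptions made for illustration only: priors are estimated by averaging posteriors over frames, the scaling factor `alpha=0.3` is arbitrary, each frame is renormalised after the prior is subtracted, and the encoder is assumed to subsample by a factor of 4 so that each encoder frame is repeated to match the filter-bank frame rate. All function names are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one frame of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def estimate_log_priors(log_posts):
    """Assumed prior estimate: mean posterior of each label over all frames."""
    num_frames = len(log_posts)
    num_labels = len(log_posts[0])
    return [
        math.log(sum(math.exp(frame[k]) for frame in log_posts) / num_frames)
        for k in range(num_labels)
    ]


def apply_label_priors(log_posts, log_priors, alpha=0.3):
    """Subtract scaled log priors from each frame, then renormalise (assumption)."""
    return [
        log_softmax([lp - alpha * pr for lp, pr in zip(frame, log_priors)])
        for frame in log_posts
    ]


def combine_features(fbank, encoder_out, subsampling=4):
    """Concatenate each filter-bank frame with the encoder frame that covers it."""
    combined = []
    for t, low in enumerate(fbank):
        idx = min(t // subsampling, len(encoder_out) - 1)
        combined.append(list(low) + list(encoder_out[idx]))
    return combined


# Toy posteriors over 3 labels; index 0 plays the role of the CTC blank.
posts = [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
log_posts = [[math.log(p) for p in frame] for frame in posts]
log_priors = estimate_log_priors(log_posts)
adjusted = apply_label_priors(log_posts, log_priors, alpha=0.3)

# Toy features: 8 filter-bank frames (dim 2) and 2 encoder frames (dim 3).
fbank = [[float(t), 0.0] for t in range(8)]
encoder_out = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]
combined = combine_features(fbank, encoder_out, subsampling=4)
```

In this toy case the blank label has the largest prior, so dividing by the scaled prior lowers the blank probability in every frame. That is the intuition behind a less peaky output distribution. In a training setup, the adjusted log-probabilities would be passed to a CTC loss in place of the raw ones.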


