Timestamped Embedding-Matching Acoustic-to-Word CTC ASR

06/20/2023
by Woojay Jeon, et al.

In this work, we describe a novel method of training an embedding-matching word-level connectionist temporal classification (CTC) automatic speech recognizer (ASR) so that it directly produces word start times and durations, which many real-world applications require, in addition to the transcription. The word timestamps enable the ASR to output word segmentations and word confusion networks without relying on a secondary model or a forced alignment process at test time. Our proposed system achieves word segmentation accuracy similar to that of a hybrid DNN-HMM (Deep Neural Network-Hidden Markov Model) system, with less than 3 ms difference in mean absolute error in word start times on TIMIT data. At the same time, we observed less than 5% relative increase in word error rate compared to the non-timestamped system when using the same audio training data and nearly identical model size. We also contribute a more rigorous analysis of multiple-hypothesis embedding-matching ASR in general.
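The segmentation metric quoted above, mean absolute error in word start times, can be illustrated with a minimal Python sketch. It assumes the reference and hypothesized words have already been paired one-to-one; the function name, argument names, and example values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def start_time_mae_ms(ref_starts_s: Sequence[float],
                      hyp_starts_s: Sequence[float]) -> float:
    """Mean absolute error between paired word start times, in milliseconds.

    Assumption (not from the paper): both sequences hold start times in
    seconds, and element i of each refers to the same word.
    """
    if len(ref_starts_s) != len(hyp_starts_s):
        raise ValueError("reference and hypothesis must have the same length")
    if not ref_starts_s:
        raise ValueError("at least one word is required")
    total_abs_error_s = sum(abs(r - h) for r, h in zip(ref_starts_s, hyp_starts_s))
    return 1000.0 * total_abs_error_s / len(ref_starts_s)


if __name__ == "__main__":
    # Illustrative values only, not data from the paper.
    reference = [0.10, 0.52, 1.00]
    hypothesis = [0.11, 0.50, 1.03]
    print(f"MAE: {start_time_mae_ms(reference, hypothesis):.1f} ms")
```

Comparing two systems on this metric then amounts to computing it for each against the same reference segmentation and taking the difference.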


