Streaming Joint Speech Recognition and Disfluency Detection

11/16/2022
by Hayato Futami, et al.

Disfluency detection has mostly been handled with a pipeline approach, as a post-processing step after speech recognition. In this study, we propose streaming Transformer-based encoder-decoder models that jointly perform speech recognition and disfluency detection. Compared to pipeline approaches, the joint models can leverage acoustic information, which makes disfluency detection robust to recognition errors and provides non-verbal clues. Moreover, joint modeling results in low-latency and lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text with special tags indicating the starting and ending points of each disfluent part. However, the additional disfluency tags cause problems with latency and with standard language model adaptation. To solve these problems, we propose a multi-task model with two output layers at the Transformer decoder: one for speech recognition and the other for disfluency detection. The disfluency output is conditioned on the currently recognized token through an additional token-dependency mechanism. We show that the proposed joint models outperformed a BERT-based pipeline approach in both accuracy and latency, on both the Switchboard corpus and the Corpus of Spontaneous Japanese.
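To make the difference between the two target formats concrete, the following is a minimal sketch, not the authors' implementation. It assumes hypothetical tag strings `<dysfl>` and `</dysfl>` (the abstract does not name the actual tags), whitespace tokenization, and an invented example sentence. It converts a transcript-enriched target into the kind of parallel token and label sequences that a two-output multi-task model could be trained on.

```python
from typing import List, Tuple

# Hypothetical tag strings; the abstract does not specify the real ones.
START_TAG = "<dysfl>"
END_TAG = "</dysfl>"


def split_tagged_transcript(tagged: str) -> Tuple[List[str], List[int]]:
    """Turn a tag-enriched transcript into plain tokens and 0/1 disfluency labels.

    Tokens between START_TAG and END_TAG get label 1 (disfluent),
    all other tokens get label 0 (fluent). The tags themselves are dropped.
    """
    tokens: List[str] = []
    labels: List[int] = []
    inside = False
    for piece in tagged.split():
        if piece == START_TAG:
            inside = True
        elif piece == END_TAG:
            inside = False
        else:
            tokens.append(piece)
            labels.append(1 if inside else 0)
    return tokens, labels


# Invented example: the speaker corrects "to boston uh" to "to denver".
tagged = "i want a flight <dysfl> to boston uh </dysfl> to denver"
tokens, labels = split_tagged_transcript(tagged)
print(list(zip(tokens, labels)))
```

In this sketch the tagged sequence has 11 output symbols while the plain token sequence has 9. The extra tag symbols lengthen the sequence the decoder must emit and do not appear in ordinary text, which gives one intuition for the latency and language model adaptation issues the abstract attributes to the transcript-enriched format.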
