Streaming Joint Speech Recognition and Disfluency Detection

by Hayato Futami et al.

Disfluency detection has mainly been tackled in a pipeline approach, as post-processing of speech recognition output. In this study, we propose Transformer-based encoder-decoder models that jointly solve speech recognition and disfluency detection, and that work in a streaming manner. Compared with pipeline approaches, the joint models can leverage acoustic information, which makes disfluency detection robust to recognition errors and provides non-verbal clues. Moreover, joint modeling yields low-latency, lightweight inference. We investigate two joint model variants for streaming disfluency detection: a transcript-enriched model and a multi-task model. The transcript-enriched model is trained on text annotated with special tags that mark the starting and ending points of the disfluent part. However, these additional disfluency tags cause problems with latency and with standard language-model adaptation. To solve these problems, we propose a multi-task model that has two output layers on the Transformer decoder: one for speech recognition and the other for disfluency detection. The disfluency detection layer is conditioned on the currently recognized token through an additional token-dependency mechanism. We show that the proposed joint models outperform a BERT-based pipeline approach in both accuracy and latency, on both Switchboard and the Corpus of Spontaneous Japanese.
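A minimal sketch of how such a token-conditioned multi-task decoder step might look. All shapes, the three-way tag set, the greedy token decision, and conditioning by concatenating a token embedding are illustrative assumptions; the paper's actual token-dependency mechanism may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, HID, EMB, TAGS = 100, 16, 8, 3  # TAGS: e.g. outside / start / end (assumed)

# Randomly initialized parameters, standing in for trained weights.
W_asr = rng.standard_normal((VOCAB, HID)) * 0.1        # ASR output head
token_emb = rng.standard_normal((VOCAB, EMB)) * 0.1    # token embedding table
W_dis = rng.standard_normal((TAGS, HID + EMB)) * 0.1   # disfluency output head

def joint_step(h):
    """One decoder step with two output layers.

    h: (HID,) decoder hidden state for the current step.
    Returns (token_id, tag_id). The disfluency tag is predicted from the
    hidden state concatenated with the embedding of the token just
    recognized, i.e. it is conditioned on the current ASR decision
    (a sketch of the token-dependency idea).
    """
    asr_logits = W_asr @ h
    token_id = int(np.argmax(asr_logits))                 # greedy ASR decision
    feat = np.concatenate([h, token_emb[token_id]])       # condition on token
    tag_id = int(np.argmax(W_dis @ feat))                 # disfluency decision
    return token_id, tag_id

token_id, tag_id = joint_step(np.ones(HID))
```

Because both decisions come from a single decoder pass, recognition and disfluency tagging can be emitted together per step, which is what enables streaming, low-latency joint inference.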

