Super-Human Performance in Online Low-latency Recognition of Conversational Speech

10/07/2020
by Thai-Son Nguyen, et al.

Achieving super-human performance in recognizing human speech has been a goal for several decades, as researchers have worked on increasingly challenging tasks. In the 1990s it was discovered that conversational speech between two humans is considerably more difficult than read speech, as hesitations, disfluencies, false starts and sloppy articulation complicate acoustic processing and require robust joint handling of acoustic, lexical and language context. Early attempts, even with statistical models, could only reach error rates in excess of 50%, far from human performance (a word error rate, WER, of around 5.5%). Neural models have considerably improved performance, as such context can now be learned in an integral fashion. However, processing such contexts requires presentation of an entire utterance and thus introduces unwanted delays before a recognition result can be output. In this paper, we address performance as well as latency. We present results for a system that is able to achieve super-human performance (at a WER of 5.0% on the Switchboard conversational benchmark) at a latency of only 1 second behind a speaker's speech. The system uses attention-based encoder-decoder networks, but can also be configured to use ensembles with Transformer-based models at low latency.
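For readers unfamiliar with the metric, the WER figures quoted in the abstract count word-level substitutions, deletions and insertions relative to the number of words in a reference transcript. The following is a minimal sketch of that computation, assuming plain whitespace tokenization and no text normalization. The function name `word_error_rate` is illustrative and does not come from the paper, and benchmark scoring is normally done with dedicated tools that apply their own normalization rules.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real benchmark scoring usually normalizes the text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a WER of 5.0% corresponds to roughly one word error for every twenty reference words.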

research
10/27/2020

Transformer in action: a comparative study of transformer-based acoustic models for large scale speech recognition applications

In this paper, we summarize the application of transformer and its strea...
research
03/06/2017

English Conversational Telephone Speech Recognition by Humans and Machines

One of the most difficult speech recognition tasks is accurate recogniti...
research
05/22/2020

Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection

Encoder-decoder models provide a generic architecture for sequence-to-se...
research
10/21/2020

Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

This paper proposes an efficient memory transformer Emformer for low lat...
research
11/16/2022

Streaming Joint Speech Recognition and Disfluency Detection

Disfluency detection has mainly been solved in a pipeline approach, as p...
research
11/22/2021

Human-Machine Interaction Speech Corpus from the ROBIN project

This paper introduces a new Romanian speech corpus from the ROBIN projec...
research
04/15/2018

Twin Regularization for online speech recognition

Online speech recognition is crucial for developing natural human-machin...