Super-Human Performance in Online Low-latency Recognition of Conversational Speech

10/07/2020

Achieving super-human performance in recognizing human speech has been a goal for several decades, as researchers have worked on increasingly challenging tasks. In the 1990s it was discovered that conversational speech between two humans is considerably more difficult to recognize than read speech, as hesitations, disfluencies, false starts and sloppy articulation complicate acoustic processing and require robust joint handling of acoustic, lexical and language context. Early attempts, even with statistical models, could only reach error rates in excess of 50%, far from human performance (a word error rate, WER, of around 5.5%). Neural hybrid models and recent attention-based encoder-decoder models have considerably improved performance, as such context can now be learned in an integral fashion. However, processing such context requires presentation of an entire utterance and thus introduces unwanted delays before a recognition result can be output. In this paper, we address performance as well as latency. We present results for a system that is able to achieve super-human performance (a WER of 5.0% on the Switchboard conversational benchmark) at a latency of only 1 second behind a speaker's speech. The system uses attention-based encoder-decoder networks, but can also be configured to use ensembles with Transformer-based models at low latency.
