Enhancing the Unified Streaming and Non-streaming Model with Contrastive Learning

06/01/2023
by Yuting Yang, et al.

The unified streaming and non-streaming speech recognition model has achieved great success due to its comprehensive capabilities. In this paper, we propose to improve the accuracy of the unified model by bridging the inherent representation gap between the streaming and non-streaming modes with a contrastive objective. Specifically, the top-layer hidden representations at the same frame in the streaming and non-streaming modes are regarded as a positive pair, which encourages the streaming representation to be close to its non-streaming counterpart. Multiple negative samples are randomly selected from the remaining frames of the same utterance under the non-streaming mode. Experimental results demonstrate that the proposed method yields consistent improvements to the unified model in both streaming and non-streaming modes. Our method achieves a CER of 4.66 in the non-streaming mode, which sets a new state of the art on the AISHELL-1 benchmark.
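To make the pairing scheme concrete, the following is a minimal sketch of a frame-level contrastive loss for a single utterance. The abstract does not give the exact loss, so the InfoNCE-style form, the cosine similarity, the `temperature` and `num_negatives` values, and the function name `frame_contrastive_loss` are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def frame_contrastive_loss(streaming, non_streaming,
                           num_negatives=4, temperature=0.1, seed=0):
    """InfoNCE-style loss over the frames of one utterance (illustrative only).

    streaming, non_streaming: arrays of shape (T, D) holding top-layer hidden
    states of the same utterance in the two modes.
    num_negatives and temperature are hypothetical hyperparameters.
    """
    rng = np.random.default_rng(seed)
    num_frames = streaming.shape[0]
    if num_frames < 2:
        raise ValueError("need at least two frames to draw a negative")
    k = min(num_negatives, num_frames - 1)

    def normalize(x):
        # L2-normalize so that dot products are cosine similarities (assumption).
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    s = normalize(np.asarray(streaming, dtype=np.float64))
    n = normalize(np.asarray(non_streaming, dtype=np.float64))

    losses = []
    for t in range(num_frames):
        # Negatives: random non-streaming frames other than t, same utterance.
        others = np.delete(np.arange(num_frames), t)
        neg_idx = rng.choice(others, size=k, replace=False)
        # Index 0 is the positive: the non-streaming state at the same frame.
        idx = np.concatenate(([t], neg_idx))
        logits = n[idx] @ s[t] / temperature
        logits = logits - logits.max()  # numerically stable log-softmax
        log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
        losses.append(-log_prob_pos)
    return float(np.mean(losses))
```

In a training setup, a term like this would presumably be added to the usual recognition loss with some weight, and the non-streaming side might be treated as a fixed target; the abstract does not say, so both points are left open here.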


research
09/15/2021

UniST: Unified End-to-end Model for Streaming and Non-streaming Speech Translation

This paper presents a unified end-to-end framework for both streaming a...
research
05/14/2020

Streaming keyword spotting on mobile devices

In this work we explore the latency and accuracy of keyword spotting (KW...
research
10/27/2020

Cascaded encoders for unifying streaming and non-streaming ASR

End-to-end (E2E) automatic speech recognition (ASR) models, by now, have...
research
07/20/2023

Globally Normalising the Transducer for Streaming Speech Recognition

The Transducer (e.g. RNN-Transducer or Conformer-Transducer) generates a...
research
04/18/2023

Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR

Recently, there has been an increasing interest in unifying streaming an...
research
06/27/2023

Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Transducer is one of the mainstream frameworks for streaming speech reco...
research
12/03/2020

AugSplicing: Synchronized Behavior Detection in Streaming Tensors

How can we track synchronized behavior in a stream of time-stamped tuple...
