Augmenting conformers with structured state space models for online speech recognition

09/15/2023
by   Haozhe Shan, et al.

Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provides a parameter-efficient way of accessing arbitrarily long left context. We perform systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We find that the most effective design is to stack a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. Our best model achieves WERs of 4.01, outperforming Conformers with extensively tuned convolution.
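
To make the idea of stacking a real-valued recurrence with a local convolution concrete, here is a minimal sketch in plain Python. It is not the paper's S4 layer: it omits the structured parameterization and discretization that S4 uses, works on a single scalar feature, and assumes the recurrence runs before the convolution. The function names, the parameters `a`, `b`, `c`, and the kernel values are all hypothetical and chosen only for illustration. The recurrence carries information from arbitrarily far back in the input, while the convolution mixes only the few most recent frames, and neither looks to the right.

```python
from typing import List, Sequence


def diagonal_ssm(u: Sequence[float], a: Sequence[float],
                 b: Sequence[float], c: Sequence[float]) -> List[float]:
    """Real-valued diagonal linear recurrence (illustrative, not full S4).

    x_t = a * x_{t-1} + b * u_t   (elementwise over state dimensions)
    y_t = sum(c * x_t)
    """
    state = [0.0] * len(a)
    outputs = []
    for u_t in u:
        # Each state dimension decays by its own real weight a_i.
        state = [a_i * s_i + b_i * u_t for a_i, s_i, b_i in zip(a, state, b)]
        outputs.append(sum(c_i * s_i for c_i, s_i in zip(c, state)))
    return outputs


def causal_conv(y: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Local convolution over the current and previous frames only.

    kernel[0] weights the current frame, kernel[k] the frame k steps back.
    Frames before the start of the sequence are treated as zero.
    """
    out = []
    for t in range(len(y)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * y[t - k]
        out.append(acc)
    return out


def ssm_then_conv(u: Sequence[float], a: Sequence[float], b: Sequence[float],
                  c: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Stack the recurrence and the local convolution (assumed ordering)."""
    return causal_conv(diagonal_ssm(u, a, b, c), kernel)


if __name__ == "__main__":
    # An impulse at t=0 keeps influencing later outputs through the recurrence.
    print(ssm_then_conv([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0],
                        kernel=[1.0, 1.0]))
```

Because both stages depend only on the current and earlier frames, the output at time t can be emitted as soon as frame t arrives, which is the property online recognition needs.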

research
05/16/2020

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) based models h...
research
10/31/2022

Structured State Space Decoder for Speech Recognition and Synthesis

Automatic speech recognition (ASR) systems developed in recent years hav...
research
11/21/2018

Speech recognition with quaternion neural networks

Neural network architectures are at the core of powerful automatic speec...
research
02/18/2021

Echo State Speech Recognition

We propose automatic speech recognition (ASR) models inspired by echo st...
research
07/13/2023

Leveraging Pretrained ASR Encoders for Effective and Efficient End-to-End Speech Intent Classification and Slot Filling

We study speech intent classification and slot filling (SICSF) by propos...
research
01/27/2020

Scaling Up Online Speech Recognition Using ConvNets

We design an online end-to-end speech recognition system based on Time-D...
research
03/11/2023

Transcription free filler word detection with Neural semi-CRFs

Non-linguistic filler words, such as "uh" or "um", are prevalent in spon...
