Structured State Space Decoder for Speech Recognition and Synthesis

10/31/2022
by Koichi Miyazaki et al.

Automatic speech recognition (ASR) systems developed in recent years have shown promising results with self-attention models (e.g., Transformer and Conformer), which are replacing conventional recurrent neural networks. Meanwhile, the structured state space model (S4) has recently been proposed, producing promising results on various long-sequence modeling tasks, including raw speech classification. Like the Transformer, the S4 model can be trained in parallel. In this study, we apply S4 as a decoder for ASR and text-to-speech (TTS) tasks and compare it with the Transformer decoder. For the ASR task, our experimental results demonstrate that the proposed model achieves a competitive word error rate (WER) of 1.88 on the LibriSpeech test-clean/test-other sets and a character error rate (CER) of 3.80 on the CSJ eval1/eval2/eval3 sets. Furthermore, the proposed model is more robust than the standard Transformer model, particularly for long-form speech, on both datasets. For the TTS task, the proposed method outperforms the Transformer baseline.
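To make the parallel-training claim concrete, the sketch below implements a single diagonal state-space layer in NumPy. This is a simplification in the spirit of S4's diagonal variants, not the paper's implementation; all names and the initialization are illustrative assumptions. The same discretized system is evaluated two ways: as one FFT convolution over the whole sequence, which is what allows such a layer to be trained with Transformer-like parallelism, and as a step-by-step recurrence, which is how it can decode autoregressively at inference time.

    import numpy as np

    def discretize_zoh(A, B, dt):
        """Zero-order-hold discretization of a diagonal SSM.

        Continuous:  x'(t) = A x(t) + B u(t),  y(t) = C x(t)
        Discrete:    x_k = Abar x_{k-1} + Bbar u_k
        """
        Abar = np.exp(A * dt)
        Bbar = (Abar - 1.0) / A * B
        return Abar, Bbar

    def ssm_kernel(Abar, Bbar, C, L):
        """Convolution kernel K[k] = sum_n C_n Abar_n^k Bbar_n, k = 0..L-1."""
        powers = Abar[:, None] ** np.arange(L)[None, :]  # (N, L) Vandermonde
        return np.sum((C * Bbar)[:, None] * powers, axis=0).real

    def run_parallel(u, K):
        """Training-time view: one FFT convolution over the whole sequence,
        giving the same sequence-level parallelism as a Transformer."""
        n = len(u) + len(K)
        y = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(K, n), n)
        return y[: len(u)]

    def run_recurrent(u, Abar, Bbar, C):
        """Inference-time view: O(1) state update per step, like an RNN."""
        x = np.zeros_like(Abar)
        y = np.empty(len(u))
        for k, u_k in enumerate(u):
            x = Abar * x + Bbar * u_k
            y[k] = np.sum(C * x).real
        return y

    # Toy check: both views compute the same sequence-to-sequence map.
    N, L, dt = 16, 128, 1.0 / 128
    rng = np.random.default_rng(0)
    A = -0.5 + 1j * np.pi * np.arange(N)  # simple diagonal init (S4D-style)
    B = np.ones(N, dtype=complex)
    C = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    u = rng.standard_normal(L)

    Abar, Bbar = discretize_zoh(A, B, dt)
    K = ssm_kernel(Abar, Bbar, C, L)
    assert np.allclose(run_parallel(u, K), run_recurrent(u, Abar, Bbar, C), atol=1e-6)

The closing assertion verifies that the convolutional and recurrent views of the layer agree, which is the property that lets an S4-style decoder train in parallel yet decode step by step.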


