Multi-Modal Data Augmentation for End-to-end ASR

03/27/2018
by Adithya Renduchintala, et al.

We present a new end-to-end architecture for automatic speech recognition (ASR) that can be trained on symbolic input in addition to the traditional acoustic input. The architecture uses two separate encoders, one for acoustic input and one for symbolic input, which share the attention and decoder parameters. We call this architecture a multi-modal data augmentation network (MMDA), as it supports multi-modal (acoustic and symbolic) input. The MMDA architecture attempts to eliminate the need for an external language model (LM) by enabling seamless mixing of large text datasets with significantly smaller transcribed speech corpora during training. We study different ways of transforming large text corpora into a symbolic form suitable for training our MMDA network. Our best MMDA setup obtains small improvements in character error rate (CER) and achieves an 8-10% relative improvement in word error rate (WER) on the WSJ dataset.
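
The routing described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoders are placeholders, the "attention" is a plain average, and every name here (`acoustic_encoder`, `symbolic_encoder`, `SharedAttentionDecoder`, `train_epoch`) and every data value is hypothetical. It only shows the control flow, assuming that speech and text examples are mixed into one training stream, each example goes through its own modality-specific encoder, and a single attention/decoder component serves both.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Example = Tuple[str, Sequence, str]  # (modality, inputs, transcript)


def acoustic_encoder(frames: Sequence[Vector]) -> List[Vector]:
    # Placeholder for the acoustic encoder: copies the feature frames.
    return [list(frame) for frame in frames]


def symbolic_encoder(symbols: Sequence[int], dim: int = 2) -> List[Vector]:
    # Placeholder for the symbolic encoder: a toy embedding of each symbol id.
    return [[float(s)] * dim for s in symbols]


ENCODERS = {"acoustic": acoustic_encoder, "symbolic": symbolic_encoder}


class SharedAttentionDecoder:
    """Stand-in for the attention and decoder parameters shared by both encoders."""

    def __init__(self) -> None:
        self.calls: Dict[str, int] = {"acoustic": 0, "symbolic": 0}

    def attend(self, states: List[Vector]) -> Vector:
        # Uniform attention: average the encoder states into one context vector.
        return [sum(column) / len(states) for column in zip(*states)]

    def train_step(self, modality: str, states: List[Vector], transcript: str) -> Vector:
        # A real model would compute a loss against `transcript` here.
        self.calls[modality] += 1
        return self.attend(states)


def train_epoch(speech: List[Example], text: List[Example], seed: int = 0) -> SharedAttentionDecoder:
    # Mix both data sources into one shuffled stream of training examples.
    examples = speech + text
    random.Random(seed).shuffle(examples)
    decoder = SharedAttentionDecoder()
    for modality, inputs, transcript in examples:
        states = ENCODERS[modality](inputs)  # route to the modality-specific encoder
        decoder.train_step(modality, states, transcript)
    return decoder


# Hypothetical toy data: one transcribed utterance and two text-only examples.
speech_data = [("acoustic", [[0.1, 0.3], [0.5, 0.7]], "hello")]
text_data = [("symbolic", [3, 5], "hello"), ("symbolic", [2, 4, 6], "world")]
decoder = train_epoch(speech_data, text_data)
print(decoder.calls)
```

In this sketch the same `SharedAttentionDecoder` instance is updated by both modalities, which is the property that lets text-only data influence the decoder without an external LM.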

research
12/19/2019

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

Recent advances in text-to-speech (TTS) led to the development of flexib...
research
05/11/2023

Masked Audio Text Encoders are Effective Multi-Modal Rescorers

Masked Language Models (MLMs) have proven to be effective for second-pas...
research
01/30/2020

Oral Billiards

We propose a physical model of speech to explain its precision and robus...
research
07/28/2018

Back-Translation-Style Data Augmentation for End-to-End ASR

In this paper we propose a novel data augmentation method for attention-...
research
09/16/2023

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

Collecting audio-text pairs is expensive; however, it is much easier to ...
research
06/01/2021

Multi-modal Point-of-Care Diagnostics for COVID-19 Based On Acoustics and Symptoms

The research direction of identifying acoustic bio-markers of respirator...
research
02/05/2021

Two-Stage Augmentation and Adaptive CTC Fusion for Improved Robustness of Multi-Stream End-to-End ASR

Performance degradation of an Automatic Speech Recognition (ASR) system ...
