Character-Aware Attention-Based End-to-End Speech Recognition

01/06/2020
by Zhong Meng, et al.

Predicting words and subword units (WSUs) as the output has been shown to be effective for the attention-based encoder-decoder (AED) model in end-to-end speech recognition. However, as one input to the decoder recurrent neural network (RNN), each WSU embedding is learned independently, from context and acoustic information, in a purely data-driven fashion. Little effort has been made to explicitly model the morphological relationships among WSUs. In this work, we propose a novel character-aware (CA) AED model in which each WSU embedding is computed by summarizing the embeddings of its constituent characters using a CA-RNN. This WSU-independent CA-RNN is jointly trained with the encoder, the decoder and the attention network of a conventional AED to predict WSUs. With CA-AED, the embeddings of morphologically similar WSUs are naturally and directly correlated through the CA-RNN, in addition to the semantic and acoustic relations modeled by a traditional AED. Moreover, CA-AED significantly reduces the number of model parameters in a traditional AED by replacing the large pool of WSU embeddings with a much smaller set of character embeddings. On a 3400-hour Microsoft Cortana dataset, CA-AED achieves up to 11.9% relative word error rate improvement over a strong AED baseline while using significantly fewer model parameters.
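To make the character-aware embedding idea concrete, the sketch below shows one way such a layer could be implemented: a shared character embedding table plus a small RNN whose final hidden state serves as the WSU embedding fed to the AED decoder. This is a minimal illustration assuming PyTorch; the class name CharAwareWSUEmbedding, the choice of a GRU for the CA-RNN, and all dimensions are assumptions for illustration, not details taken from the paper.

# Minimal sketch of a character-aware (CA) WSU embedding layer (assumes PyTorch).
# Class and variable names are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn


class CharAwareWSUEmbedding(nn.Module):
    """Computes a WSU embedding by running an RNN over the WSU's constituent
    characters and taking the final hidden state, instead of looking up a
    separately learned per-WSU vector."""

    def __init__(self, num_chars: int, char_dim: int, wsu_dim: int):
        super().__init__()
        # Character embedding table shared by all WSUs (index 0 reserved for padding).
        self.char_embedding = nn.Embedding(num_chars, char_dim, padding_idx=0)
        # A single-layer GRU summarizes the character sequence of one WSU.
        self.ca_rnn = nn.GRU(char_dim, wsu_dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor, char_lengths: torch.Tensor) -> torch.Tensor:
        # char_ids:     (num_wsus, max_chars) padded character indices per WSU
        # char_lengths: (num_wsus,) true number of characters per WSU
        chars = self.char_embedding(char_ids)
        packed = nn.utils.rnn.pack_padded_sequence(
            chars, char_lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, last_hidden = self.ca_rnn(packed)   # (1, num_wsus, wsu_dim)
        return last_hidden.squeeze(0)          # (num_wsus, wsu_dim)


# Toy usage: the resulting matrix would replace the decoder's WSU embedding table
# and be trained jointly with the encoder, decoder and attention of the AED model.
if __name__ == "__main__":
    layer = CharAwareWSUEmbedding(num_chars=30, char_dim=16, wsu_dim=32)
    char_ids = torch.tensor([[3, 5, 7, 0], [2, 9, 4, 6]])  # two padded toy WSUs
    char_lengths = torch.tensor([3, 4])
    wsu_embeddings = layer(char_ids, char_lengths)
    print(wsu_embeddings.shape)  # torch.Size([2, 32])

Because the character embedding table and the CA-RNN weights are shared across all WSUs, morphologically similar WSUs (e.g. sharing a stem or affix) receive correlated embeddings by construction, which is the property the abstract describes.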

Related research
05/10/2018

A comparable study of modeling units for end-to-end Mandarin speech recognition

End-to-end speech recognition has become increasingly popular in Mandar...
05/18/2020

Attention-based Transducer for Online Speech Recognition

Recent studies reveal the potential of recurrent neural network transduc...
09/15/2021

Tied Reduced RNN-T Decoder

Previous works on the Recurrent Neural Network-Transducer (RNN-T) models...
03/01/2022

Parameter estimation for WMTI-Watson model of white matter using encoder-decoder recurrent neural network

Biophysical modelling of the diffusion MRI signal provides estimates of ...
08/05/2015

Listen, Attend and Spell

We present Listen, Attend and Spell (LAS), a neural network that learns ...
09/15/2023

Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition

We study a streamable attention-based encoder-decoder model in which eit...
07/02/2019

Learning to Reformulate the Queries on the WEB

The inability of naive users to formulate appropriate queries is a funda...