Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

04/04/2019
by Awni Hannun, et al.

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient than a strong RNN baseline. Key to our approach is a time-depth separable convolution block which dramatically reduces the number of parameters in the model while keeping the receptive field large. We also give a stable and efficient beam search inference procedure which allows us to effectively integrate a language model. Coupled with a convolutional language model, our time-depth separable convolution architecture improves by more than 22% relative WER over the best previously reported sequence-to-sequence results on the noisy LibriSpeech test set.
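The parameter saving claimed for the time-depth separable block can be illustrated with some rough arithmetic. The sketch below is not taken from the abstract. It assumes the input is viewed as `w` feature bands with `c` channels each, that the block applies a `k x 1` convolution over time across the `c` channels, and that this is followed by two pointwise (fully connected) layers over all `w * c` features. The function names and the values `w = 80`, `c = 10`, `k = 21` are hypothetical, and biases and normalization are ignored.

```python
def full_conv1d_params(w: int, c: int, k: int) -> int:
    """Weights in a standard 1D convolution over time that mixes all
    w * c features at every one of its k kernel taps."""
    features = w * c
    return k * features * features


def tds_block_params(w: int, c: int, k: int) -> int:
    """Weights in an assumed time-depth separable block:
    a k x 1 convolution over time mixing only the c channels,
    then two pointwise layers mixing all w * c features."""
    time_conv = k * c * c              # temporal mixing, kernel size k
    pointwise = 2 * (w * c) ** 2       # feature mixing, kernel size 1
    return time_conv + pointwise


# Illustrative sizes only; these are assumptions, not reported settings.
w, c, k = 80, 10, 21
full = full_conv1d_params(w, c, k)
tds = tds_block_params(w, c, k)
print(f"standard 1D conv weights: {full:,}")
print(f"TDS-style block weights:  {tds:,}")
print(f"ratio: {full / tds:.1f}x")
```

Both layers span the same `k` time steps, so the receptive field per layer is unchanged. Under these assumptions the saving comes from paying the factor `k` only on the small `c x c` temporal mixing rather than on the full `(w * c) x (w * c)` feature mixing.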


research
12/06/2017

An analysis of incorporating an external language model into a sequence-to-sequence model

Attention-based sequence-to-sequence models for automatic speech recogni...
research
07/22/2018

Multi-scale Alignment and Contextual History for Attention Mechanism in Sequence-to-sequence Model

A sequence-to-sequence model is a neural network module for mapping two ...
research
07/13/2019

Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

Integrating an external language model into a sequence-to-sequence speec...
research
06/09/2017

Depthwise Separable Convolutions for Neural Machine Translation

Depthwise separable convolutions reduce the number of parameters and com...
research
12/30/2020

Can Sequence-to-Sequence Models Crack Substitution Ciphers?

Decipherment of historical ciphers is a challenging problem. The languag...
research
05/17/2022

Utterance Weighted Multi-Dilation Temporal Convolutional Networks for Monaural Speech Dereverberation

Speech dereverberation is an important stage in many speech technology a...
research
11/28/2016

Dense Prediction on Sequences with Time-Dilated Convolutions for Speech Recognition

In computer vision pixelwise dense prediction is the task of predicting ...
