Lightweight End-to-End Speech Recognition from Raw Audio Data Using Sinc-Convolutions

10/15/2020 ∙ by Ludwig Kürzinger, et al.

Many end-to-end Automatic Speech Recognition (ASR) systems still rely on pre-processed frequency-domain features that are handcrafted to emulate human hearing. Our work is motivated by recent advances in integrated learnable feature extraction. To this end, we propose Lightweight Sinc-Convolutions (LSC), which combine Sinc-convolutions with depthwise convolutions into a low-parameter, machine-learnable feature extraction front end for end-to-end ASR systems. We integrated LSC into the hybrid CTC/attention architecture for evaluation. The resulting end-to-end model shows smooth convergence behaviour that is further improved by applying SpecAugment in the time domain. We also discuss filter-level improvements, such as using log-compression as the activation function. Our model achieves a word error rate of 10.7% on its evaluation dataset, surpassing the corresponding architecture with log-mel filterbank features by an absolute 1.9%.
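
To make the idea of a Sinc-convolution concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the common parametrization in which each filter is a band-pass kernel defined only by a lower and an upper cutoff frequency, built as the difference of two windowed low-pass sinc filters. The function names `sinc_bandpass_kernel` and `log_compress`, the Hamming window, and the form log(|x| + 1) for log-compression are illustrative assumptions and are not taken from the abstract.

```python
import numpy as np


def sinc_bandpass_kernel(f_low, f_high, kernel_size, sample_rate):
    """Band-pass FIR kernel defined by two cutoff frequencies in Hz.

    Hypothetical helper: in a learnable front end, f_low and f_high would be
    the only trainable parameters of each filter.
    """
    # Symmetric time axis in seconds, centred on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    t = n / sample_rate
    # np.sinc(x) = sin(pi x) / (pi x), so 2 f sinc(2 f t) is an ideal
    # low-pass impulse response with cutoff f.
    low_pass_high = 2 * f_high * np.sinc(2 * f_high * t)
    low_pass_low = 2 * f_low * np.sinc(2 * f_low * t)
    # Difference of two low-pass filters gives a band-pass filter; the
    # Hamming window (an assumption) reduces ripple from truncation.
    kernel = (low_pass_high - low_pass_low) * np.hamming(kernel_size)
    # Divide by the sample rate so the pass-band gain is close to 1.
    return kernel / sample_rate


def log_compress(x):
    """Assumed form of log-compression used as an activation: log(|x| + 1)."""
    return np.log(np.abs(x) + 1.0)


# Usage: filter one second of raw audio with a 300-3000 Hz band-pass filter.
sample_rate = 16000
audio = np.random.default_rng(0).standard_normal(sample_rate)
kernel = sinc_bandpass_kernel(300.0, 3000.0, 101, sample_rate)
features = log_compress(np.convolve(audio, kernel, mode="same"))
```

Because each filter is described by two numbers rather than by every tap of its kernel, a bank of such filters needs far fewer parameters than an unconstrained convolution layer of the same width, which is the property a low-parameter front end relies on.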


