End-to-End Speech Recognition From the Raw Waveform

06/19/2018
by   Neil Zeghidour, et al.

State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015), and the second one by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to the low-pass filter used in these approaches. These modifications consistently improve performance for both approaches, and remove the need for a careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relative to comparable mel-filterbanks. This is the first time end-to-end models trained from the raw signal have significantly outperformed mel-filterbanks on a large vocabulary task under clean recording conditions.
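To make the instance normalization step mentioned in the abstract more concrete, here is a minimal NumPy sketch of the general technique: each channel of a single utterance is normalized to zero mean and unit variance over time. The function name `instance_norm`, the `(channels, frames)` layout, the `eps` value, and the random input are all assumptions made for illustration. The abstract does not specify where the layer sits in the architecture or how it is parameterized, so this should not be read as the authors' implementation.

```python
import numpy as np


def instance_norm(features, eps=1e-5):
    """Normalize each channel of one utterance over the time axis.

    features: array of shape (channels, frames), assumed layout.
    eps: small constant for numerical stability (assumed value).
    """
    # Statistics are computed per utterance and per channel,
    # not across a batch or a whole dataset.
    mean = features.mean(axis=1, keepdims=True)
    var = features.var(axis=1, keepdims=True)
    return (features - mean) / np.sqrt(var + eps)


# Hypothetical stand-in for filterbank outputs: 40 channels, 200 frames.
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(40, 200))
normed = instance_norm(feats)
```

Because the statistics come from the utterance itself, the same computation applies at training and test time without any stored running averages.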


research
09/11/2016

Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

This paper presents a simple end-to-end model for speech recognition, co...
research
12/17/2018

Fully Convolutional Speech Recognition

Current state-of-the-art speech recognition systems build on recurrent n...
research
11/03/2017

Learning Filterbanks from Raw Speech for Phone Recognition

We train a bank of complex filters that operates on the raw waveform and...
research
04/03/2019

End-to-end Binaural Sound Localisation from the Raw Waveform

A novel end-to-end binaural sound localisation approach is proposed whic...
research
07/26/2020

End-to-end spoofing detection with raw waveform CLDNNs

Albeit recent progress in speaker verification generates powerful models...
research
07/19/2018

ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech

In this work, we propose an alternative solution for parallel wave gener...
research
06/27/2022

Detection of Doctored Speech: Towards an End-to-End Parametric Learn-able Filter Approach

The Automatic Speaker Verification systems have potential in biometrics ...
