Learning Filterbanks from Raw Speech for Phone Recognition

11/03/2017
by Neil Zeghidour, et al.

We train a bank of complex filters that operates on the raw waveform and feeds into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks (MFSC, for mel-frequency spectral coefficients), and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable MFSC. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
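
The pipeline described above (complex filters on the waveform, followed by averaging) can be illustrated with a minimal NumPy sketch. It covers only the initialization and the forward pass, not the joint training, and it is not the authors' implementation. The function names `gabor_bank` and `td_filterbank` are hypothetical. The numeric settings are also assumptions, not values stated in the abstract: 40 filters, 400-sample filters at 16 kHz, a stride of 160 samples, a squared Hann averaging window, the bandwidth heuristic, and the `+ 1` inside the logarithm.

```python
import numpy as np


def hz_to_mel(f):
    """Convert frequency in Hz to the mel scale (common 2595*log10 formula)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def gabor_bank(n_filters=40, width=400, sample_rate=16000):
    """Build complex Gabor filters whose centers follow the mel scale.

    Each filter is a Gaussian envelope times a complex exponential, so its
    spectrum is concentrated on positive frequencies (close to analytic).
    All default values are assumptions made for this sketch.
    """
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                            n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    centers = hz_edges[1:-1]                    # center frequencies (Hz)
    bandwidths = hz_edges[2:] - hz_edges[:-2]   # support of each mel triangle (Hz)

    # Heuristic (assumption): frequency std is a quarter of the triangle support.
    sigma_f = bandwidths / 4.0
    sigma_t = 1.0 / (2.0 * np.pi * sigma_f)     # matching time-domain std (s)

    t = (np.arange(width) - width // 2) / sample_rate
    envelope = np.exp(-t[None, :] ** 2 / (2.0 * sigma_t[:, None] ** 2))
    carrier = np.exp(2j * np.pi * centers[:, None] * t[None, :])
    bank = envelope * carrier
    bank /= np.linalg.norm(bank, axis=1, keepdims=True)  # unit L2 norm per filter
    return bank, centers


def td_filterbank(signal, bank, pool_width=400, stride=160):
    """Forward pass: complex convolution, squared modulus, averaging, log."""
    window = np.hanning(pool_width) ** 2
    window /= window.sum()                      # averaging (lowpass) window
    features = []
    for filt in bank:
        response = np.convolve(signal, filt, mode="valid")   # complex output
        energy = np.abs(response) ** 2                       # squared modulus
        pooled = np.convolve(energy, window, mode="valid")[::stride]
        features.append(np.log(1.0 + pooled))   # +1 keeps the log finite (assumption)
    return np.stack(features)                   # shape: (n_filters, n_frames)


if __name__ == "__main__":
    bank, centers = gabor_bank()
    time = np.arange(4000) / 16000.0
    tone = np.cos(2.0 * np.pi * centers[15] * time)   # tone at one center frequency
    feats = td_filterbank(tone, bank)
    print(feats.shape[0], int(np.argmax(feats.mean(axis=1))))
```

In a trainable version, `bank` (and, for the best-performing setup described above, the pre-emphasis and averaging steps as well) would be parameters of convolutional layers updated by backpropagation together with the rest of the network. Here they stay fixed at their initial values.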
