Multi-Span Acoustic Modelling using Raw Waveform Signals

06/21/2019
by Patrick von Platen, et al.

Traditional automatic speech recognition (ASR) systems often use an acoustic model (AM) built on handcrafted acoustic features, such as log Mel-filter bank (FBANK) values. Recent studies found that AMs with convolutional neural networks (CNNs) can take the raw waveform signal directly as input. Given sufficient training data, these AMs can yield a word error rate (WER) competitive with that of AMs built on FBANK features. This paper proposes a novel multi-span structure for acoustic modelling based on the raw waveform, with multiple streams of CNN input layers, each processing a different span of the raw waveform signal. Evaluation on both the single-channel CHiME4 and AMI data sets shows that multi-span AMs give a lower WER than FBANK AMs by an average of about 5%, and that these AMs can learn filters that are rather different from the log Mel filters. Furthermore, the paper shows that a widely used single-span raw waveform AM can be improved by using a smaller CNN kernel size and an increased stride, yielding lower WERs.
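The multi-span idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the span lengths, kernel sizes, strides, filter counts, the rectify-and-max-pool step, and all function names (`conv1d`, `centred_span`, `multi_span_features`) are assumptions chosen only to show several CNN input streams, each seeing a different span of the same raw waveform, with their outputs concatenated.

```python
import random


def conv1d(signal, kernel, stride):
    """Valid 1-D cross-correlation of a list of samples with one kernel."""
    k = len(kernel)
    return [
        sum(signal[start + i] * kernel[i] for i in range(k))
        for start in range(0, len(signal) - k + 1, stride)
    ]


def centred_span(waveform, centre, span):
    """Return `span` samples around index `centre`, zero-padded at the edges."""
    half = span // 2
    return [
        waveform[i] if 0 <= i < len(waveform) else 0.0
        for i in range(centre - half, centre - half + span)
    ]


def multi_span_features(waveform, centre, streams):
    """Run one input stream per span and concatenate the pooled outputs.

    `streams` is a list of (span, kernels, stride) tuples; every value in it
    is a hypothetical setting, not one taken from the paper.
    """
    features = []
    for span, kernels, stride in streams:
        window = centred_span(waveform, centre, span)
        for kernel in kernels:
            activations = conv1d(window, kernel, stride)
            # Rectify, then max-pool over time: one value per filter.
            features.append(max(max(a, 0.0) for a in activations))
    return features


rng = random.Random(0)
# One second of synthetic noise standing in for 16 kHz audio (assumption).
waveform = [rng.uniform(-1.0, 1.0) for _ in range(16000)]


def random_kernels(n, size):
    """Untrained stand-ins for learned CNN filters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(size)] for _ in range(n)]


streams = [
    (400, random_kernels(4, 25), 5),    # shorter span, smaller kernel
    (800, random_kernels(4, 50), 10),   # longer span, larger kernel
]
features = multi_span_features(waveform, centre=8000, streams=streams)
print(len(features))  # 8: two streams with four filters each
```

In a trained model the kernels would be learned jointly with the rest of the network, and the concatenated stream outputs would feed the subsequent layers of the AM; the sketch only shows how different spans of one signal can be processed in parallel.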

research
10/01/2016

Very Deep Convolutional Neural Networks for Raw Waveforms

Learning acoustic models directly from the raw waveform data with minima...
research
09/30/2019

Acoustic Model Adaptation from Raw Waveforms with SincNet

Raw waveform acoustic modelling has recently gained interest due to neur...
research
02/11/2020

CGCNN: Complex Gabor Convolutional Neural Network on raw speech

Convolutional Neural Networks (CNN) have been used in Automatic Speech R...
research
03/04/2021

End-to-End Mispronunciation Detection and Diagnosis From Raw Waveforms

Mispronunciation detection and diagnosis (MDD) is designed to identify p...
research
10/08/2021

A study of the robustness of raw waveform based speaker embeddings under mismatched conditions

In this paper, we conduct a cross-dataset study on parametric and non-pa...
research
08/08/2023

Comparative Analysis of the wav2vec 2.0 Feature Extractor

Automatic speech recognition (ASR) systems typically use handcrafted fea...
research
09/06/2021

Complementing Handcrafted Features with Raw Waveform Using a Light-weight Auxiliary Model

An emerging trend in audio processing is capturing low-level speech repr...
