Convolutional Speech Recognition with Pitch and Voice Quality Features

09/02/2020
by Guillermo Cámbara, et al.

This work studies the effects of adding pitch and voice quality features, such as jitter and shimmer, to a state-of-the-art CNN model for Automatic Speech Recognition (ASR). Pitch features have previously been used to improve classical HMM and DNN baselines, while jitter and shimmer parameters have proven useful for tasks like speaker or emotion recognition. To the best of our knowledge, this is the first work combining such pitch and voice quality features with modern convolutional architectures, showing improvements of up to 2 absolute WER points on the publicly available Spanish Common Voice dataset. In particular, our work combines these features with mel-frequency spectral coefficients (MFSCs) to train a convolutional architecture with Gated Linear Units (Conv GLUs). Such models have been shown to yield low word error rates while being well suited to parallel processing in online streaming recognition use cases. We have added pitch and voice quality functionality to Facebook's wav2letter speech recognition framework, and we provide the code and recipes to the community so that further experiments can be carried out. In addition, to the best of our knowledge, our Spanish Common Voice recipe is the first public Spanish recipe for wav2letter.
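
For readers unfamiliar with the voice quality features named in the abstract, the following is a minimal sketch of one common definition of jitter and shimmer: the mean absolute difference between consecutive glottal periods (jitter) or peak amplitudes (shimmer), divided by their mean. The abstract does not say which jitter and shimmer variants the authors use or how they extract them, so the function names and the choice of the "local" relative variant are assumptions made for illustration only, not the wav2letter implementation.

```python
def relative_perturbation(values):
    """Mean absolute difference between consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return mean_diff / mean_value


def local_jitter(periods):
    """Cycle-to-cycle variation of the pitch period (hypothetical helper)."""
    return relative_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (hypothetical helper)."""
    return relative_perturbation(amplitudes)


if __name__ == "__main__":
    # Illustrative values only: glottal periods in seconds, peak amplitudes in arbitrary units.
    periods = [0.0100, 0.0102, 0.0099, 0.0101]
    amplitudes = [0.80, 0.78, 0.82, 0.79]
    print("jitter:", local_jitter(periods))
    print("shimmer:", local_shimmer(amplitudes))
```

A perfectly periodic voice with constant amplitude gives zero for both measures; larger values indicate more cycle-to-cycle irregularity.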

research
12/21/2021

Voice Quality and Pitch Features in Transformer-Based Speech Recognition

Jitter and shimmer measurements have shown to be carriers of voice quali...
research
05/10/2021

What shall we do with an hour of data? Speech recognition for the un- and under-served languages of Common Voice

This technical report describes the methods and results of a three-week ...
research
03/17/2022

Robust and Complex Approach of Pathological Speech Signal Analysis

This paper presents a study of the approaches in the state-of-the-art in...
research
06/18/2021

Analysis and Tuning of a Voice Assistant System for Dysfluent Speech

Dysfluencies and variations in speech pronunciation can severely degrade...
research
07/30/2021

The History of Speech Recognition to the Year 2030

The decade from 2010 to 2020 saw remarkable improvements in automatic sp...
research
02/20/2021

The Use of Voice Source Features for Sung Speech Recognition

In this paper, we ask whether vocal source features (pitch, shimmer, jit...
research
09/03/2020

Knowing What to Listen to: Early Attention for Deep Speech Representation Learning

Deep learning techniques have considerably improved speech processing in...
