Spanish and English Phoneme Recognition by Training on Simulated Classroom Audio Recordings of Collaborative Learning Environments

02/21/2022
by Mario Esparza, et al.

Audio recordings of collaborative learning environments contain constant cross-talk and background noise, and these environments require dynamic speech recognition between Spanish and English. To eliminate the standard requirement of large-scale ground truth, the thesis develops a simulated dataset by transforming audio transcriptions into phonemes and using 3D speaker geometry and data augmentation to generate an acoustic simulation of Spanish and English speech. The thesis also develops a low-complexity neural network for recognizing Spanish and English phonemes (available at github.com/muelitas/keywordRec). When trained on 41 English phonemes, the network achieves a PER of 0.099 on Speech Commands. When trained on 36 Spanish phonemes and tested on real recordings of collaborative learning environments, it achieves an LER of 0.7208. This is slightly better than the 0.7272 LER of Google's Speech-to-text, which used anywhere from 15 to 1,635 times more parameters and was trained on 300 to 27,500 hours of real data, as opposed to 13 hours of simulated audio.
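The abstract reports its results as PER and LER without defining them. A minimal sketch follows, assuming these denote phoneme error rate and letter error rate in the usual sense: the Levenshtein edit distance between the reference and the predicted sequence, divided by the reference length. The function names and the example sequences are hypothetical illustrations, not taken from the thesis or its repository.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference tokens into an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance normalized by reference length (assumed PER/LER definition)."""
    if len(ref) == 0:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Phoneme-level example (hypothetical phoneme labels): one deleted phoneme.
per = error_rate(["HH", "AH", "L", "OW"], ["HH", "AH", "OW"])

# Letter-level example: the same function applied to characters.
ler = error_rate(list("hola"), list("ola"))
```

Under this definition both examples above give 0.25, since one of four reference tokens is missing from the prediction. Because insertions count as errors, the rate can exceed 1.0 when the prediction is much longer than the reference.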


research
12/26/2021

Bilingual Speech Recognition by Estimating Speaker Geometry from Video Data

Speech recognition is very challenging in student learning environments ...
research
01/05/2022

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Video recordings of speech contain correlated audio and visual informati...
research
10/27/2022

Masked Autoencoders Are Articulatory Learners

Articulatory recordings track the positions and motion of different arti...
research
10/20/2020

Replacing Human Audio with Synthetic Audio for On-device Unspoken Punctuation Prediction

We present a novel multi-modal unspoken punctuation prediction system fo...
research
04/05/2021

SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition

In the English speech-to-text (STT) machine learning task, acoustic mode...
research
06/06/2023

RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain

Despite recent advancements in speech recognition, there are still diffi...
research
04/15/2022

Automated speech tools for helping communities process restricted-access corpora for language revival efforts

Many archival recordings of speech from endangered languages remain unan...
