From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings

04/10/2019
by Yi-Chen Chen, et al.

Producing a large amount of annotated speech data for training ASR systems remains difficult for more than 95% of languages, which are low-resourced. However, human babies start to learn a language from the sounds (or phonetic structures) of a small number of exemplar words, and "generalize" that knowledge to other words without hearing a large amount of data. We initiate some preliminary work in this direction. Audio Word2Vec is used to learn the phonetic structures from spoken words (signal segments), while another autoencoder is used to learn the phonetic structures from text words. The relationship between the two can be learned jointly, or separately after both are well trained. This relationship can then be used for speech recognition with very low resources. In initial experiments on the TIMIT dataset, only 2.1 hours of speech data (in which 2500 spoken words were annotated and the rest unlabeled) gave a word error rate of 44.6%, and this number can be reduced to 34.2% when more annotated words are given. These results are not satisfactory, but they are a good starting point.
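
The word error rates quoted above are conventionally computed as the word-level edit distance between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a minimal illustration of that standard metric, not the authors' evaluation code; the function name `word_error_rate` and the example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only (hypothetical name); it assumes transcripts
    are whitespace-tokenized and already normalized for case/punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.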

research
10/11/2022

Automatic Speech Recognition of Low-Resource Languages Based on Chukchi

The following paper presents a project focused on the research and creat...
research
04/08/2019

Completely Unsupervised Phoneme Recognition By A Generative Adversarial Network Harmonized With Iteratively Refined Hidden Markov Models

Producing a large annotated speech corpus for training ASR systems remai...
research
07/15/2018

Syllabification by Phone Categorization

Syllables play an important role in speech synthesis, speech recognition...
research
02/14/2017

A case study on using speech-to-translation alignments for language documentation

For many low-resource or endangered languages, spoken language resources...
research
11/12/2021

A Convolutional Neural Network Based Approach to Recognize Bangla Spoken Digits from Speech Signal

Speech recognition is a technique that converts human speech signals int...
research
11/12/2020

Enabling Interactive Transcription in an Indigenous Community

We propose a novel transcription workflow which combines spoken term det...
