Almost-unsupervised Speech Recognition with Close-to-zero Resource Based on Phonetic Structures Learned from Very Small Unpaired Speech and Text Data

by Yi-Chen Chen, et al.

Producing a large amount of annotated speech data for training ASR systems remains difficult for more than 95% of languages all over the world, which are low-resourced. However, we note that human babies start to learn a language from the sounds of a small number of exemplar words, without hearing a large amount of data. In this paper we initiate some preliminary work in this direction. Audio Word2Vec is used to obtain embeddings of spoken words, which carry phonetic information extracted from the signals. An autoencoder is used to generate embeddings of text words based on the articulatory features of their phoneme sequences. The two sets of embeddings, for spoken and text words, describe similar phonetic structures among words in their respective latent spaces. A mapping from the audio embeddings to the text embeddings therefore amounts to word-level ASR. This mapping can be learned by aligning a small number of spoken words with the corresponding text words in the embedding spaces. In the initial experiments, only 200 annotated spoken words and one hour of speech data without annotation gave a word accuracy of 27.5%.
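
The alignment step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes the mapping is a plain linear transform fitted by least squares, uses synthetic random vectors in place of real Audio Word2Vec and text-autoencoder embeddings, and all names (`fit_linear_map`, `recognize`, the `w0`...`w19` vocabulary, the dimensions 12 and 8) are hypothetical.

```python
import numpy as np

def fit_linear_map(audio_emb, text_emb):
    """Fit W so that audio_emb @ W approximates text_emb (least squares).

    A linear map is an assumption made for this sketch only.
    rcond=1e-8 discards negligible singular values for numerical stability.
    """
    W, *_ = np.linalg.lstsq(audio_emb, text_emb, rcond=1e-8)
    return W

def recognize(audio_vec, W, vocab_emb, vocab):
    """Map one audio embedding into the text space and return the
    vocabulary word whose text embedding has the highest cosine similarity."""
    mapped = audio_vec @ W
    norms = np.linalg.norm(vocab_emb, axis=1) * np.linalg.norm(mapped) + 1e-12
    sims = (vocab_emb @ mapped) / norms
    return vocab[int(np.argmax(sims))]

# Synthetic stand-ins for the two embedding spaces (hypothetical sizes).
rng = np.random.default_rng(0)
vocab = [f"w{i}" for i in range(20)]
text_emb = rng.normal(size=(20, 8))        # one text embedding per word
text_to_audio = rng.normal(size=(8, 12))   # hidden relation between the spaces
audio_emb = text_emb @ text_to_audio       # one audio embedding per word

# Only a small subset of words is "annotated", i.e. paired across spaces.
paired = np.arange(10)
W = fit_linear_map(audio_emb[paired], text_emb[paired])

# Word-level recognition on words that were never paired.
held_out = np.arange(10, 20)
predictions = [recognize(audio_emb[i], W, text_emb, vocab) for i in held_out]
held_out_accuracy = float(np.mean([p == vocab[i] for p, i in zip(predictions, held_out)]))
print(held_out_accuracy)
```

Because the synthetic audio embeddings are an exact linear function of the text embeddings, the unpaired words are recovered here; real embeddings are noisy and only approximately aligned, which is consistent with the much lower accuracy reported in the abstract.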


Phonetic-and-Semantic Embedding of Spoken Words with Applications in Spoken Content Retrieval

Word embedding or Word2Vec has been successful in offering semantics for...

Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Recent research has shown that word embedding spaces learned from text c...

A Convolutional Neural Network Based Approach to Recognize Bangla Spoken Digits from Speech Signal

Speech recognition is a technique that converts human speech signals int...

Minimal Effective Theory for Phonotactic Memory: Capturing Local Correlations due to Errors in Speech

Spoken language evolves constrained by the economy of speech, which depe...

Improving TTS for Shanghainese: Addressing Tone Sandhi via Word Segmentation

Tone is a crucial component of the prosody of Shanghainese, a Wu Chinese...

Wake Word Detection with Alignment-Free Lattice-Free MMI

Always-on spoken language interfaces, e.g. personal digital assistants, ...
