A segmental framework for fully-unsupervised large-vocabulary speech recognition

06/22/2016
by   Herman Kamper, et al.

Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early term discovery systems focused on identifying isolated recurring patterns in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units---effectively performing unsupervised speech recognition. This article presents the first attempt we are aware of to apply such a system to large-vocabulary multi-speaker data. Our system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). Very high word error rates are reported---in the order of 70--80%---highlighting the difficulty of this task. Nevertheless, in terms of cluster quality and word segmentation metrics, we show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system's discovered clusters are still less pure than those of unsupervised term discovery systems, but provide far greater coverage.
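The core representational idea above (mapping a variable-length sequence of feature frames to a single fixed-dimensional acoustic embedding) can be illustrated with a minimal downsampling sketch: keep a small number of frames spread evenly over the segment and flatten them into one vector. This is only an illustrative baseline embedding, not the paper's exact recipe; the frame count (`n_samples=10`) and feature dimensionality are assumed values chosen for the example.

```python
import numpy as np

def downsample_embedding(frames, n_samples=10):
    """Map a variable-length segment of feature frames (shape T x D)
    to a fixed-dimensional vector by selecting n_samples frames evenly
    spaced in time and flattening them. A simple downsampling
    embedding sketch; n_samples=10 is an illustrative choice."""
    T, D = frames.shape
    # Indices of n_samples frames spread evenly over the segment;
    # short segments simply repeat nearby frames.
    idx = np.linspace(0, T - 1, n_samples).round().astype(int)
    return frames[idx].reshape(-1)  # shape: (n_samples * D,)

# Example: a 37-frame segment of 13-dimensional MFCC-like features.
segment = np.random.randn(37, 13)
vec = downsample_embedding(segment)
print(vec.shape)  # (130,)
```

Because every segment, whatever its duration, maps to the same dimensionality, word segments can then be compared and clustered with ordinary vector-space methods, which is what makes the segmental Bayesian framework tractable.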


