Speech Corpora Divergence Based Unsupervised Data Selection for ASR

02/26/2023
by Changfeng Gao, et al.

Selecting training data that matches the application scenario is important for automatic speech recognition (ASR), but it is difficult to measure how well a training corpus matches the target. This study proposes an unsupervised, target-aware data selection method based on speech corpora divergence (SCD), which measures the similarity between two speech corpora. We first use the self-supervised HuBERT model to discretize each speech corpus into label sequences and calculate the N-gram probability distribution over those labels. We then calculate the Kullback-Leibler divergence between the N-gram distributions of the two corpora as the SCD. Finally, we choose the subset with the minimum SCD to the target corpus for annotation and training. Compared to previous data selection methods, SCD data selection can focus on more acoustic details and guarantees the diversity of the selected set. We evaluate our method on different accents from Common Voice. Experiments show that the proposed SCD data selection achieves a 14.8% relative improvement over random selection, which is comparable or even superior to the result of supervised selection.
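
The pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the HuBERT discretization has already been done, so each utterance arrives as a list of integer labels. The function names (`ngram_counts`, `kl_divergence`, `scd`, `greedy_select`), the additive smoothing constant `eps`, the direction of the divergence (target relative to the selected set), and the greedy search are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter
from math import log


def ngram_counts(sequences, n=2):
    """Count N-grams over a corpus given as lists of discrete labels."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) over the union of observed N-grams.

    Additive smoothing with eps (an assumption) keeps Q non-zero
    for N-grams that appear in only one of the two corpora.
    """
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    kl = 0.0
    for gram in vocab:
        p = (p_counts.get(gram, 0) + eps) / p_total
        q = (q_counts.get(gram, 0) + eps) / q_total
        kl += p * log(p / q)
    return kl


def scd(target_seqs, candidate_seqs, n=2):
    """Divergence between a target corpus and a candidate corpus."""
    return kl_divergence(ngram_counts(target_seqs, n),
                         ngram_counts(candidate_seqs, n))


def greedy_select(target_seqs, pool_seqs, budget, n=2):
    """Greedily pick `budget` utterances whose pooled N-gram
    distribution stays closest to the target (hypothetical search)."""
    target = ngram_counts(target_seqs, n)
    selected_counts = Counter()
    selected = []
    remaining = list(range(len(pool_seqs)))
    for _ in range(min(budget, len(remaining))):
        best_idx, best_kl = None, float("inf")
        for idx in remaining:
            trial = selected_counts + ngram_counts([pool_seqs[idx]], n)
            kl = kl_divergence(target, trial)
            if kl < best_kl:
                best_idx, best_kl = idx, kl
        selected.append(best_idx)
        remaining.remove(best_idx)
        selected_counts += ngram_counts([pool_seqs[best_idx]], n)
    return selected


# Toy label sequences standing in for discretized speech.
target = [[1, 2, 3, 1, 2, 3]]
pool = [[1, 2, 3, 1, 2, 3], [7, 8, 9, 7, 8, 9], [1, 2, 3]]
chosen = greedy_select(target, pool, budget=2)
```

Because the divergence is computed on the pooled counts of the whole selected set rather than per utterance, a selection that over-represents a few N-grams is penalized, which is one way to read the abstract's claim about preserving diversity. The exhaustive inner loop is quadratic in the pool size and is only meant for toy inputs.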


research
03/18/2022

Towards Representative Subset Selection for Self-Supervised Speech Recognition

Self-supervised speech recognition models require considerable labeled t...
research
05/16/2023

Adversarial Speaker Disentanglement Using Unannotated External Data for Self-supervised Representation Based Voice Conversion

Nowadays, recognition-synthesis-based methods have been quite popular wi...
research
09/15/2023

Diversity-based core-set selection for text-to-speech with linguistic and acoustic features

This paper proposes a method for extracting a lightweight subset from a ...
research
12/03/2022

Unsupervised Fine-Tuning Data Selection for ASR Using Self-Supervised Speech Models

Self-supervised learning (SSL) has been able to leverage unlabeled data ...
research
07/25/2022

Unsupervised data selection for Speech Recognition with contrastive loss ratios

This paper proposes an unsupervised data selection method by using a sub...
research
07/02/2019

Latent Dirichlet Allocation Based Acoustic Data Selection for Automatic Speech Recognition

Selecting in-domain data from a large pool of diverse and out-of-domain ...
research
08/27/2020

Automatic Speech Summarisation: A Scoping Review

Speech summarisation techniques take human speech as input and then outp...
