Automatic Speech Recognition (ASR) has made striking progress in the recent years with the deployment of increasingly large deep neural networks trained on increasingly large sets of annotated speech (from thousands to tens of thousands of hours). This approach is hit by diminishing returns as the costs of annotating even larger datasets become prohibitive. It is also difficult to scale beyond a handful of high-resource languages and address the needs of a long tail of low-resource languages, dialectal and idiolectal variants, accents, and registers. As such, there has been a recent surge of interest in weakly supervised solutions that use datasets with fewer human annotations. In the semi-supervised setting, only a fraction of the dataset is labelled and the rest is unlabelled [tur2005, kahn2019self], while in a distant supervision setting, the dataset is mostly or entirely unlabelled, but large quantities of unaligned text provide a language model corpus [chen2018almost, chung2018]. Other approaches have addressed pretraining with labels from other languages [huang2013cross, vesely2012language] or pretraining using unsupervised objectives [vandenoord2018, schneider2019wav2vec]. At the extreme of this continuum, zero resource ASR discovers its own units from raw speech [Versteegh2016, dunbar2017, dunbar2019]. Despite many interesting results, the field lacks a common benchmark (datasets, evaluations, or baselines) for comparing ideas and results across these settings. Here, we introduce Libri-light, a large open-source corpus (60K hours) of unlabelled speech and a common set of metrics to evaluate three settings: (1) the zero-resource/unsupervised setting (ABX), (2) the semi-supervised setting (PER and CER), and (3) the distant supervision setting (WER). The last two settings use a limited-resource training set (10 min, 1h, 10h), and the last one large in-domain and out-of-domain text to train language models. The test sets are identical to LibriSpeech [librispeech2015] so as to facilitate comparison of weakly supervised results with the state-of-the art in supervised learning. We also provide a baseline system on these three settings. All datasets, metrics and baseline systems are open source111https://github.com/facebookresearch/libri-light.
2 Related work
The release of open source software and datasets has facilitated rapid progress in machine learning and in particular large vocabulary ASR. LibriSpeech is one of the first large-scale open-source datasets and contains over 1000 hours of audio books, together with textual annotations aligned at the sentence level. Mozilla’s CommonVoice project has facilitated data collection across several languages and currently contains 2900 hours of read speech in 37 languages222https://voice.mozilla.org. A. Black at CMU has compiled the Wilderness dataset which consists of the text of the Bible read in 750 languages [black2019cmu]. Other open-source resources are available from OpenSLR333http://openslr.org/.
The Zero Resource Challenge has released a series of datasets and metrics for the unsupervised setting [Versteegh2016, dunbar2017]444https://zerospeech.com, but the datasets are generally small (between 2.5 and 50 h). In this work, we substantially expand dataset size and use the same evaluation metrics (ABX [Schatz2013ABX]) for comparability. The IARPA Babel program [harper2014] has also initiated a push towards limited supervision for less studied languages. In its most difficult track, the dataset contains only 10 hours of transcribed speech in conjunction with with larger amounts of untranscribed audio. Here, we retain 10 hours as a upper bound, and add lower-resource sets containing 1 hours and 10 minutes of labeled audio. While distant supervision has been the focus of two JHU-JSALT workshops (2016 [liu2017empirical], 2019 [chorowski2019]) but no benchmark has yet emerged.
3 Dataset and metrics
|Unlabelled Speech Training Set|
|Limited Resource Training Set|
|Dev & Test Sets (from LibriSpeech)|
|Unaligned Text Training Set|
The dataset is composed four parts: a train set with unlabelled speech, a train set with limited labels, dev/test sets, and a train set containing unaligned text; see Table 1.
|ABX within speaker||ABX across speaker|
Unlabelled Speech Training Set. This dataset was obtained by extracting audio files for English speech from the LibriVox repository555https://librivox.org containing open source audio books. Files were downloaded and converted to 16kHz FLAC. We then removed corrupted files, files with unknown or multiple speakers, and speakers appearing in LibriSpeech dev and test sets. The potentially duplicated versions of books based on titles were set aside (and distributed as a duplicate subset, totalling 4500 hours). We then ran a Voice Activity Detection (VAD) model using the wav2letter++ framework [wav2letter++] on the recordings to tag onsets and offsets of speech segments. The VAD segments were used to derive an average SNR for each file. For each file, we constructed JSON metadata including title, unique speaker ID, SNR, genre, and list of valid VAD_block (block of more than 500ms of speech indicated by onsets and offsets). We created three dataset splits based on different sizes: (unlab-60k), (unlab-6k) and (unlab-600), matched in genre distribution (the smaller cuts are included in the larger ones, see the stats in Table 1). The distributions by speaker and genres are in Figure 1. The total amount of speech in the dataset is 62.2K hours, including the duplicate files.
Limited-resource Training Set. For training with limited supervision, we selected three subsets of the LibriSpeech training set: a 10 hour set, a 1 hour set, and six 10-minute sets (the six 10-minute sets together make up the 1h set, and the 1h set is included in the 10h set). In each set, half of the utterances are from the clean and other training sets, respectively. We additionally provide orthographic transcriptions from LibriSpeech and phonetic transcriptions generated from phonemizer666https://gitlab.coml.lscp.ens.fr/mbernard/phonemizer.
Dev and Test Set. The dev and test sets are the same as that of LibriSpeech (5.4 hours for dev-clean, 5.3 hours for dev-other, 5.4 hours for test-clean, and 5.1 hours for test-other) and are intended for testing and tuning. All dev or test set audio has been removed from training sets. The ground-truth phonetic sequences for the dev and test sets were generated as above; in addition, for ABX evaluation, forced alignment was obtained using a model trained on LibriSpeech.
Unaligned Text Training Set. For training a language model in the distant supervision setting, we consider the LM corpus provided in LibriSpeech777https://openslr.org/11/ which contains 800M tokens and a vocabulary size of 200k from 14.5k public books from Project Gutenberg. This corpus only partially overlaps with the content of our unlabelled training set and can thus be considered in-domain. Several options exist for publicly available out-of-domain corpora (wikitext103, 1Billion word, etc).
We provide 3 sets of metrics for the unsupervised, distantly-supervised, and semi-supervised settings.
For the unsupervised setting, the aim is to extract speech representations (discrete or continuous) which encode the phonetic content of the language while ignoring irrelevant information (channel, speaker, etc). The representation is evaluated using ABX error, a distance-based metric used in previous zero resource challenges [Versteegh2016, dunbar2017, dunbar2019]
. For a given pair of sounds (for instance, ”bit” versus ”bet”), we compute the probability that the distance between a random token of ”bit” (X) is closer to another token of ”bit” (A) than to a token of ”bet” (B). The ABX error rate is obtained by averaging across all such minimal pairs of phone trigrams in the corpus. For the “within-speaker” score, A, B and X are from the same speakers; in the “across-speaker” score, A and B are from the same speaker, but X is from a different speaker (see[schatz2016abx]).
For the semi-supervised setting, we evaluate the quality of learned acoustic representations with little annotated data. Models can be trained either with character or phonetic targets using limited data and measured by either Character Error Rate (CER) or Phoneme Error Rate (PER).
For distant supervision, we evaluate how the learned representations can be used to decode speech at the word level using a pre-trained language model. We use Word Error Rate (WER) for the evaluation. Because the dev and test sets are from LibriSpeech, this allows to compare distant supervision directly with SOTA supervised models. More details on dataset and metrics in Supplementary Section S1.
A pretrained CPC system plus a linear classifier trained on limited amounts of labels compared to the same system trained from scratch (PER).
|Supervised systems (LibriSpeech 1000 h)|
|CPC pretrain + CTC fine-tuning + 4gram-LM|
|MFSC + TDS + CTC + Grapheme + 4gram-LM|
|+ 60k pseudo-label||78.6||86.5||77.2||86.3|
|+ 60k pseudo-label||30.5||55.8||30.1||57.2|
|MFSC + TDS + CTC + Phoneme + 4gram-LM|
|+ 60k pseudo-label||84.3||90.0||84.0||90.5|
|+ 60k pseudo-label||30.0||55.8||29.3||56.6|
4 Baseline Systems
In the unsupervised setting, we use a PyTorch implementation of the Contrastive Predictive Coding (CPC) system[vandenoord2018]
trained to predict the hidden states of N future speech frames and containing an encoder, a sequence model, and a predictor. The encoder maps waveforms to hidden states (one 512 dimensional embedding every 10 ms frames) using a stack of 5 convolutional layers. The sequence model encodes the hidden states into a 512-dimensional phonetic embedding with one layer of Gated Recurrent Units (GRUs). The predictor maps the last phonetic embedding onto a future hidden state using a linear projection (one linear projection per time step, varying from 1 to 12). To avoid collapsing to a trivial solution, the model is trained discriminatively; the loss function aims at decreasing the dot product between predicted and actual future frames while increasing it for frames belonging to negative sequences (distant time windows). We used a reimplementation of the original paper, which we modified to increase stability and performance, as we could not reproduce the original paper results with the provided descriptions. These changes are as follows: we replaced batch-norm with channel-wise normalization, we reduced the hidden and phonetic embeddings to 256 dimensions, used a LSTM instead of a GRU, and used a 1-layer transformer network instead of a linear projection. The original paper obtained 65.5% accuracy on phoneme classification with a linear classifier trained on top of the frozen system’s phonetic embedding. Our modified system obtained 68.9% accuracy, while using 4 times fewer parameters in the encoder+sequence model part of the system. We trained it on the three cuts (unlab-600, unlab-6k and unlab-60k).
In the semi-supervised setting, we use our baseline pretrained CPC system supplemented with a linear classifier trained with CTC loss on the limited-resource set’s phone labels (only the linear layer is fine-tuned). We also provide a from-scratch control with the same architecture trained end-to-end.
For the distant supervision setting, we run two experiments: (1) we use our pretrained CPC system with an improved CTC layer (LSTM) which we fine-tune with orthographic labels in the limited-resource set. We decode with a python wrapped version of the wav2letter++ decoder [wav2letter++], using a 4-gram KenLM [heafield2011kenlm] language model trained on the unaligned text set. (2) We use CTC to train small Mel-filterbanks-based TDS systems[hannun2019tds]
, (7 TDS blocks, 20M parameters, total stride 2) on phonemes/graphemes respectively. We create pseudo-labels by beam-search decoding the 60k-hours unlabelled data with a 4-gram KenLM decoder trained on LibriSpeech-LM. These labels are used to train larger TDS systems (11 TDS blocks, 37M parameters) from scratch which generate WERs when decoding with the same LM. More details on baselines in Supplementary SectionS2.
The results for the unsupervised setting are shown in Table 2. CPC constructs embeddings with good ABX scores compared to an MFCC baseline, and are in the same range as the SOTA in the Zero Resource Speech Challenge 2017 for English (6.2% within and 8.7% across [heck2017]). The results in the semi-supervised setting (Table 3) show gains in PER in using unsupervised pretraining for several different amounts of fine tuning. The results on the distant supervision (Table 4), while far from supervised state-of-the-art, show that increasing the amount of unsupervised pretraining helps. Pseudo-labels are beneficial but only if the generating and fine-tuned models are initially good (Table 3 and 4).
We have introduced a new large dataset for benchmarking ASR systems trained with limited or no supervision. We found that unsupervised training with increasingly larger dataset yield better features and can significantly boost the performance of systems trained with limited amounts of labels (from 10 min to 10 hours) both for a phoneme recognition task in a semi-supervised setting and for a word recognition in a distant-supervision setting. The baselines were not particularly optimized for the tasks and are provided only as a proof-of-concept; there is a significant margin with fully-supervised systems. Obvious improvements include using larger models, speaker-adversarial losses, fine tuning the entire system (not just the top layers), and pseudo-labels retraining in all settings. Active learning[hakkani2002active] could further select useful parts of the dataset (we have provided SNR data to facilitate this effort). Yet another approach might apply language modeling techniques directly on unlabelled audio to improve the representations before fine-tuning them [chung2018speech2vec, anonymous2020vqwavvec].
S1 Supplementary Datasets and Metrics
s1.1 Datasets and meta-data
The dataset is constructed according to the following pipeline: data download, exclusion of bad files, conversion to flac, extraction of VAD, SNR, and Perplexity, the construction of JSON files and the split in three cuts of varying sizes.
Voice Activity Detection is accomplished using a TDS acoustic model [hannun2019tds] trained using CTC loss [graves2006connectionist] on the LibriSpeech dataset using the orthographic transcription. The trained model was used to perform inference (greedy frame-by-frame decoding) with the wav2letter++[wav2letter++] audio analysis pipeline on the unlabeled audio by mapping all of the letters to SPEECH and the silence symbol to NONSPEECH. The posterior SPEECH probability is added as meta data on the JSON of each file.
The TDS model used for VAD has 100 million parameters and consists of clusters of 2, 3, 4, and 5 TDS blocks separated by 2D convolutions. The duration of audio contained in each label is dependent on the stride of the underlying acoustic model; the model used has a stride of 8. The model is trained with word-pieces using the recipe outlined in [hannun2019tds].
The Signal-to-noise (SNR) ratio is calculated using the VAD labels predicted above. For each 80ms frame the VAD will return a posterior of whether the frame is speech or not. We decided to take 0.8 as the speech threshold, and 0.995 as noise threshold. If a speech frame is detected, we also automatically include 2 subsequent frames, to compensate for the spiky predictions from the VAD model. Any other frames that does not belong to either bucket are ignored since we are not confident whether they are speech or noise. Finally we compute the SNR ratio by using its definition
Perplexity was obtained by performing beam search decoding of the trained TSD model defined above supplemented by a 4gram word Language Model trained on LibriSpeech LM. It was computed as the mean of the log probability of the posterior on each file.
s1.1.4 JSON files and splits
To construct the JSON, we duplicated the metadata of the original book JSON file from LibriVox into each of the files associated for a given book (including unique book ID and speaker ID, and we added tags for SNR, perplexity and our own macro-genre tags, by folding the existing ones into 7 categories: ”Literature”, ”Science, Craft & Essay”, ”Ancient”, ”Religion”, ”Poetry”, ”Theater”, and ”Undefined”. We also added VAD information as a list of onsets and offsets of voice activity.
The files were splitted into cuts of different sizes by trying to maintain the same distribution of macro-genres in the three cuts.
s1.2 The ABX metric
Given a frame-wise distance metric, we want to check that features coding for the same phonemes have a more similar representation than features coding for different phonemes. To quantify this property, we use a minimal pair ABX task as defined in [Schatz2013ABX]; given a set of sounds from a category and a set of sounds from a category we compute:
Here, is the probability of a sample from category to be closer to another element from than to one from. To compute ABX we average the error over all categories.
S2 Supplementary baseline models
s2.1 CPC model and training
To train a feature model in an unsupervised fashion, we used the method implemented by Riviere and al. in [universal_speech_features], inspired by the Contrastive Predictive Coding algorithm (CPC) presented in [vandenoord2018]. We will briefly introduce the algorithm in this section, though we refer the reader to the original papers for more details.
s2.1.1 Contrastive Predictive Coding
CPC relies on forward modeling: given an input sequence of features, we try to predict the future representations of the sequence. The network must discriminate each future ground truth feature from negative examples randomly sampled in the batch.
More precisely, the model goes like this:
The raw waveform goes through a convolutional network , resulting in a feature sequence .
Then we form the current phoneme representation by applying a recurrent network to .
Finally, we predict from using a prediction network .
When using theses feature for another task we always consider , the output of the recurrent layer.
s2.1.2 Architecture Details
, we use five convolutional layers with strides [5, 4, 2, 2, 2], filter-sizes [10, 8, 4, 4, 4] and 256 hidden units with ReLU activations. Besides, the features are normalized channel-wise between each convolution. In the end, this network has a downsampling factor of 160, meaning that on akH audio input each feature will encode ms of data. Furthermore, is a one-layer LSTM with also a dimensional hidden state. Finally, the predictor is a one-layer transformer.
s2.1.3 Training Details
We considered input sequences of ms gathered in batches of sequences per GPU with a total of GPUs. Our training took approximatly two days on NVIDIA Tesla V100-SXM2-16GB. Besides, in a given batch, all sequences were sampled within the same speaker.
s2.2 TDS model and training
Given the limited amount of supervised training data, we select to use a smaller TDS model [hannun2019tds] with 20 million parameters. The model has a stride 2 in the first convolution, and three groups of TDS blocks with channels (10, 14, 18) and (2, 2, 3) blocks in each group. While on the whole 60k hours training data, we use the original architecture introduced in [hannun2019tds] with 37 million parameters. The only difference is that we reduce the overall stride from 8 to 2. We use dropout to prevent over-fitting, and its value is set to 0.4 and 0.1 in the 20M and 37M models.
In terms of model optimization, we use plain SGD with momentum. The initial learning rate and momentum are set to 0.1 and 0.5 respectively. In the supervised setting, the models are trained for 1500 epochs on 8 GPUs in total with learning rate halved after each 200 epochs and total batch size 64 and 16 for character- and phone- based system. In the semi-supervised scenario, the models are trained for 150 epochs on 32 GPUs in total with learning rate halved after every 30 epochs and total batch size 256.
The beam-search decoding parameters are tuned on only dev-other for the 20M TDS models to generate pseudo-labels, while they are tuned independently for the final 37M model. The same official LibriSpeech 4-gram LM is used in both decoding procedures. The decoding beam size is 1000 in all the experiments.
S3 Supplementary Results
s3.1 Pseudo-labels experiment
|MFSC TDS + train-1h||44.4||57.7||55.6||65.9|
|+ 60k pseudo-label||57.6||68.1||59.5||72.3|
|MFSC TDS + train-10h||22.5||40.2||22.2||41.3|
|+ 60k pseudo-label||18.7||36.0||18.7||38.6|
|MFSC TDS + train-1h||44.3||53.9||46.9||55.4|
|+ 60k pseudo-label||43.4||52.6||43.2||53.9|
|MFSC TDS + train-10h||20.7||36.4||21.8||38.0|
|+ 60k pseudo-label||15.0||31.1||14.7||32.2|
Table S1 provides PER/CER in the distant supervision setting with models trained on pseudo-labels. The TDS models above and generated pseudo-labels are trained and generated with the exactly the same procedure introduced in Section 4. Note that the PER/CER results above are not comparable to the semi-supervised ones in Table 3 as pseudo-labels here are generated with the official LibriSpeech LM, whose training set is a super-set of the transcriptions in the supervised training set.