BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

by   Yu Zhang, et al.

We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.


I Introduction

Semi-supervised learning (SSL), which uses unlabeled data to enhance the performance of labeled tasks, has recently played a crucial part in improving public automatic speech recognition (ASR) benchmarks. A combination of pre-training [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] and self-training [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25] methods has enabled deep networks to push the state-of-the-art (SoTA) performance on public ASR datasets [12, 23, 26].

The dominant setting for semi-supervised learning has been in the domain of audio books. The Libri-Light dataset [27], which contains 60k hours of audio, is by far the largest unlabeled semi-supervised dataset that has been used to improve the performance on LibriSpeech [28] and sub-sampled subsets thereof [27, 12, 23, 26, 29]. Despite the success and exciting developments in this domain, this setting for semi-supervised learning is limited in a few aspects. First, the unsupervised data is tailored to the supervised task, and models pre-trained on Libri-Light have shown limited generalization capacity to different domains in some instances [30]. Second, the Libri-Light dataset is not much bigger than industrial-scale labeled datasets. Third, the supervised tasks considered are much smaller than the practical tasks on which the performance of the network needs to be improved.

In this report, we study the utility of large models, with the parameter count ranging from 600M to 8B, pre-trained and self-trained on extremely large and diverse unlabeled datasets containing hundreds of thousands to a million hours of audio. More precisely, we construct:

  • P-models: Models pre-trained on large unlabeled datasets.

  • PS-models: Models pre-trained and self-trained with large unlabeled datasets.

These models are in turn utilized to improve various labeled downstream tasks. We compile an extensive list of downstream tasks with audio data ranging from tens of hours to tens of thousands of hours across a wide variety of domains and languages. We focus on three different classes of downstream training methods:

  • Training P-models on labeled datasets.

  • Downstream self-training with P-models.

  • Fine-tuning with PS-models.

The first two training methods only employ unlabeled data in addition to the labeled data of the downstream task. Meanwhile, PS-models are self-trained upstream, and labeled data beyond that of the downstream task is used in their construction.

For the rest of the section, we highlight key findings, present an overview of the paper and comment on related work.

I-A Key Findings

Fig. 1: (Left) WERs (%) of P-models trained on subsets of Voice Search. (Middle) SoTA results on public and non-public ASR benchmarks. (Right) SoTA results on public audio classification tasks. Accuracy measures presented are classification accuracy for Voxforge/SAVEE/Crema-D, unweighted average recall for Masked Speech, and mean average precision (mAP) for AudioSet. SoTA for AudioSet is selected from non-ensembled results using audio data exclusively. Axes are log-scaled.

SSL + Large Models = Labeled Data Efficiency: By scaling up the model size and utilizing semi-supervised learning techniques with a large amount of unlabeled data, we vastly improve labeled data efficiency. The first panel of Figure 1 shows the performance we achieve, without the use of additional labeled data, by training our models on 100h and 1000h subsets of the 34kh training set of the English (US) Voice Search task (VS). We obtain results comparable to the reported SoTA performance [31] by using only 3% of the labeled data.

SoTA results for downstream ASR tasks: We exceed or match state-of-the-art results by fine-tuning the pre-trained models on a wide variety of downstream ASR tasks, as summarized in the second panel of Figure 1. Results for all downstream ASR tasks studied are collected in Section IV.

SoTA results for downstream non-ASR tasks: We have trained shallow classifiers on top of features derived from pre-trained large encoders for audio classification. By doing so, we are able to achieve SoTA on multiple public benchmarks as presented in the last panel of Figure 1. Complete results are presented in Section VI.

Benefits of using SSL + Large Models are smaller for bigger downstream tasks, but are still significant: The gains achieved by increasing model size, pre-training and self-training have diminishing returns with larger labeled dataset size as is shown in Figure 2. Nevertheless, we are able to observe meaningful gains for industrial-scaled tasks.

Fig. 2: WERs (%) of models trained on subsets of Voice Search. On the left, we show the performance of the 600M-parameter model (the "Conformer XL") with varying preparation methods, while on the right, we report that of P-models of varying sizes.

I-B Outline

The outline of this report is as follows:

Methods: We use the Conformer [32] architecture as the speech encoder. We train 600M, 1B and 8B-parameter Conformer encoders using wav2vec 2.0 pre-training [12], and self-train and/or fine-tune RNN-T [33] or CTC [34] models built on these encoders.

Model Preparation with YouTube: We use YouTube-based large-scale data to pre-train or self-train the Conformer models. A 1M-hour unlabeled dataset, which we denote YT-U, is used for pre-training, while a 500k-hour filtered unlabeled dataset, denoted YT-T, is used for self-training. P- and PS-models trained using these datasets are constructed.

ASR Tasks: We fine-tune the P- and PS-models on various ASR tasks and improve their performance. We are able to match existing benchmarks on the Voice Search task using 3% of the full data, and significantly improve the performance of the full task by pre-training. We are also able to achieve SoTA/near-SoTA performance on YouTube and public datasets.

Experiments with Voice Search: We fine-tune the P- and PS-models on the 34k-hour English (US) Voice Search dataset [35] and its 100h, 1000h subsets. We run control experiments studying the effect of the labeled dataset size, model size, pre-training, upstream and downstream self-training. Cross-lingual benefits of pre-training are also explored.

Non-ASR Tasks: We use features derived from the intermediate layers of the P-models for non-ASR tasks. We achieve SoTA performance for multiple tasks within the non-semantic speech (NOSS) benchmark [36] by directly using these features with linear models only. For the AudioSet benchmark [37], which involves a wide variety of non-speech sounds, we find intermediate Conformer layers pre-trained on the non-labeled native dataset, rather than YT-U, to yield SoTA results.

Discussions and Future Directions: We comment on some noteworthy observations and discuss possible future directions.

Despite the unifying theme of employing semi-supervised learning methods to train large models, the particulars of the experiments are varied. This variation is due to the fact that the tasks explored in this work have different pre-existing set-ups with limited budgets for experimentation. We describe the important elements of each experiment in the corresponding section and provide additional details in the appendix.

I-C Related Work

Our work is an extension of a host of recent research efforts [27, 12, 23, 26, 29] that have studied semi-supervised learning [38, 39, 40] for ASR in the context of deep learning. Our main contribution is that we have scaled up pre-training and self-training in terms of model size (8 billion parameters), unlabeled dataset size (a million hours of audio) and labeled dataset size (34 thousand hours of audio) and:

  • Conducted a systematic study of the effect of pre-training, upstream/downstream self-training and model size on downstream tasks of varying sizes.

  • Fine-tuned the prepared models on seven public and six industrial ASR datasets spanning multiple speech domains, languages and accents.

  • Studied the utility of the pre-trained representations by using them for downstream non-ASR tasks.

We have used Conformers [32] as the encoder architecture in this work. The pre-training method used in this work is based on wav2vec 2.0 [12], while the self-training method is based on noisy student training [41] using SpecAugment [42, 43]. These methods have been employed in the context of LibriSpeech in [26]. There is extensive literature on pre-training [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] and self-training [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25] in ASR, a subset of which we list in the bibliography. Methods for improving the performance of streaming ASR models using models with future context have been studied in [44, 45, 46, 47, 48, 49, 50, 51, 52]. Multi-domain training, which has been used for training on multiple public datasets in this report, has been studied in the context of ASR in [53, 54, 55, 30, 35].

Giant models have been studied predominantly in the context of natural language processing [56, 57]. Various methods have been employed for making giant models practically trainable [58, 59, 60, 61, 62]. We have used the GShard [62] framework with the GSPMD backend [63] to scale our ASR models up to 8B parameters.

II Methods

The methods employed for experiments in this report largely follow those of [26]. We review the key components here for completeness, while more details for each experiment can be found in the appendix.

II-A Model Architecture: Conformer

We use the Conformer [32], the convolution-augmented transformer, as the encoder network for our ASR models. The key component of the Conformer is the Conformer block, which consists of attention, feed-forward and convolutional modules [32]. As depicted in the left panel of Figure 3, the input mel-log spectrogram to the network is subject to convolutional sub-sampling, after which a series of Conformer blocks and a projection layer are applied to obtain the final features.

Fig. 3: The Conformer encoder and wav2vec 2.0 pre-training.

These features are either used as input to an RNN transducer [33] along with a 2-layer LSTM decoder, or used as input for a connectionist temporal classification (CTC) model [34] after an additional projection layer.

We consider three models with 600M, 1B and 8B parameters in this work, the particulars of which are listed in Table I. The convolutional kernel size for all these models is set to 5. Following the notation of [32, 26], we denote the three models Conformer XL, Conformer XXL and Conformer G. We use relative attention [64] for these models.

Model          # Params (B)  # Layers  Dimension  Att. Heads
Conformer XL   0.6           24        1024       8
Conformer XXL  1.0           42        1024       8
Conformer G    8.0           36        3072       16
TABLE I: Conformer model parameters.

II-B Pre-training: Wav2vec 2.0

To pre-train the Conformer encoder network, we employ wav2vec 2.0 [12] training, as depicted in Figure 3. After first extracting encoded features from the convolutional sub-sampling layer of the network, we mask these features and pass them through the rest of the Conformer model to generate context vectors. These context vectors are trained, via a contrastive loss, to agree with the target context vectors obtained by applying a linear layer to the initial encoded features. The convolutional subsampling layer consists of two 2D convolutional layers applied with strides (2, 2).
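The contrastive objective described above can be illustrated with a minimal sketch for a single masked position. It assumes the cosine-similarity and temperature formulation from the wav2vec 2.0 paper [12]; the function names, the temperature default and the way distractors are supplied are illustrative choices, not details given in this report.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked time step.

    context:     context vector produced by the encoder at the masked step
    target:      target vector for the same step (the positive example)
    distractors: target vectors taken from other steps (negative examples)
    temperature: softmax temperature (assumed value, for illustration only)
    """
    logits = [cosine(context, target) / temperature]
    logits += [cosine(context, d) / temperature for d in distractors]
    # Numerically stable log-sum-exp over the positive and all negatives.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    # Negative log-probability of picking the true target.
    return log_z - logits[0]


# The loss is small when the context vector matches its target and large
# when it matches a distractor instead.
good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In practice the loss would be averaged over all masked positions in a batch, with distractors sampled from the same utterance.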

II-C Self-training: Noisy Student Training

Noisy student training (NST) [41, 23] is a self-training method where a teacher model generates pseudo-labels for a large unlabeled dataset, which is in turn used to train a student model with augmentation.

The experimental set-up in this report differs from previous works including [23, 26], where the teacher model has been fused with a language model to generate better labels. In this work, our unlabeled dataset being very large, we choose to carry out inference without language model fusion, as the inference speed of the fused model is significantly slower than that of the ASR model on its own.

Unlike the fused model, the loss computed by the ASR model has a straightforward interpretation as a confidence measure—we thus can use a simple confidence-per-word measure to filter the teacher-generated transcripts. In our experiments, we either choose to retain the entire pseudo-labeled set, or filter 50% of the utterances based on confidence-per-word.

For some experimental set-ups, we choose to use a small model that has already been trained on a labeled dataset for a given task as a teacher model, rather than using a scaled-up family of models as commonly done in the literature [41, 23]. We have also chosen to only proceed with one generation of NST training when it is employed to observe its effects, rather than optimize performance by going through many generations.

In summary, given a labeled dataset S and an unlabeled dataset U, our NST procedure is as follows:

  1. Use teacher model T to generate the pseudo-labeled dataset T(U). T may be a model trained only with S, or one that has been pre-trained or self-trained.

  2. (Optionally) filter T(U) using confidence-per-word.

  3. (Optionally) mix datasets S and T(U) into a new training set.

  4. Fine-tune a new pre-trained model M with augmentation on the training set.

  5. (Optionally) set T = M and repeat.
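One generation of the procedure above can be sketched as follows. This is a schematic illustration, not the implementation used in this report: the `PseudoLabel` record, the `teacher` and `fine_tune` callables, and the definition of confidence-per-word as the teacher's total log-likelihood divided by the word count are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class PseudoLabel:
    utterance_id: str
    transcript: str
    log_prob: float  # teacher's total log-likelihood (assumed to be available)


def confidence_per_word(label):
    """Assumed confidence measure: log-likelihood normalized by word count."""
    n_words = max(len(label.transcript.split()), 1)
    return label.log_prob / n_words


def filter_by_confidence(labels, keep_fraction=0.5):
    """Keep the most confident fraction of the pseudo-labeled utterances."""
    ranked = sorted(labels, key=confidence_per_word, reverse=True)
    return ranked[: int(len(ranked) * keep_fraction)]


def nst_generation(teacher, labeled, unlabeled, fine_tune,
                   keep_fraction=None, mix_with_labeled=True):
    """Run one NST generation and return the trained student.

    teacher:   callable mapping an unlabeled utterance to a PseudoLabel
    labeled:   list of (utterance_id, transcript) pairs
    unlabeled: list of unlabeled utterances
    fine_tune: callable that trains a fresh pre-trained model, with
               augmentation, on a list of (utterance_id, transcript) pairs
    """
    pseudo = [teacher(u) for u in unlabeled]                   # step 1
    if keep_fraction is not None:
        pseudo = filter_by_confidence(pseudo, keep_fraction)   # step 2
    train = [(p.utterance_id, p.transcript) for p in pseudo]
    if mix_with_labeled:
        train = list(labeled) + train                          # step 3
    return fine_tune(train)                                    # step 4
```

Repeating the loop (step 5) amounts to wrapping the returned student as the next `teacher`; the experiments in this report use a single generation.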

II-D GShard/GSPMD: Making 8B-parameter Models Trainable and Efficient

We use the GShard [62] framework with the GSPMD backend [63] to train the 8B model on Cloud TPUs. In particular, the GSPMD-style pipeline parallelism works very well. This is because 1) the model has many layers but each layer is not very large, which makes pipeline parallelism more efficient than other forms of model parallelism in terms of inter-device communication cost; 2) each pipeline stage is smaller than the full program, reducing the overall compilation/startup time; 3) we only need to pipeline the Conformer blocks, and GSPMD allows us to conveniently switch to data parallelism for the layers before and after.

II-E Training Details

Data Processing: The audio in this work has been uniformly resampled to 16 kHz; any audio with a different native sampling rate is either up-sampled or down-sampled. The audio is featurized into 80-dimensional log-mel filterbank coefficients. Two tokenization schemes are used for the transcripts: graphemes or word-piece models (WPMs) [65], the details of which differ for each experiment.

Pre-training: The masking parameters for wav2vec 2.0 pre-training are taken from [12], where the starting point for the mask is chosen randomly with probability 0.065, and the mask size is set to 10 steps. The transformer learning rate schedule (section 5.3 of [66]), parameterized by the peak learning rate and the warm-up steps, is used universally. The Conformer XL is trained using Adam optimization with exponential moving averaging (EMA) with decay rate 0.9999. The XXL/G models are trained with Adafactor [67] optimization. These models do not use exponential moving averages during pre-training.
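The two ingredients named in this paragraph, span masking and the transformer learning rate schedule, can be sketched in a few lines. The masking function follows the description above (each step starts a 10-step mask with probability 0.065, overlapping spans merge), and the schedule is the warm-up followed by inverse-square-root decay from [66], rewritten in terms of its peak value. The function names and the peak learning rate and warm-up values in the usage lines are hypothetical, since the report adjusts them per experiment.

```python
import random


def sample_mask(num_steps, start_prob=0.065, span=10, rng=None):
    """Return a list of booleans marking the masked time steps.

    Each step is independently chosen as a span start with probability
    start_prob; every chosen start masks the next `span` steps, and
    overlapping spans simply merge.
    """
    rng = rng or random.Random()
    mask = [False] * num_steps
    for start in range(num_steps):
        if rng.random() < start_prob:
            for t in range(start, min(start + span, num_steps)):
                mask[t] = True
    return mask


def transformer_lr(step, peak_lr, warmup_steps):
    """Linear warm-up to peak_lr, then inverse-square-root decay."""
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)


# Hypothetical settings, for illustration only.
mask = sample_mask(400, rng=random.Random(0))
lr_at_peak = transformer_lr(10_000, peak_lr=2e-3, warmup_steps=10_000)
lr_later = transformer_lr(40_000, peak_lr=2e-3, warmup_steps=10_000)
```

With these settings the learning rate reaches its peak exactly at the end of warm-up and has halved by four times that many steps.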

Training with Labels: In this work, we encounter three kinds of initialization conditions for labeled training: one where the entire network is randomly initialized, one where only the encoder portion has been trained with wav2vec 2.0 pre-training and the rest of the network is randomly initialized, and one where the entire network has been trained in some fashion. To handle all three cases, we use two separate optimizers for the encoder parameters and the decoder parameters of the network during labeled training. As with pre-training, the Conformer XL is trained with Adam optimization, while the Conformer XXL and G are trained with Adafactor optimization. All networks are trained using EMA with decay rate 0.9999 for labeled training. The learning rate schedules for both the decoder and encoder optimizers of the XL/XXL models are transformer schedules parameterized by peak learning rate and warm-up steps, while for the G model, we use a constant schedule with a linear warm-up phase. The batch size, learning rate and warm-up steps are adjusted for the downstream tasks. To augment the input spectrogram, we use the standard adaptive SpecAugment [42, 43] policy with two frequency masks, governed by a mask size parameter, and ten time masks, governed by a maximum time-mask ratio.
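For readers unfamiliar with EMA, the update applied to the model weights can be written as a one-line rule. The sketch below is a generic illustration over a flat list of parameters; the function name and list representation are assumptions, and only the decay rate 0.9999 comes from the text.

```python
def ema_update(shadow, params, decay=0.9999):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * params."""
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]


# Toy usage with a small decay so the effect is visible in three steps.
shadow = [0.0, 0.0]
for params in ([1.0, 2.0], [1.0, 2.0], [1.0, 2.0]):
    shadow = ema_update(shadow, params, decay=0.5)
# shadow is now [0.875, 1.75]
```

As a rule of thumb, a decay of 0.9999 averages over roughly the last 1 / (1 - 0.9999) = 10,000 updates, and the averaged weights are the ones typically used for evaluation.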

III Model Preparation with YouTube

We describe the procedure we use to prepare large Conformers using a large unlabeled dataset for downstream tasks in this section. The large unlabeled datasets that form the basis of our studies come from YouTube videos. The pre-trained and self-trained Conformers used to further train on downstream tasks for the rest of the paper are summarized in Section III-B.

III-A Data

We collect three datasets based on YouTube, which are used to pre-train, self-train and train our models for downstream tasks and for tasks native to YouTube:

  • YT-L: 350k hours of segmented, weakly-labeled audio, combined with 1000 hours of labeled audio.

  • YT-T: 500k hours of segmented, pseudo-labeled audio.

  • YT-U: 900k hours of segmented, unlabeled audio.

In constructing and pseudo-labeling the datasets, an important role is played by an RNN-T model with a bi-directional LSTM encoder. This 100M-parameter model will be referred to as the "YT baseline model" throughout this section. (More details on this particular model can be found in section 3.1 of [68].) The audio segmentation is carried out by conducting inference with this baseline model, which is used to identify the speech boundaries. The model is further used to pseudo-label the YT-T dataset.

YT-L: YT-L is a combination of a weakly-labeled dataset whose method of construction has been elaborated on in [69] and an additional 1000 hours of labeled audio. The weakly labeled portion of the dataset is based on audio from videos that have user-uploaded transcripts, where "islands" of the audio and the transcripts are selected where the transcripts are thought to well-represent the audio. This is done by first force-aligning the transcripts and audio and finding islands of high confidence using a pre-existing acoustic model. 350k hours of audio with transcripts are obtained this way.

YT-T: YT-T is a dataset also with audio from videos that have user-uploaded transcripts. These videos are further segmented using the YT baseline model, and the non-speech segments are removed, leaving 500k hours of audio. The user-provided transcripts of this dataset, however, are discarded and are not used for training. Instead, we choose to generate pseudo-labels on this dataset using the YT baseline model trained on YT-L when we use it for labeled training.

YT-U: YT-U is built by first randomly collecting 3 million hours of audio from "speech-heavy" YouTube videos, including lectures, news and interviews, filtered by language. The 3 million hours of audio is then further segmented by the YT baseline model. The non-speech segments identified by the YT baseline model are removed to yield approximately a million hours of unlabeled audio data.

The test set for the YT domain is generated by hand-transcribing popular YouTube videos, totaling 11 hours of audio, with individual lengths of 2 to 10 minutes.

Besides the obvious advantage of having a very large amount of audio available, these YouTube-based datasets have an extremely wide range of sub-domains [54, 69], plotted in Figure 4 for YT-U. This variety will prove to be beneficial in various downstream tasks.

Fig. 4: Video categories by length (outer) and number (inner).

III-B Pre-trained and Self-trained Models

We produce Conformer models that have been pre-trained and self-trained with YouTube-based data that will be employed for downstream tasks in the rest of the paper. The notation ConformerSize-Decoder-Preparation (for example, ConformerXXL-RNNT-PS) is used to denote a Conformer of size "Size" with decoder-type "Decoder" prepared with method "Preparation." The following options are available for each parameter:

  • Size: XL, XXL or G.

  • Decoder: CTC or RNNT.

  • Preparation: Null (no preparation), P (pre-trained) or PS (pre-trained and self-trained).

Pre-training is done via wav2vec 2.0 with YT-U, the details of which are presented in Section II-E. This process exclusively prepares the encoder portion of the model.

Meanwhile, the PS-model is produced by taking the model pre-trained with YT-U and training it on the pseudo-labeled dataset YT-T. While the PS-model never sees the labeled dataset YT-L, it implicitly uses the information, since YT-T has been pseudo-labeled by the YT baseline model trained on YT-L. A 4k-WPM model is used to tokenize the text for PS-models. Self-training produces a strong upstream model that can be fine-tuned on smaller tasks. The performance of the PS-models on the YT test set is presented in Table II.

IV ASR Tasks

In this section, we fine-tune the models pre-trained and self-trained with YouTube data on downstream ASR tasks. We compare the performance of our fine-tuned models with existing benchmarks, and show that they are able to improve state-of-the-art results on benchmarks spanning a wide range of dataset sizes and domains. We first present the effect of pre-training and self-training on the native YouTube task, and move on to presenting the improvements on the English (US) Voice Search task we were able to achieve both in terms of performance and efficiency. We then present our results on two public tasks, SpeechStew [30] and CHiME-6 [70] and compare them to the current SoTA benchmarks. We also apply our methods to a non-public Telephony task and show that we are able to improve the performance of a streaming model by using a fine-tuned Conformer PS-model as a teacher.

IV-A YouTube

Data: YT-L, which is partially labeled and partially weakly-labeled, is used as the supervised dataset for the YouTube task. Meanwhile we utilize YT-T as the unlabeled dataset to be pseudo-labeled by the teacher model. As described before, the YT baseline model trained on YT-L is taken to be the teacher for the pre-trained Conformer models.

Results: We present the results of training pre-trained Conformer CTC and RNN-T models with labeled YT data against existing baselines in Table II. For self-training, the student Conformer model is trained with YT-T pseudo-labeled by the YT baseline model. The pseudo-labeled data is neither filtered nor mixed with YT-L for training the student. Quite remarkably, we find that training our models entirely with machine-generated transcripts yields better performance than training with the default labeled dataset.

Model                   YT-test   Model                   YT-test
SoTA [68]               9.1
Baseline Model          8.4
ConformerXL-CTC-P       8.6       ConformerXXL-CTC-P      8.5
 + Self-training (PS)   7.9        + Self-training (PS)   7.5
ConformerXL-RNNT-P      -         ConformerXXL-RNNT-P     -
 + Self-training (PS)   7.8        + Self-training (PS)   7.8
TABLE II: WER (%) on test sets after pre-training and self-training Conformer CTC and RNN-T models. We denote the models trained with pre-training + self-training as PS-models throughout the paper (see section III-B.)

IV-B English (US) Voice Search

Data: The English (US) Voice Search (VS) dataset contains 34k hours of labeled voice search audio [35]. To test our ability to improve data efficiency, we construct random 1000 and 100 hour subsets, which we denote VS-1000h and VS-100h. For this data, we are able to utilize a 128M-parameter Conformer language model trained on a relevant text corpus, which is used for improving the ASR model performance via shallow-fusion [71] for RNN-T models. Further information on the dataset can be found in Section V-A.

Overview: We have conducted a systematic study of the effect of model size, labeled dataset size, pre-training, language model fusion, upstream and downstream self-training for training Conformers on VS, which we present in Section V.

Main Results: Our best results for each subset of Voice Search are obtained by training the ConformerG-P. For the 100h subset, we have applied NST and LM fusion to obtain our best result, while for the 1000h subset, LM fusion is sufficient (see Section V-F). For the full dataset, neither NST nor LM fusion pushed the performance further. The best WERs obtained have been presented in Table III and plotted in the first panel of Figure 1 in the introduction.

              VS-100h  VS-1000h  VS-34kh
SoTA [31]     -        -         4.8
Our Results   6.9      5.0       4.1
TABLE III: English (US) Voice Search test WERs (%) from training on 100h, 1000h and 34kh subsets of VS.

IV-C SpeechStew

Task                           AMI                   Common Voice  LibriSpeech           Switchboard/Fisher    TED-LIUM   WSJ
                               IHM        SDM1                     clean      other      SWBD       CH                    eval92
Prior Work
 SoTA                          9.0 [30]   21.2 [72]  8.4 [30]      1.4 [26]   2.6 [26]   4.3 [73]   6.8 [73]   5.2 [55]   1.3 [30]
 ConformerXXL-LibriLight [30]  9.5        22.7       8.4           1.7        3.3        4.8        10.6       5.7        1.3
Our Work
 ConformerXXL-RNNT-P           8.6        17.7       7.8           1.9        3.5        4.6        10.2       5.9        1.3
  + Downstream NST             7.8        18.3       7.7           1.9        3.7        4.5        8.2        5.2        1.6
     (Non-filtered)            (9.8)      (22.7)     (11.5)        (2.9)      (6.6)      (5.2)      (9.1)      (5.0)      (4.0)
 ConformerXXL-RNNT-PS          8.3        19.5       8.8           2.1        4.1        4.8        8.4        5.0        1.6
TABLE IV: WERs (%) across multiple tasks for multiple settings compared against pre-existing baselines. We present the performance of the fine-tuned ConformerXXL-P model as well as the result obtained by applying an NST loop starting with this model using the YT-T data as the unlabeled dataset. The result from fine-tuning the PS-model trained upstream with pseudo-labeled YT-T data (see Table II) are presented in the last row. Evaluated with punctuation removal following [55]. Evaluated after removing <unk> tokens. <unk> token removal affects TED-LIUM performance only.

Data: The SpeechStew [30] dataset is assembled by putting together seven public speech corpora: AMI [74], Common Voice [75], English Broadcast News (Linguistic Data Consortium (LDC) datasets LDC97S44, LDC97T22, LDC98S71 and LDC98T28), LibriSpeech [28], Switchboard/Fisher (LDC datasets LDC2004T19, LDC2005T19, LDC2004S13, LDC2005S13 and LDC97S62), TED-LIUM v3 [76, 77] and Wall Street Journal (LDC datasets LDC93S6B and LDC94S13B). All utterances from these datasets are collected, mixed randomly and batched for training; no additional steps are taken regarding balancing and mixing data from disparate datasets. The training, dev and evaluation sets for these datasets are processed as in [30], where the transcripts have been prepared via Kaldi [78]. The inference results on the test sets are scored via the corresponding Kaldi scripts, while for the Common Voice set, we take the extra step of dropping punctuation before evaluating the word error rates. We use a 1k-WPM constructed based on the LibriSpeech test set for training P-models, while the PS-models use the upstream 4k-WPM.

Overview: We have compiled key experimental results for SpeechStew in Table IV. As a baseline, we have listed SoTA word error rates for each task inside SpeechStew, and recorded the performance of the ConformerXXL RNN-T network pre-trained with Libri-Light data trained on SpeechStew [30].

Libri-Light vs. YT-U: We report the performance of the ConformerXXL-RNNT-P model trained on SpeechStew in the third row of Table IV. As in [30], the model is trained on the mixed SpeechStew data, without any data balancing or batch-wise mixing. By comparing the second and third rows, we see that pre-training on YT-U is significantly more beneficial than pre-training with Libri-Light: the gains achieved on AMI, Common Voice and Switchboard/Fisher are significant, while the performance lags on the LibriSpeech and TED-LIUM domains are comparatively small. This result is not surprising, since YT-U is a much bigger (1 million vs. 60k hours) and more diverse (see Section III-A) dataset.

Downstream Noisy Student Training: We experiment with downstream noisy student training, where we apply one NST loop with the ConformerXXL-RNNT-P model trained on SpeechStew. To do so, we pseudo-label the YT-T dataset with the SpeechStew-trained Conformer model, and filter 50% of the data using confidence-per-word of the generated transcripts. We then mix SpeechStew data with the pseudo-labeled data without any balancing. The result is recorded in the fourth row of Table IV, where we see significant improvement on the AMI-IHM, Callhome and TED-LIUM test sets, with small performance degradation on other test sets. We find filtering to be a crucial part of NST. Comparing the performance of models trained on pseudo-labeled datasets constructed with and without filtering (fourth vs. fifth row of Table IV), we find that without filtering, training with the pseudo-labeled data leads to severe degradation of performance across the board, save for a surprising performance improvement on TED-LIUM.

PS-models: We have also trained the ConformerXXL-RNNT-PS on the SpeechStew dataset. Deviating from the general trend observed in the rest of the paper, the PS-model does worse than the P-model across the board, as can be observed by comparing the third and last rows of Table IV. We hypothesize that the performance lag of the PS-model comes in part from differences in the text normalization of upstream and downstream labeled tasks. In contrast to PS-models, P-models can use WPMs native to the downstream task, and learn native text normalization conventions from the beginning. One manifestation of this disadvantage is that the fine-tuned PS-model has a tendency to produce <unk> tokens during pauses, resulting in an 8.7% WER on the TED-LIUM test set; simply getting rid of these tokens led to a 3.7% absolute improvement in WER. This issue did not affect performance on any of the other test sets.

IV-D CHiME-6

Data: CHiME-6 [70] contains 40 hours of distant-microphone conversational speech recorded in everyday home environments. We use the official front-end enhancement recipe [70] to enhance the dataset: BeamformIt is used to create an augmented training set, while guided source separation [79] with 12 channels is used to enhance the dev/evaluation sets.

Overview: We have presented results from training P- and PS-models on CHiME-6 in Table V. As baselines, we have listed the performances of previous SOTA models and Libri-Light pre-trained models reported in the literature.

Model Pre-training Upstream Dev Eval
Prior Work
 HMM Baseline [70] - - 51.8 51.3
 HMM (SOTA) [80] - - 36.9 38.6
 ConformerXXL [30] Libri-Light - - -
 ConformerXXL [30] Libri-Light SpeechStew 31.9 38.9
Our Work
 ConformerXXL-P YT-U - 35.1 39.5
 ConformerXXL-P YT-U SpeechStew 26.2 34.4
 ConformerXXL-PS YT-U YT-T 26.2 31.0
TABLE V: WERs (%) on CHiME-6. We show the dataset used for pre-training and upstream labeled-training before fine-tuning the model on the CHiME-6 dataset. Recall that PS-models are trained upstream on the pseudo-labeled YT-T dataset.

Results: We have recorded the performance of the ConformerXXL-RNNT-P model directly trained on CHiME-6 in the first "Our Work" row of Table V, while that of the model first trained on SpeechStew and fine-tuned on CHiME-6 is recorded in the second "Our Work" row. Quite surprisingly, the P-model directly trained on CHiME-6 shows strong performance, in contrast to the ConformerXXL pre-trained with Libri-Light, which fails to directly train on CHiME-6. Upon training on the upstream task of SpeechStew and fine-tuning on CHiME-6, we are able to exceed state-of-the-art performance with an 11% relative WER improvement on the CHiME-6 test set. The strongest CHiME-6 performance is achieved by fine-tuning the PS-model, resulting in a 20% relative WER improvement on the test set.
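The relative improvements quoted above can be checked against Table V with a short calculation. We assume here that "test set" refers to the Eval column and that the reference point is the HMM (SOTA) Eval WER of 38.6%; `relative_wer_improvement` is a hypothetical helper name.

```python
def relative_wer_improvement(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Eval-column WERs (%) taken from Table V.
sota_eval = 38.6          # HMM (SOTA) [80]
p_speechstew_eval = 34.4  # ConformerXXL-P, upstream SpeechStew
ps_eval = 31.0            # ConformerXXL-PS, upstream YT-T

p_gain = relative_wer_improvement(sota_eval, p_speechstew_eval)
ps_gain = relative_wer_improvement(sota_eval, ps_eval)
print(f"P-model: {p_gain:.1f}%  PS-model: {ps_gain:.1f}%")
```

Under these assumptions the two values round to the 11% and 20% figures stated in the text.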

IV-E Telephony

Data: We aim to improve an English (GB) telephony task with a training set obtained by mixing a labeled telephony dataset with 320 hours of audio and a video-based dataset with 30 hours of audio. We use two test sets to evaluate the model. Test-short consists of 9 hours of telephony audio, while Test-long consists of 82 hours of long video-based audio. The two test sets are chosen because we wish to construct a model that performs well on telephony audio while also showing good performance on long utterances.

Overview: Our objective is two-fold. The first goal is to train a large ASR model that does well on the Telephony task, while the second is to distill its performance to a streaming model. As the base streaming model, we use the RNN-T model of [81] with an 8-layer uni-directional LSTM encoder with cell size 2048 and a 2-layer LSTM decoder with the same cell size, trained with the large multi-domain dataset presented in [54]. As a baseline, we fine-tune this model with various mixtures of the telephony dataset and the video dataset. Our results are summarized in Table VI.

Model                      Fine-tuning Mixture       Test-short  Test-long
                           Telephony  Video  NST
Streaming Model
  Baseline                 N/A        N/A    N/A     33.11       15.53
  Fine-tuned               1.0        -      N/A     22.45       21.41
  Fine-tuned               0.8        0.2    N/A     22.64       19.99
ConformerXL-P              0.8        0.2    N/A     22.24       14.55
ConformerXL-PS
  Baseline                 N/A        N/A    N/A     27.20       10.97
  Fine-tuned               0.8        0.2    N/A     21.24       10.72
Student Streaming Model    0.8        -      0.2     22.97       16.75
TABLE VI: Telephony test WERs (%). The performance from fine-tuning the P- and PS-models is presented. Note that the PS-model, even before seeing the Telephony data, performs reasonably on the task. The fine-tuned PS-model is used to generate NST data for training the student streaming model.

Non-streaming Models: We are able to obtain models that perform better than the fine-tuned streaming models on both tasks by training the ConformerXL-RNNT-P and PS models. The PS-model exhibits improved performance on Test-short, while it is able to lower the Test-long WER by a significant amount compared to any of the streaming baselines.

Distilling to Streaming Models: We attempt to distill the performance of the fine-tuned PS-model to the streaming model by taking a random 20% subset of YT-U and pseudo-labeling it with the model. We apply 50% filtering according to confidence-per-word to generate the NST dataset that is in turn used to train the streaming model. The result of mixing this data with the labeled data to fine-tune the streaming model is given in the last row of Table VI. We are able to improve the Test-long performance of the streaming model by a relative 16% while suffering a minuscule performance loss on Test-short.

V Experiments with Voice Search

We now move on to conduct a series of experiments to explore the effect of pre-training, upstream and downstream self-training and model size scaling for downstream tasks of different scales. To conduct a systematic study, especially with respect to the effect of the scale of the downstream task, we choose to study the Voice Search task [35], which has a large amount of labeled audio, and produce tasks of varying scale by sub-sampling. By doing so we find that the combination of increasing the model size and utilizing a large unlabeled dataset vastly improves labeled-data efficiency.

V-A Data

English (US): Our principal dataset is the English (US) Voice Search dataset, containing 34k hours of labeled voice search audio [35]. As noted before, we sample random 1000-hour and 100-hour subsets, VS-1000h and VS-100h. The transcripts for the audio are tokenized using either graphemes or a 4k-token WPM. The Conformer G models are trained using WPM tokenization, while grapheme tokenization is used for experiments with the XL/XXL models unless indicated otherwise. A 128M-parameter Conformer LM trained on a large corpus of in-domain text is used to further improve the performance of the ASR models.
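One simple way to draw fixed-duration subsets such as VS-100h and VS-1000h is to shuffle the utterances and accumulate them until an hour budget is met. The sketch below is only an assumption about how such sub-sampling could be done; `sample_hours` and the toy corpus are hypothetical and do not describe the actual Voice Search data preparation.

```python
import random


def sample_hours(utterances, target_hours, seed=0):
    """Randomly pick utterances until their total duration reaches target_hours.

    `utterances` is a list of (utt_id, duration_seconds) pairs.
    Returns the chosen ids and the total duration in hours.
    """
    rng = random.Random(seed)
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    budget = target_hours * 3600.0
    picked, total = [], 0.0
    for utt_id, duration in shuffled:
        if total >= budget:
            break
        picked.append(utt_id)
        total += duration
    return picked, total / 3600.0


# Toy corpus: 10,000 utterances of 4 seconds each (about 11.1 hours).
corpus = [(f"utt{i}", 4.0) for i in range(10_000)]
subset, hours = sample_hours(corpus, target_hours=1.0)
print(len(subset), hours)
```

Fixing the seed keeps the subsets reproducible across experiments that compare models on the same labeled budget.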

Non-English: To explore cross-lingual benefits of pre-training, we examine three Voice Search tasks in non-English languages: Hungarian (HU), Chinese (TW) and Hindi (IN) [35]. For each language we prepare an unlabeled YouTube dataset segmented using voice activity detection (VAD [82]), the labeled Voice Search dataset and its 100h and 1000h subsets. The amounts of unlabeled YouTube data and labeled Voice Search data are tabulated in Table VII.

Language YouTube (hrs) Voice Search (hrs)
Hungarian (HU) 400k 9k
Chinese (TW) 900k 20k
Hindi (IN) 800k 27k
TABLE VII: YouTube and Voice Search datasets.

V-B Pre-training

We compare the results from training the ConformerXL-RNNT from scratch on the VS-100h, 1000h and 34kh sets against training the pre-trained model ConformerXL-RNNT-P on these tasks. The results, plotted in the first panel of Figure 2 in yellow and green, illustrate the benefits of pre-training. We find that while the gains from utilizing additional unlabeled and labeled data exist even at very large downstream dataset sizes, the relative improvement in performance decreases.

V-C Scaling-up Model Size

We train RNNT-P models of three different sizes, XL, XXL and G on the VS-100h, 1000h and 34kh datasets. The 8B parameter G models are trained using WPM tokenization. The results are plotted in the second panel of Figure 2.

Meanwhile, we address the phenomenon observed in [26], where it was shown that for LibriSpeech, larger Conformer models performed worse unless they are pre-trained. Their results are plotted in the first panel of Figure 5.

To see if this pattern still holds for very large labeled datasets, we train our Conformers from scratch on the entirety of Voice Search and compare their performances with those of their pre-trained counterparts. We find that in our case, the trend of pre-training being necessary for benefiting from model size does not hold anymore in the 600M to 1B parameter range as is shown in the second panel of Figure 5. The 8B parameter model training fails to converge for this task.

Fig. 5: (Left) The LibriSpeech dev-other performance of Conformer models of varying size and pre-training conditions when trained on LibriSpeech 960h reported in [26]. (Right) Voice Search 34kh test performance of Conformer models.

V-D Cross-lingual Benefits

We explore cross-lingual benefits of pre-training by examining Voice Search tasks in Hungarian (HU), Chinese (TW) and Hindi (IN). For each language, we prepare three Conformer XL RNN-T models: a baseline model with no pre-training, a model pre-trained with English YouTube data and a model pre-trained with YouTube data in the native language. We train each model on the entire Voice Search set and its 100h and 1000h subsets. To make a fair comparison between cross-lingual pre-training and native pre-training, rather than using P-models trained on YT-U, we prepare an unlabeled English YouTube dataset also segmented by a VAD [82] with 926k hours of audio, and pre-train our models on this dataset.

Fig. 6: Test WERs (%) from training pre-trained Conformer XL RNN-T networks on non-English Search datasets and subsets thereof. Both axes are plotted in log-scale.

The results of the experiments are plotted in Figure 6. Consistent with the overall theme of this section, both English and native pre-training are more effective with smaller labeled dataset sizes. While we find cross-lingual benefits of pre-training at 100h and 1000h labeled datasets, we can see English pre-training hurting the performance when training on the full dataset. We can observe, however, that the benefit from native pre-training persists up to the full dataset size for Hungarian (HU) and Chinese (TW).

V-E Upstream Self-training

We fine-tune the ConformerXL-RNNT-PS model, pre-trained with YT-U and self-trained upstream on the YT-T dataset, on the VS-100h, 1000h and 34kh sets. These results, plotted in blue in the first panel of Figure 2, show further gains beyond the pre-trained model for the 100h and 1000h training sets compared to the baselines trained from scratch. The fine-tuned PS-model is not able to achieve better performance than the fine-tuned P-model when trained on the full VS dataset. The effect of upstream self-training on the XXL and G models remains to be investigated.

V-F Downstream Self-training and LM Fusion

We apply downstream noisy student training to VS-100h and 1000h with RNNT-P models. To do so, a teacher model is trained on the same labeled dataset and generates pseudo-labels for a random 20% subset of YT-U to train a student model with. 50% of the teacher-generated transcripts are filtered out based on confidence-per-word and mixed batch-wise with the labeled data. To further experiment with language model fusion, we use WPM tokenization for training the student models.

Experiment Teacher Model Teacher Tokens Teacher WER NST Ratio
XL on 1000h XL-CTC-P WPM 8.0% 0.6
G on 100h XXL-RNNT-P Grapheme 9.4% 0.4
G on 1000h XXL-RNNT-P Grapheme 6.2% 0.6
TABLE VIII: Settings for downstream NST experiments.

We have conducted three experiments, two where the ConformerG-RNNT-P is used as a student for the 100h/1000h tasks, and one where the ConformerXL is used as a student for the 1000h task. Some relevant information for these models is listed in Table VIII. Note that we have used a CTC instead of an RNN-T network as a teacher for the XL model so that its architecture does not exactly coincide with that of the student. The NST ratio indicates the ratio of teacher-generated transcripts within the student training batch.
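The batch-wise mixing controlled by the NST ratio can be sketched as below. This is a toy illustration under stated assumptions: `mixed_batch` is a hypothetical helper, examples are drawn uniformly at random, and the rounding of the teacher-labeled share is our own choice rather than a documented detail.

```python
import random


def mixed_batch(labeled, pseudo_labeled, batch_size, nst_ratio, rng):
    """Draw one batch in which a fraction `nst_ratio` is teacher-labeled."""
    n_pseudo = round(batch_size * nst_ratio)
    n_labeled = batch_size - n_pseudo
    batch = rng.sample(pseudo_labeled, n_pseudo) + rng.sample(labeled, n_labeled)
    rng.shuffle(batch)
    return batch


rng = random.Random(0)
# Toy pools: each example is tagged with its source.
labeled = [("labeled", i) for i in range(100)]
pseudo = [("pseudo", i) for i in range(100)]

batch = mixed_batch(labeled, pseudo, batch_size=10, nst_ratio=0.6, rng=rng)
n_pseudo_in_batch = sum(1 for source, _ in batch if source == "pseudo")
print(n_pseudo_in_batch, len(batch))
```

With an NST ratio of 0.6, as in the 1000h experiments of Table VIII, six of every ten examples in this toy batch come from the teacher-generated transcripts.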

            G on 100h       XL on 1000h     G on 1000h
            No LM   LM      No LM   LM      No LM   LM
Baseline    8.8     7.7     6.5     6.1     5.5     5.0
+ NST       7.8     6.9     6.0     5.5     5.3     5.0
TABLE IX: Voice Search test WERs (%) for Conformer RNNT-P models with and without NST and LM fusion.

As noted in Section V-A, we utilize a Conformer language model for shallow fusion [71] with the trained ASR models. The fusion weight [71] and the non-blank reward [83, 84] are selected by a small random exploration. The result of applying NST and LM fusion is given in Table IX.
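A common form of the shallow fusion score adds a weighted LM log-probability and a per-token reward to the ASR log-probability. The sketch below applies that form to a toy two-hypothesis list; it is an assumption about the general technique rather than the exact scoring used here, the numbers are made up, and in practice the combination is applied during beam search rather than to a finished list.

```python
def fused_score(asr_logprob, lm_logprob, num_tokens, fusion_weight, nonblank_reward):
    """ASR log-prob + weighted LM log-prob + reward per emitted non-blank token."""
    return asr_logprob + fusion_weight * lm_logprob + nonblank_reward * num_tokens


# (text, ASR log-prob, LM log-prob, number of non-blank tokens); toy values.
hyps = [
    ("play sum music", -1.0, -9.0, 3),
    ("play some music", -1.2, -3.0, 3),
]

# Without the LM, the acoustically preferred hypothesis wins.
asr_only_best = max(hyps, key=lambda h: h[1])[0]

# With fusion, the LM penalizes the unlikely word sequence.
fused_best = max(
    hyps,
    key=lambda h: fused_score(h[1], h[2], h[3], fusion_weight=0.3, nonblank_reward=0.5),
)[0]
print(asr_only_best, "->", fused_best)
```

The fusion weight and non-blank reward (0.3 and 0.5 here, chosen arbitrarily) are the two quantities that would be tuned by the small random exploration mentioned above.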

We have not been able to achieve additional gains on the full Voice Search task using downstream NST. Some discussion on this matter is given in the final section.

VI Non-ASR Tasks

We now explore the utility of the representations of pre-trained Conformers for audio classification tasks. In this section, we consider the Conformer XL Non-RA, a Conformer XL model that does not use relative attention, which turns out to outperform its relative attention counterpart on these tasks.

VI-A Non-Semantic Speech (NOSS) Benchmark

Tasks and Datasets: The Non-Semantic Speech Benchmark (NOSS) [36] is a benchmark of speech classification tasks that is used to compare the usefulness of speech representations. The benchmark includes a variety of tasks such as speech emotion recognition [85, 86], speaker identification [87], and language identification [88], but specifically excludes tasks that are focused on the meaning of words. We follow [89] and include two health-speech tasks in our representation evaluation: mask detection during speech [90] and environmental human sounds, like coughing and sneezing [91]. Table X describes the datasets.

Dataset Target Classes Samples Avg duration (s)
VoxCeleb [87] Speaker id 1,251 12,052 8.4
VoxForge [88] Language id 6 176,438 5.8
Speech Commands [92] Command 12 100,503 1.0
CREMA-D [85] Emotion 6 7,438 2.5
SAVEE [86] Emotion 7 480 3.8
Masked Speech [90] Mask wearing 2 36,554 1.0
ESC-50 human [91] Non-speech sounds 10 386 13.8
TABLE X: Non-Semantic Speech Benchmark (NOSS) datasets. Average duration is in seconds. The VoxCeleb set is a subset of VoxCeleb filtered according to YouTube’s privacy guidelines. ESC-50 human is a subset of ESC-50 with human sound labels.
Model                    VoxCeleb1  Voxforge   Speech Commands  CREMA-D    SAVEE      Masked Speech  ESC-50 human  DementiaBank
Previous SoTA            -          95.4 [93]  97.9 [94]        74.0 [95]  70.0 [89]  73.0 [96]      93.9 [36]     80.6 [97]
Baselines
  TRILL [36]             13.1       84.5       77.6             65.8       65.0       65.3           86.4          63.7
  FRILL [89]             13.8       78.8       74.4             71.3       63.3       67.2           87.9          68.7
  YAMNet [98]            9.6        79.8       78.5             66.4       69.2       59.6           93.9          62.7
  ASR Encoder [99]       5.2        98.9       96.1             71.8       85.0       54.4           75.8          63.7
Our Results
  ConformerXL-P          49.4       99.7       95.2             86.8       92.5       68.0           89.4          60.8
  ConformerXL-P Non-RA   50.3       99.7       97.5             88.2       92.5       73.4           89.4          72.5
  ConformerXXL-P         53.3       99.6       96.3             85.5       87.5       68.6           90.9          65.7
  ConformerG-P           48.9       99.8       90.1             87.1       90.0       61.3           70.3          54.9
TABLE XI: Accuracies (%) for NOSS tasks. A subset of VoxCeleb1 filtered according to YouTube’s privacy guidelines has been used. The Masked Speech task performance is reported using unweighted average recall [90] instead of accuracy. The previous SoTA result for CREMA-D [95] uses audio and visual features, and that for DementiaBank [97] uses acoustic and textual features. For YAMNet, layer 10 is used as in [36].

Evaluation: We modify the evaluation method described in [36]. For every (model, layer, task) triplet, we train three types of linear models using the Scikit-Learn library [100] (logistic regression, balanced logistic regression, linear discriminant analysis). Given the model and task, we select the layer and regression method that yield the maximum dev-set performance, and report the test-set performance of that configuration in Table XI against previously reported state-of-the-art results. In the second block of rows in the table, we present the results of baseline NOSS task evaluations we have conducted using audio representations obtained by methods previously studied in the literature [36, 89, 99, 98].

  • We achieve new SoTA on 4/7 public tasks using only a task-specific linear layer (see Table XI). We exclude our VoxCeleb results from this group, since we use a filtered subset that does not have associated public benchmarks. Linear models trained on the Conformer embeddings outperform previously reported results on Voxforge, CREMA-D, SAVEE and Masked Speech that were achieved by using complex, task-specific architectures and training.

  • Conformers on acoustic features alone are competitive with other models that use multimodal data. In particular, we achieve 20% relative improvement on the previous SoTA CREMA-D result obtained using visual and acoustic features. Our results are worse than SoTA on DementiaBank, which has been achieved by using additional textual information along with the acoustic features [97].

  • The Conformer XL without relative attention consistently produces the best-performing features (see Table XI). This is also confirmed by the analysis using the average accuracy measure presented below.

  • On most tasks, all four Conformer models outperform all previous numbers from the other non-semantic speech representations ("Baseline" rows in Table XI), with the exception of the ESC-50 performance of YAMNet, whose supervised training set has labeled classes that form a superset of those of ESC-50.
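The selection rule described under Evaluation (choose the layer and linear model by dev-set performance, then report the corresponding test-set number) can be sketched as follows. The scores are made up for illustration and `select_by_dev` is a hypothetical helper. In Scikit-Learn, the three model types would plausibly correspond to `LogisticRegression`, `LogisticRegression(class_weight="balanced")` and `LinearDiscriminantAnalysis`, although that mapping is our assumption.

```python
def select_by_dev(results):
    """Pick the configuration with the best dev score and return its test score.

    `results` maps (layer, method) -> (dev_accuracy, test_accuracy).
    """
    best_config = max(results, key=lambda cfg: results[cfg][0])
    _, test_acc = results[best_config]
    return best_config, test_acc


# Made-up scores for a single (model, task) pair.
results = {
    (8, "logistic_regression"): (0.81, 0.79),
    (8, "balanced_logistic_regression"): (0.83, 0.80),
    (8, "lda"): (0.78, 0.82),
    (12, "logistic_regression"): (0.86, 0.84),
    (12, "balanced_logistic_regression"): (0.85, 0.85),
    (12, "lda"): (0.80, 0.81),
}

config, reported = select_by_dev(results)
print(config, reported)
```

Note that the reported test accuracy is not the highest test value in the toy grid: the test set plays no role in choosing the configuration.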

Average Accuracy Measure: As in [89], we use the average accuracy over all NOSS tasks as a metric for quantifying the general quality of an embedding. In Figure 7, we plot the average accuracy measure as a function of the model layer, starting from the positional embedding layer indexed as layer -1. The curves for the Conformer models truncate at the penultimate layer of the model.

Fig. 7: Average accuracy measure for each Conformer model and layer. The model’s best layer is marked. The average accuracy measure for ASR Encoder [99], TRILL [36], FRILL [89], and YAMNet layer 10 [98] are marked as horizontal lines for reference.
  • We find that embeddings from the majority of Conformer layers outperform previous results. This suggests the general usefulness of the representations within the pre-trained Conformer models for non-semantic speech tasks, not just tasks related to language.

  • The best Conformer layer is in the middle of the network, not at the penultimate position. This suggests that even though the Conformer learned a new SoTA non-semantic speech representation, the training procedure could be modified for further improvement.

VI-B AudioSet

Data and Overview: In the preceding sections, we have demonstrated the capability of our Conformer models in a wide variety of speech processing tasks. We now evaluate the utility of our unsupervised representation for the non-speech task of audio event classification. For this, we consider the commonly used AudioSet benchmark [37], which includes nearly 2 million audio clips, each with one or more labels drawn from an ontology of 527 classes. More specifically, we use the unsupervised representation evaluation protocol used by several recent papers. This involves 1) using our pre-trained Conformer models to extract a fixed-dimensional frame-level representation for each clip; 2) using AudioSet to train a simple frame-level classifier on top consisting of a single hidden layer with 512 ReLU units followed by a linear classification layer with 527 independent sigmoid outputs; 3) applying this to each evaluation clip and mean pooling the frame-level scores to report an overall mean average precision value. Past approaches that have used this evaluation paradigm have used the same AudioSet training data for unsupervised pre-training. Thus, in addition to reporting performance using our models pre-trained with YT-U, we also pre-trained an additional ConformerXL with AudioSet to limit domain mismatch and make results more comparable to past work. Table XII compares the AudioSet classification performance for several past approaches and various representations derived from the pre-trained ConformerXL (both with and without relative attention). For each Conformer model, we report both the performance at the output (layer 24) and at the best internal Conformer layer as determined using a small portion of AudioSet train held out as a development set.
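Step 3 of the protocol above (mean pooling frame-level scores per clip, then computing mean average precision over classes) can be sketched in plain Python. This is a toy illustration with two classes and made-up scores: the helper names are hypothetical, the frame-level classifier itself is omitted, and we assume every class has at least one positive clip.

```python
def mean_pool(frame_scores):
    """Average per-frame class scores into one clip-level score vector."""
    n_frames = len(frame_scores)
    n_classes = len(frame_scores[0])
    return [sum(f[c] for f in frame_scores) / n_frames for c in range(n_classes)]


def average_precision(scores, labels):
    """AP for one class: mean precision at the rank of each positive clip."""
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def mean_average_precision(clip_scores, clip_labels):
    """Average the per-class AP over all classes."""
    n_classes = len(clip_scores[0])
    aps = [
        average_precision([s[c] for s in clip_scores], [l[c] for l in clip_labels])
        for c in range(n_classes)
    ]
    return sum(aps) / n_classes


# Three toy clips, two frames each, two classes (sigmoid-like scores).
frames = [
    [[0.9, 0.1], [0.7, 0.3]],
    [[0.2, 0.8], [0.4, 0.6]],
    [[0.6, 0.4], [0.6, 0.6]],
]
labels = [[1, 1], [0, 1], [0, 0]]  # multi-label targets per clip

clip_scores = [mean_pool(f) for f in frames]
m_ap = mean_average_precision(clip_scores, labels)
print(round(m_ap, 4))
```

Because the outputs are independent sigmoids, each class is ranked and scored separately before averaging.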

Representation           Pre-training  Modalities  mAP
Previous SoTA
  Multi-format [102]     AudioSet      A           0.329
  BRaVE [103]            AudioSet      A,V         0.347
ConformerXL
  Output Layer           YT-U          A           0.192
  Best Layer (10)        YT-U          A           0.304
ConformerXL Non-RA
  Output Layer           YT-U          A           0.234
  Output Layer           AudioSet      A           0.222
  Best-Dev Layer (10)    YT-U          A           0.308
  Best-Dev Layer (10)    AudioSet      A           0.340
TABLE XII: AudioSet shallow classifier benchmark results in mean average precision (mAP). Following standard practice for evaluating unsupervised representations [101], the representation is held fixed and the entire AudioSet train set is used to train an MLP classifier with a single 512-unit hidden layer. Some prior work uses both audio spectrograms (A) and video (V) as inputs during training where indicated. All results shown only use spectrograms as input for evaluation.

Results: We can draw several conclusions from these results. First, we find that while the representation defined by the final Conformer output layer lags behind all past approaches, selecting internal layers performs significantly better, as was the case for the NOSS experiments above. This indicates that the wav2vec 2.0 pre-training objective is not well matched to the audio event classification task, but that it still induces useful internal representations. Second, relative attention results in a small degradation of performance for this task, which is in line with NOSS and counter to the ASR results. Like the NOSS tasks, the AudioSet benchmark involves whole-clip level predictions, which we hypothesize likely reduces the value of relative positional information. Third, we find that even when pre-training on speech alone (YT-U), we still learn a representation at layer 10 that performs respectably on the AudioSet benchmark (0.308 mAP) compared with past approaches, even though this model has not been exposed to the diversity of the AudioSet ontology. However, when we retrain ConformerXL using in-domain AudioSet data, we achieve additional improvement. Our AudioSet-pre-trained ConformerXL performance of 0.340 mAP outpaces all past work that used spectrograms alone for training and evaluation. Most past unimodal approaches rely heavily on augmentation in the learning procedures, the types of which need to be carefully chosen to optimize for this task. However, the Conformer model obtains strong performance without using augmentation of any kind, instead driven solely by the architectural design and pre-training objective.

VII Discussion and Future Directions

VII-A ASR Model and Data Efficiency

Data Efficiency: We find that increasing model size, pre-training, and upstream and downstream self-training all improve performance, but mainly by improving labeled-data efficiency. When the labeled dataset grows very large, the effect of many of these methods becomes smaller.

Pre-training and Training Stability for Large Models: We find that training on labeled datasets becomes harder and more brittle as the model gets very large. The problem is exacerbated by the training cost of the larger models, which makes hyperparameter tuning prohibitively expensive. As a result, our 8B model becomes practically un-trainable from scratch on the full Voice Search dataset. Training of large models becomes stable once they are pre-trained.

Model Compression: A natural direction of research that emerges from this work is to find ways to practically benefit from the performance gains achieved by giant models. Research on how to compress giant models with minimal performance loss will be a crucial element for such endeavors.

VII-B Downstream ASR Tasks

P- or PS-models?: Since PS-models utilize additional upstream labeled data, they should be expected to perform better than P-models on small downstream tasks in general, which is what we find. Meanwhile, there are factors that make the PS-models perform worse. The PS-models are trained in a text environment (e.g., text normalization, tokenization) optimized for the upstream task, while the P-models can be trained from the beginning in an environment native to the downstream task, which might differ significantly from that of the upstream task. Furthermore, differences in these factors (e.g., tokenization) might make it unwieldy to use available downstream resources (e.g., LM fusion).

Downstream NST training for full Voice Search sets: Applying the standard downstream NST recipe used in this paper (pseudo-labeling either YT-U or YT-T, filtering out 50% using confidence-per-word, and mixing the pseudo-labeled data with labeled data) has not improved full Voice Search set performance. Our experiments, conducted with the English (US) and Hungarian (HU) full Voice Search sets, have in fact led to a slight degradation of performance. A tentative conclusion is that pseudo-labeling becomes less effective when the labeled dataset size becomes very large, although a more thorough investigation would be needed to sharpen this assertion.

VII-C Downstream Non-ASR Tasks

Beyond Encoded Features: In this work, we have restricted the use of large pre-trained models on non-ASR tasks to that of feature encoders. It would be interesting to go beyond this use to further improve performance on audio classification tasks in general.

Appendix A Experiment Details

A-A Pre-training

Some pre-training parameters are summarized in Table XIII. Pre-training has been carried out using Google Cloud TPU V3 chips.

Model Batch Size # TPU Cores Days Epochs Warm-up Steps Peak LR
XL 4096 512 5 10 25k 1e-3
XXL 4096 1024 8 10 25k 1e-3
G 1280 1024 18 4 50k 1e-4
TABLE XIII: Pre-training parameters.

A-B Voice Search

P-Models: When training XL/XXL P-models on VS-100h, 1000h and 34kh, a fixed transformer learning rate schedule is used for all three datasets, while the batch size is scaled up by a factor of 4 for each bigger task. The decoder learning rate schedule has peak learning rate 1e-3 and warm-up steps 1.5k for both models. Meanwhile, the encoder learning rate schedule has peak learning rate 3e-4/2.4e-4 for the XL/XXL models respectively, with 5k warm-up steps. The batch size for the VS-100h task is set to 128.

PS-Models: For fine-tuning XL PS-models on VS-100h, 1000h and 34kh, we use a transformer learning rate schedule with a fixed number of warm-up steps, but adjust the learning rate for the three tasks. The batch size is scaled up by a factor of 4 for each bigger task, while the learning rate is scaled up by a factor of 3 accordingly. For the PS-models, the encoder and decoder learning rate schedules are set to be the same. The VS-100h task learning rate schedule has peak learning rate 3e-5 and 5k warm-up steps. The batch size is set to 128.
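The scaling rule for the PS-models can be made concrete with a short sketch. Two assumptions are made here: the "transformer learning rate schedule" is taken to be the common form with linear warm-up followed by inverse-square-root decay, parameterized by its peak value, and the factor of 3 is taken to apply to the peak learning rate. The helper names are hypothetical.

```python
def transformer_lr(step, peak_lr, warmup_steps):
    """Linear warm-up to peak_lr, then inverse-square-root decay (assumed form)."""
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)


def scaled_settings(base_batch, base_peak_lr, num_tasks, batch_factor=4, lr_factor=3):
    """Batch size and peak LR for each successively larger task."""
    return [
        (base_batch * batch_factor**k, base_peak_lr * lr_factor**k)
        for k in range(num_tasks)
    ]


tasks = ["VS-100h", "VS-1000h", "VS-34kh"]
# VS-100h values from the text: batch size 128, peak LR 3e-5, 5k warm-up steps.
settings = dict(zip(tasks, scaled_settings(128, 3e-5, len(tasks))))
for name, (batch_size, peak_lr) in settings.items():
    print(name, batch_size, f"{peak_lr:.1e}")

# Learning rate at a few steps for the VS-100h schedule.
lr_mid_warmup = transformer_lr(2_500, 3e-5, 5_000)
lr_at_peak = transformer_lr(5_000, 3e-5, 5_000)
lr_late = transformer_lr(20_000, 3e-5, 5_000)
```

Under these assumptions the three tasks would use batch sizes 128, 512 and 2048 with peak learning rates of 3e-5, 9e-5 and 2.7e-4.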

Training from Scratch: The Conformer XL and XXL models trained from scratch have been trained with batch size 1024 using a transformer learning rate schedule with peak learning rate 1.8e-3/3.5e-4 respectively and 33k warm-up steps.

Noisy Student Training: Training parameters for training with the combined supervised and teacher-generated data are kept the same as for supervised training. We find that the model performance improves over a longer period of time, and that convergence requires 2x to 4x the training time of supervised training.

A-C Public Datasets

SpeechStew: The hyperparameters used for training the 1B-parameter P-model on the SpeechStew data are equivalent to those given in section 3 of [30] for training their 1B-parameter model pre-trained on Libri-Light. In particular, the supervised training is carried out for 100k steps with batch size 2048. For noisy student training, the model is trained for 200k steps with batch size 1024.

CHiME-6: The hyperparameters used for training the 1B-parameter PS-model on the CHiME-6 data are equivalent to those given in section 3.1 of [30] for training their 1B-parameter model pre-trained on Libri-Light and trained upstream on SpeechStew.


We would like to thank Daniel Adiwardana, Tony Bruguier, Yuan Cao, Zhehuai Chen, Mike Chrzanowski, Alexis Conneau, Xiangyu Dong, Thibault Doutre, Peter Gavin, Blake Hechtman, Ye Jia, Guangda Lai, Benjamin Lee, Chris Lee, Thang Luong, Andy Ly, Marcello Maggioni, Ananya Misra, Erica Moreira, Mohammad Norouzi, Tayo Oguntebi, Bramandia Ramadhana, Andrew Rosenberg, Ruoxin Sang, Jonathan Shen, Trevor Strohman, Weiran Wang, Haoyu Zhang and Yazhou Zu for useful discussions. We also thank Claire Cui and Johan Schalkwyk for their support of this work.