One challenge faced by modern ASR systems is that, with ever-growing model capacity, large amounts of labeled data are required to train them thoroughly. Unfortunately, collecting and transcribing huge datasets is expensive and time-consuming. As a result, semi-supervised ASR has become an important research direction, with the goal of leveraging a large amount of unlabeled data together with a much smaller amount of labeled data for training. One of the simplest methods in this setting is self-training, which uses the decoding results, or pseudo-labels, on unsupervised data, often at the word level, to augment supervised training. It has been shown to be very effective with traditional ASR pipelines [25, 15, 38, 32].
In this work, we propose a novel framework for self-training in an end-to-end fashion. Starting from a carefully trained Connectionist Temporal Classification (CTC) system, we alternate between two procedures: generating pseudo-labels with a token-level decoder on a mini-batch of unsupervised utterances, and using the just-decoded (input, pseudo-label) pairs, with data augmentation, for supervised training. We show that this method can be derived from alternating optimization of a unified objective over the acoustic model and the non-observed labels of the unsupervised data. The two procedures effectively reinforce each other, leading to increasingly accurate models.
We emphasize a few important aspects of our method, which distinguish our work from others (detailed discussions on related work are provided later):
- The pseudo-labels we use are discrete, token-level label sequences, rather than per-frame soft probabilities.
- The pseudo-labels are generated on the fly rather than in one shot, since fresh labels are of higher quality than those produced by a stale model.
- We perform data augmentation not only on supervised data but also on unsupervised data.
These modeling choices, which lead to performance gains over alternatives, are backed up by our empirical results. We demonstrate our method on the Wall Street Journal (WSJ) corpus (obtained from LDC under catalog numbers LDC93S6B and LDC94S13B). Our method improves PER by 31.6% relative on the development set and WER by 14.4% relative on the test set over a well-tuned base system, bridging 50% of the gap between the base system and an oracle system trained with ground-truth labels for all data.
In the rest of this paper, we review the supervised component of our method in Section 2, give a detailed description of the proposed method in Section 3, compare with related work on semi-supervised ASR in Section 4, provide comprehensive experimental results in Section 5, and conclude with future directions in Section 6.
2 Supervised learning for ASR
Before describing the proposed method, we briefly review the supervised component of our system: CTC with data augmentation.
2.1 End-to-end ASR with CTC
Given an input sequence $X = (x_1, \dots, x_T)$ and the corresponding label sequence $y = (y_1, \dots, y_L)$, CTC introduces an additional <blank> token and defines the conditional probability
$$p(y|X) = \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} p_t(\pi_t | X),$$
where $\mathcal{B}^{-1}(y)$ is the set of all paths (frame alignments) $\pi$ that reduce to $y$ after removing repetitions and <blank> tokens, and $p_t(k|X)$ is the posterior probability of token $k$ at the $t$-th frame given by the acoustic model. The underlying assumption is that, conditioned on the entire input sequence $X$, the probability of a path decouples over the frames. The CTC loss for one utterance is then defined as $\ell_{\mathrm{CTC}}(X, y) = -\log p(y|X)$. CTC training minimizes the average loss over a set of labeled utterances. It is well known that after training, the per-frame posteriors from the acoustic model tend to be peaky: at most frames the most probable token is <blank> with high confidence, indicating "no emission".
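As a concrete illustration of the marginalization over paths, the following self-contained sketch computes the CTC probability by brute-force enumeration. The posterior matrix is a hypothetical toy example, not output from our system; a real implementation uses the forward-backward dynamic program instead of enumeration.

```python
import itertools

BLANK = 0  # index of the <blank> token

def collapse(path):
    """Reduce a frame-level path: merge consecutive repeats, then drop blanks."""
    out = []
    for t, tok in enumerate(path):
        if tok != BLANK and (t == 0 or tok != path[t - 1]):
            out.append(tok)
    return tuple(out)

def ctc_prob(posteriors, label):
    """p(label | X): sum of path probabilities over all paths that
    collapse to `label`.  Path probability decouples over frames,
    per the CTC independence assumption.  Exponential in T; for
    illustration only."""
    T, V = len(posteriors), len(posteriors[0])
    total = 0.0
    for path in itertools.product(range(V), repeat=T):
        if collapse(path) == tuple(label):
            p = 1.0
            for t, tok in enumerate(path):
                p *= posteriors[t][tok]
            total += p
    return total

# Hypothetical peaky posteriors over {<blank>, 'a', 'b'} for T=3 frames.
post = [[0.8, 0.15, 0.05],
        [0.1, 0.85, 0.05],
        [0.9, 0.05, 0.05]]
p_a = ctc_prob(post, [1])  # probability that the utterance reads "a"
```

Since the label classes partition all paths, summing `ctc_prob` over every distinct collapsed label yields exactly 1.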
Due to the abovementioned independence assumption, CTC does not explicitly model transition probabilities between labels, and thus decoding, i.e., solving $\arg\max_y p(y|X)$, is relatively straightforward. The simplest decoder for CTC is the greedy one, which picks the most probable token at each frame and then collapses them by removing repetitions and <blank>'s; we mostly use this decoder as it is extremely efficient. One can improve on the greedy decoder by maintaining a list of hypotheses at each frame, leading to a beam search decoder with beam size $B$. When the modeling units are subwords but word-level hypotheses are desired, one can incorporate a lexicon and language models, which can be implemented efficiently in the WFST framework. We do not use a word-level decoder for generating pseudo-labels, since it is much slower than token-level beam search; we only use it for evaluating word error rates (WERs). It should be noted that our self-training method can make use of attention-based systems [8, 6] as well. We use CTC mainly due to its simplicity and efficiency in decoding, for generating pseudo-labels on the fly.
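A minimal sketch of the greedy decoder (the posteriors below are hypothetical; real systems typically work with log-probabilities for numerical stability):

```python
def greedy_decode(posteriors, blank=0):
    """Pick the most probable token at each frame, then collapse:
    merge consecutive repeats and remove blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in posteriors]
    out, prev = [], None
    for tok in best:
        if tok != blank and tok != prev:
            out.append(tok)
        prev = tok
    return out

# Hypothetical peaky CTC posteriors over {<blank>=0, 'a'=1, 'b'=2}.
post = [[0.9, 0.05, 0.05],   # <blank>
        [0.1, 0.80, 0.10],   # a
        [0.2, 0.70, 0.10],   # a (repeat, merged away)
        [0.8, 0.10, 0.10],   # <blank>
        [0.1, 0.10, 0.80]]   # b
print(greedy_decode(post))   # -> [1, 2], i.e. "ab"
```

Because of the peaky posteriors discussed above, this single-pass decoder is already a strong pseudo-label generator.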
2.2 Data augmentation
To alleviate the data sparsity issue, a natural approach that does not require unsupervised data is to augment the training data with distorted versions of itself. Various data augmentation techniques have demonstrated consistent improvements for ASR [22, 27, 42, 31]. This simple way of obtaining additional supervised training signal helps us improve our base system, which in turn generates pseudo-labels of higher quality on unsupervised data.
In this work, we adopt the speed perturbation and spectral masking techniques from prior work. Both techniques perturb the inputs at the spectrogram feature level. One can view an input utterance as an image of dimension $F \times T$, where $F$ is the number of frequency bins and $T$ the number of frames. Speed perturbation performs linear interpolation along the time axis, as in an image resizing operation; two speed factors are used here. Spectral masking selects segments of the input along the frequency axis at random locations, with widths drawn uniformly up to a maximum width, and similarly selects segments along the time axis, again with widths up to a maximum. We perform a grid search over these hyperparameters for the supervised CTC system, and fix them in all experiments.
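The two perturbations can be sketched as follows. This is a minimal numpy version with hypothetical mask counts, mask widths, and speed factors, not our tuned settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def speed_perturb(spec, factor):
    """Resize along the time axis by linear interpolation, like an
    image resize; spec has shape (F, T)."""
    F, T = spec.shape
    new_T = max(1, int(round(T / factor)))
    src = np.linspace(0, T - 1, new_T)      # fractional source frame indices
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, T - 1)
    w = src - lo
    return spec[:, lo] * (1 - w) + spec[:, hi] * w

def spectral_mask(spec, n_freq=2, max_f=4, n_time=2, max_t=10):
    """Zero out random frequency bands and time segments, with widths
    drawn uniformly up to the given maxima."""
    spec = spec.copy()
    F, T = spec.shape
    for _ in range(n_freq):
        f = rng.integers(0, max_f + 1)
        f0 = rng.integers(0, max(1, F - f))
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time):
        t = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(1, T - t))
        spec[:, t0:t0 + t] = 0.0
    return spec

utt = rng.standard_normal((40, 200))   # 40 freq bins x 200 frames
fast = speed_perturb(utt, 1.1)         # ~10% faster: fewer frames
masked = spectral_mask(utt)            # same shape, random regions zeroed
```

Applying speed perturbation with factors below and above 1.0, plus the original, is what yields the 3x supervised set mentioned in Section 5.1.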
3 Leveraging unsupervised data with self-training
After a base system is sufficiently trained on supervised data, it can be used to predict labels on the originally non-transcribed data. If we take the confident predictions and assume that they are correct, we can add the inputs and the predictions (pseudo-labels) into training. If the noise in the pseudo-labels is sufficiently low, the acoustic model can benefit from the additional training data and obtain improved accuracy. We propose to repeat the pseudo-label generation and augmented training steps, so that the two reinforce each other and both continuously improve. In our method, for each update, we generate pseudo-labels for a mini-batch of unsupervised utterances using the current acoustic model with beam search, and compute the CTC losses for these utterances against their most probable hypotheses. The losses for unsupervised utterances are discounted by a tunable factor to accommodate label noise, and combined with the CTC loss on supervised data to derive the next model update. A schematic diagram of our self-training method is provided in Fig. 1.
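One update of this procedure can be sketched as follows. The model, decoder, loss, and `augment` below are hypothetical stand-ins (the real system uses a BLSTM acoustic model, token-level beam search, and CTC loss in TensorFlow); the sketch only shows the control flow of a combined update.

```python
class DummyModel:
    """Hypothetical stand-in exposing just the interfaces the loop needs."""

    def decode(self, x):
        return [1, 2]            # pretend token-level beam search output

    def ctc_loss(self, x, y):
        return float(len(y))     # pretend CTC loss value

    def apply_gradients(self, loss):
        pass                     # pretend optimizer step


def augment(x):
    return x                     # pretend speed perturbation / masking


def self_training_step(model, sup_batch, unsup_batch, tradeoff):
    """One combined update: supervised CTC loss plus a discounted CTC
    loss on freshly pseudo-labeled unsupervised utterances."""
    # 1) Generate pseudo-labels on the fly with the current model.
    pseudo = [model.decode(x) for x in unsup_batch]
    # 2) Augment both supervised and unsupervised inputs; pseudo-labels
    #    decoded from the original inputs are reused for distorted versions.
    sup = [(augment(x), y) for x, y in sup_batch]
    unsup = [(augment(x), y) for x, y in zip(unsup_batch, pseudo)]
    # 3) Combined loss; the unsupervised term is discounted to
    #    accommodate pseudo-label noise.
    loss = (sum(model.ctc_loss(x, y) for x, y in sup) / len(sup)
            + tradeoff * sum(model.ctc_loss(x, y) for x, y in unsup) / len(unsup))
    model.apply_gradients(loss)
    return loss
```

Decoding per mini-batch (rather than once over the whole set) is what keeps the pseudo-labels fresh as the model improves.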
Equivalently, we can formulate our method as minimizing the following objective:
$$\min_{W,\; \hat{y}_1, \dots, \hat{y}_N} \;\; \frac{1}{M} \sum_{i=1}^{M} \ell_{\mathrm{CTC}}(X_i, y_i; W) \;+\; \lambda \cdot \frac{1}{N} \sum_{j=1}^{N} \ell_{\mathrm{CTC}}(\tilde{X}_j, \hat{y}_j; W), \tag{1}$$
where $\ell_{\mathrm{CTC}}$ denotes the CTC loss, we have $M$ supervised utterances $(X_i, y_i)$ and $N$ unsupervised utterances $\tilde{X}_j$, $W$ denotes the weight parameters of the acoustic model, $\lambda \ge 0$ is a trade-off parameter, and the (non-observed) label sequences $\hat{y}_j$ of the unsupervised utterances are included as variables. This is a well-defined learning objective, and our method effectively performs alternating optimization over the $\hat{y}_j$ (by beam search) and the weights $W$ (by gradient descent) over mini-batches. Additionally, we can perform data augmentation on the unsupervised data, by using the label sequence decoded from the original input as the target for its distorted versions. We will show experimentally that augmenting unsupervised data is as effective as augmenting supervised data.
Our method is motivated by and similar to unsupervised data augmentation (UDA) for semi-supervised learning, in that both methods use pseudo-labels and data augmentation on unsupervised data. But there is a crucial difference between the two: UDA uses soft targets (the previous model's outputs) for calculating the unsupervised loss, which encourages the model not to deviate much from the previous step; in fact, if there is no data augmentation, the loss on unsupervised data would be zero and have no effect on learning. In contrast, we use the discrete label sequence, i.e., the output of the beam search decoder run on the soft targets, for each unsupervised utterance, which provides a stronger supervised signal. While UDA was not originally applied to sequence data, we have implemented a sequence version of it, using the per-frame posterior probabilities as soft targets and minimizing the cross-entropy loss between soft targets and model outputs at each frame; otherwise the implementation of UDA mirrors that of our method. As demonstrated later, our method outperforms UDA by a large margin.
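The frame-level soft-target loss of our UDA baseline can be sketched as follows (the targets below are hypothetical; in the baseline they are the previous model's per-frame posteriors, interpolated consistently with the input augmentation):

```python
import numpy as np

def soft_target_loss(log_probs, soft_targets):
    """Per-frame cross-entropy between fixed soft targets and the
    current model's log-probabilities, averaged over frames."""
    return float(-(soft_targets * log_probs).sum(axis=1).mean())

# Hypothetical soft targets for 4 frames over 2 tokens.
targets = np.array([[0.9, 0.1],
                    [0.8, 0.2],
                    [0.1, 0.9],
                    [0.5, 0.5]])
# If the model matches the targets exactly, the loss equals the
# average entropy of the targets rather than zero: the objective only
# discourages deviation from the previous model.
loss = soft_target_loss(np.log(targets), targets)
```

By contrast, our method first discretizes such posteriors into a label sequence via the decoder, and scores it with the CTC loss.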
In view of the peaky per-frame posterior distributions of CTC models, our approach has the advantage that the pseudo-labels are naturally high-confidence predictions, relieving us from setting a threshold for discretizing soft probabilities. Although the alignment, i.e., the locations of non-<blank> tokens, can be imprecise in CTC systems, this is not an issue, as we use only the label sequence and not its alignment when computing the unsupervised CTC loss, which marginalizes over all possible alignments. In this regard, end-to-end systems give a more elegant formulation for self-training than traditional hybrid systems, which rely on alignments.
4 Related work
Semi-supervised ASR has been studied for a long time, and self-training has been one of the most successful approaches for traditional ASR systems (see, e.g., [25, 15, 38] and references therein). It has been observed that in self-training, the quality of the pseudo-labels plays a crucial role, and much of the research is dedicated to measuring the confidence of pseudo-labels and selecting high-confidence ones for supervised training [15, 38]. The issue of label quality becomes even more prominent with LSTM-based acoustic models, which have a high capability for memorization. In a similar spirit, a student-teacher learning approach has been used on hybrid systems, to improve the accuracy of a student model using soft targets provided by a teacher on a million hours of non-transcribed data.
Another line of work leverages unpaired speech and text data by combining ASR with Text-to-Speech (TTS) modules, with a training loss that encourages pseudo-labels from ASR to reconstruct audio features well through the TTS system, and TTS outputs to be recognized well by ASR. The authors propose techniques to allow gradient backpropagation through the modules and to alleviate the loss of audio information during text decoding. Alternatively, one can map audio data with the encoder of the ASR model, and text data with another encoder, into a common space, from which text is predicted (from the ASR side) or reconstructed (from the text side) with a shared decoder; an additional regularization term encourages the representations of paired audio and text to be similar. The common intuition behind these works is that of auto-encoders, the most straightforward method for unsupervised learning. Yet another approach uses adversarial training to encourage ASR outputs on unsupervised data to have a distribution similar to that of unpaired text data, using a criticizing language model. Our model is much simpler than the above ones, in that we do not use additional neural network models for the text modality; rather, an efficient decoder is used to discretize the acoustic model outputs, and the pseudo-labels are immediately used as targets for acoustic model training.
Shortly before the submission of our paper, a concurrent work came to our attention, which also adopts an end-to-end self-training approach. A few differences between our work and theirs are as follows: first, we evaluate our method with a CTC-based ASR model whereas they use an attention-based model; second, we use data augmentation on both labeled and unlabeled data and show that both are useful, whereas they do not; third, our method is simpler, as we use neither a word-level language model nor ensemble methods for generating pseudo-labels; finally, our pseudo-labels are generated on the fly, whereas they generate pseudo-labels on the entire unlabeled dataset once.
5 Experiments
In the following, we demonstrate the abovementioned techniques on the WSJ corpus. We use the si84 partition (7040 utterances) as supervised data, and the si284 partition (37.3K utterances) as unsupervised data. The dev93 partition (503 utterances) is used as the development set for all hyperparameter tuning, and the eval92 partition (333 utterances) as the test set. This setup is commonly used to demonstrate semi-supervised ASR [13, 4, 24]. For input features, we extract 40-dimensional LFBEs with a window size of 25ms and a hop size of 10ms from the audio recordings, and perform per-speaker mean normalization. Furthermore, we stack every 3 consecutive input frames to reduce the input sequence length (after data augmentation), which speeds up training and decoding.
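The frame stacking step can be sketched as follows (a minimal numpy version; how leftover frames are handled is an assumption here, and we simply truncate them):

```python
import numpy as np

def stack_frames(feats, k=3):
    """Concatenate every k consecutive frames into one super-frame,
    reducing sequence length by a factor of k (leftover frames dropped)."""
    T, D = feats.shape
    T_k = T // k
    return feats[:T_k * k].reshape(T_k, k * D)

x = np.zeros((100, 40))       # 100 frames of 40-dim LFBEs
print(stack_frames(x).shape)  # -> (33, 120)
```

Shorter sequences reduce both the LSTM unrolling length and the number of decoding steps.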
The token set used by our CTC acoustic models consists of the 351 position-dependent phones together with the <blank> symbol, as generated by the Kaldi s5 recipe. Acoustic model training is implemented in TensorFlow, and we use its beam search algorithm both for generating pseudo-labels (with beam size $B$) and for evaluating PERs on dev/test (with a fixed beam size of 20). To report word error rates (WERs) on the evaluation sets, we adopt the WFST-based framework with the lexicon and the trigram language model with a 20K vocabulary provided by the recipe, and perform beam search using Kaldi's decode-faster with a beam size of 20. Different positional versions of the same phone are merged before word decoding, and we use phone counts calculated from si84 to convert posterior probabilities (acoustic model outputs) to likelihoods.
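The posterior-to-likelihood conversion divides the posteriors by class priors estimated from training-set counts, a standard step in hybrid-style WFST decoding. A minimal sketch, with hypothetical counts and a floor added to avoid division by zero:

```python
import numpy as np

def posterior_to_loglik(log_post, counts, floor=1e-8):
    """Scaled log-likelihood: log p(x|s) = log p(s|x) - log p(s) + const,
    with priors p(s) estimated from token counts on the training set."""
    priors = counts / counts.sum()
    return log_post - np.log(np.maximum(priors, floor))

counts = np.array([900.0, 60.0, 40.0])        # hypothetical phone counts
log_post = np.log(np.array([0.7, 0.2, 0.1]))  # one frame of posteriors
loglik = posterior_to_loglik(log_post, counts)
```

Without this division, frequent tokens such as <blank> would dominate the decoder's acoustic scores.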
Throughout the experiments, our acoustic model consists of 4 bi-directional LSTM layers with 512 units in each direction. For model training, we use ADAM with an initial learning rate tuned by grid search. We apply dropout with the rate tuned over {0.0, 0.1, 0.2, 0.5}, which consistently improves accuracy. We use the dev set PER, evaluated at the end of each training epoch, as the criterion for hyperparameter search and model selection.
5.1 Base system with data augmentation
As mentioned before, we use a base system trained only on the supervised data to kick off semi-supervised training. For this system, we set the mini-batch size to 4, and each model is trained for up to a fixed number of epochs. We apply data augmentation as described in Sec. 2.2, which effectively yields a 3x larger supervised set due to speed perturbation. In Table 1, we give the PERs of the base system and of another system trained without augmentation. Observe that data augmentation provides a sizable gain over training on clean data only (16.83% vs. 18.52% dev PER), leading to higher pseudo-label quality. We always use data augmentation on supervised data from now on.
5.2 Continuing with self-training
Initialized from the base system, we now continue training with our semi-supervised objective (1). Each model update is computed with 8 supervised utterances and 32 unsupervised utterances (si284 is about 4 times the size of si84). By grid search, we set the dropout rate and the initial learning rate, the latter 5 times smaller than that used for training the initial base model; this has the effect of discouraging the model from deviating too much from the base model. Each model is trained for up to another fixed number of epochs. We first set the beam size to $B = 1$, which corresponds to the greedy decoder, for generating pseudo-labels on the fly. We train two sets of models, one with data augmentation on unsupervised utterances and one without; supervised utterances are augmented in both cases. The dev PERs for different values of the trade-off parameter $\lambda$ are given in Fig. 2, where $\lambda = 0$ corresponds to the base system. Our method performs well over a wide range of $\lambda$. The optimal $\lambda$ is similar in both settings, and performance does not degrade much for larger values, indicating that noise in the pseudo-labels is tolerated to a large degree. Furthermore, augmenting the unsupervised data greatly improves the final accuracy.
|System|dev PER (%)|test PER (%)|
|CTC w/o DataAug|18.52|13.54|
|CTC base system|16.83|11.98|
|Self-training, w/o DataAug on unsup|12.77||
|Self-training, one-shot pseudo-labels|13.68||
To show that pseudo-label generation and supervised training with pseudo-labels reinforce each other, we provide in Fig. 3 the learning curve of dev PER vs. epoch for the models trained with our method. The dev set accuracy improves steadily over time, with significant PER reductions in the first few epochs after the base model.
5.3 Effect of beam size
We now explore the effect of a larger beam size $B$, which intuitively should give higher pseudo-label quality. For this experiment, we fix the other hyperparameters to the values found with the greedy decoder. In Table 2, we give the dev PER as well as the training time for several increasing values of $B$. A learning curve for a larger beam size is plotted in Fig. 3. It turns out that with a larger $B$ we can slightly improve the final PER, at the cost of much longer training time (mostly spent in beam search). We therefore recommend using a small $B$ with a good base model.
|dev PER (%)|11.51|11.46|11.39|11.30|
|Time / Update|4.72|7.88|12.10|16.53|
(Columns correspond to increasing beam sizes $B$, with the leftmost being the greedy decoder.)
5.4 Comparison with UDA
We now show that hard labels are more useful than soft targets by comparing with UDA, which replaces the CTC loss on unsupervised data with a cross-entropy loss computed against posteriors from the previous model. We again use data augmentation on unsupervised data, with the posteriors interpolated in the same way as the inputs under speed perturbation. We tune the trade-off parameter by grid search; the best-performing model gives a dev PER of 14.56%, with its learning curve shown in Fig. 3.
5.5 Comparison with one-shot pseudo-labels
To further demonstrate the importance of fresh pseudo-labels, we compare with the more widely used approach in which pseudo-labels are generated once, on the entire unsupervised dataset, with the base model. We do so with a large decoding beam size, and then continue training from the base model with objective (1) without ever updating the pseudo-labels. This approach does clearly improve over the base system, with a dev PER of 13.68%, but not as much as our method. Its learning curve is shown in Fig. 3, and it plateaus more quickly than those of our method.
5.6 Results summary
|System|test WER (%)|
|Attention model (train on si84, unsup on si284 by ASR+TTS)|20.30|
|RNN-CTC, train on si84|13.50|
|Our CTC, train on si84, w/o DataAug|13.22|
|Our CTC base system, train on si84|11.43|
|Our self-training, unsup on si284|9.78|
|Our CTC (Oracle), train on si284|8.15|
|EESEN CTC, train on si284|7.87|
A result which uses the same data partition for semi-supervised learning with attention models is also included. To put our results in close context, we include a CTC model from prior work trained on si84 only. To obtain a performance upper bound for semi-supervised ASR, we trained a model on the full si284 partition with ground-truth transcriptions, yielding a test WER of 8.15%, close to the 7.87% of EESEN despite differences in the pipelines. Our method gives a relative 31.6% dev PER reduction (16.83% → 11.51%) and a relative 14.4% test WER reduction (11.43% → 9.78%) over a carefully trained base system with data augmentation, effectively reducing the performance gap between the base system (11.43%) and the oracle system (8.15%) by 50%.
6 Future directions
As future directions, we believe that word-level decoding, which incorporates a lexicon and a language model, can further improve the quality of pseudo-labels after converting the word sequences back to token sequences, at the cost of longer decoding time. Another promising model for our method is the RNN-transducer, which has a built-in RNN LM to model label dependencies and improve token-level decoding. Furthermore, for larger beam sizes one may consider the top few hypotheses and use all of them for computing the loss on unsupervised data [37, 23].
References
- (2015) TensorFlow: Large-scale machine learning on heterogeneous systems.
- (2019) Semi-supervised sequence-to-sequence ASR using unpaired speech and text. In Interspeech.
- (2019) Self-supervised sequence-to-sequence ASR using unpaired speech and text. arXiv:1905.01152 [eess.AS].
- (2016) Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP.
- Data techniques for online end-to-end speech recognition. In submission.
- (2015) Attention-based models for speech recognition. In NIPS.
- (2006) Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In ICML.
- (2014) Towards end-to-end speech recognition with recurrent neural networks. In ICML.
- (2012) Sequence transduction with recurrent neural networks. In ICML Workshop on Representation Learning.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2019) Cycle-consistency training for end-to-end speech recognition. In ICASSP.
- (2016) Semi-supervised training in deep learning acoustic model. In Interspeech.
- (2013) Semi-supervised GMM and DNN acoustic model training with multi-system combination and confidence re-calibration. In Interspeech.
- (2013) Vocal tract length perturbation (VTLP) improves speech recognition. In ICML.
- (2019) Self-training for end-to-end speech recognition. arXiv:1909.09116.
- (2018) Semi-supervised end-to-end speech recognition. In Interspeech.
- (1999) Unsupervised training of a speech recognizer: recent experiments. In Eurospeech.
- (2015) Adam: A method for stochastic optimization. In ICLR.
- (2015) Audio augmentation for speech recognition. In Interspeech.
- (2019) Adversarial training of end-to-end speech recognition using a criticizing language model. In ICASSP.
- (2015) EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In ASRU.
- (2019) SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv:1904.08779 [eess.AS].
- (2019) Lessons from building acoustic models with a million hours of speech. In ICASSP.
- (2011) The Kaldi speech recognition toolkit. In ASRU.
- (2019) Almost unsupervised text to speech and automatic speech recognition. In ICML.
- (2016) Improving neural machine translation models with monolingual data. In ACL.
- (2014) Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
- (2018) An investigation of a knowledge distillation method for CTC acoustic models. In ICASSP.
- (2013) Deep neural network features and semi-supervised training for low resource speech recognition. In ICASSP.
- (2017) Listening while speaking: Speech chain by deep learning. In ASRU.
- (2018) Machine speech chain with one-shot speaker adaptation. In Interspeech.
- (2019) Unsupervised data augmentation for consistency training. arXiv:1904.12848.
- Improved regularization techniques for end-to-end speech recognition. arXiv:1712.07108 [cs.CL].
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.