Automatic Speech Recognition (ASR) is one of the core tasks in speech processing, which aims to generate transcripts from speech utterances. Recently, end-to-end ASR models have been extensively studied, as most of such models do not require the explicit learning of acoustic and language models [1, 2, 3, 4].
Despite their success, a potential drawback of end-to-end ASR models is that they often require large amounts of transcribed data to produce satisfactory results. Unfortunately, having human annotators transcribe audio data is both time-consuming and expensive, which limits the further improvement of such models. Recently, unsupervised pre-training has been applied to the ASR task [6, 7], which uses unlabeled audio data to pre-train ASR models. However, it is still difficult for these models to outperform semi-supervised or fully supervised approaches [6, 7]. A few methods generate synthetic speech data [8, 9], but the generated audio may still differ from audio recorded in real life.
Facing such conditions, a natural question arises: is it possible to build accurate end-to-end ASR systems without much labeled training data? Here, we introduce a weakly supervised framework to construct ASR systems from massive video data. The high-level architecture, shown in Figure 1, consists of two major stages: Weakly Supervised Pre-training (WSP) and Domain-specific Fine-tuning (DF). During WSP, based on Optical Character Recognition (OCR), we extract human-speech audio aligned with subtitles from videos as knowledge sources to pre-train ASR models. In our work, we pre-train our model over videos of varied topics so that the pre-trained model is able to acquire transferable and general knowledge of ASR across domains. After that, the underlying ASR model can be fine-tuned to fit the training data of any domain, which is usually smaller in size. This framework is highly general in that it can be applied to arbitrary languages and end-to-end ASR models. In the experiments, we evaluate our framework over several popular ASR model architectures and public datasets. The results show that it can consistently produce state-of-the-art results for Mandarin speech recognition.
2 Related Work
In this section, we summarize related work on end-to-end ASR models and pre-training techniques for ASR.
End-to-end ASR. While hybrid ASR techniques continue to develop (such as classical DNN-HMM-style models), end-to-end ASR models have gained much attention due to their simple model pipelines. Recurrent-style networks are naturally suitable for the end-to-end ASR task as they model the sequences of audio and language [12, 1, 2, 13]; however, they may be slow during training and inference, which limits their applicability in industry. CNN-based approaches (e.g., wav2letter) are much faster, but they have limited capacity in language modeling for long sequences. Several studies show that transformer-based methods [15, 16, 17, 18, 19] perform better because they have a strong language modeling ability to capture long-term dependencies. Transformer-based models also converge much faster and produce more accurate results when the CTC (Connectionist Temporal Classification) loss is added as an auxiliary loss. Because the model architecture design is not our major focus, we do not elaborate further.
Pre-training ASR Models. End-to-end ASR models require a large amount of training data to achieve desirable performance. Two streams of works have been proposed to reduce the requirements of manually labeled speech recognition data.
One stream of works applies unsupervised/semi-supervised methods to tackle the problem. For example, Long et al. propose an improved self-training approach for semi-supervised training of DNN- and RNN-based acoustic models. Karita et al. jointly learn ASR and TTS models for semi-supervised training of ASR models. Chung et al. introduce Auto-regressive Predictive Coding (APC) for unsupervised speech representation learning. Inspired by BERT for natural language processing, Baevski et al. and Schneider et al. propose masked predictive coding for unsupervised pre-training of transformer encoders. Although semi-supervised and unsupervised methods can improve the performance, there is still a gap from supervised model training with labeled data [6, 7].
The other stream focuses on extracting aligned text-speech segments using existing ASR models. To name a few, Lanchantin et al. align paragraphs of transcripts with audio to generate training data. Liao et al. introduce the “island of confidence” filtering heuristic to extract useful speech segments with transcripts from YouTube videos. Lakomkin et al. propose a set of filtering rules to construct speech datasets from YouTube videos and auto-synced captions. However, these methods generally require a well-performing ASR model to bootstrap. Moreover, the extracted speech segments are generally at the phrase level, with long-range contextual information missing. Our work does not rely on accurate ASR models and can generate high-quality utterance-text pairs at the sentence level.
3 The Proposed Framework
In this section, we first introduce the technical details of the two stages of our framework, i.e., WSP and DF. After that, we describe the ASR model architectures that we use in this work.
3.1 Weakly Supervised Pre-training
We first present the pipeline of WSP, illustrated in Figure 2.
Video Acquisition. As many videos have embedded subtitles that are almost synchronous with the audio, we regard such videos as pre-training knowledge sources. The videos that we use in this work are provided by Youku. (Footnote 2: Youku (http://www.youku.com) is a popular video hosting service, a subsidiary of Alibaba Group. It holds the copyrights of these videos, and permits the authors to obtain and process the data as described.)
Text and Audio Spotting. Although videos with subtitles are made available to us, subtitles are generally embedded in frame images in different styles and formats, especially in videos made in early years. This prevents us from extracting subtitles from raw data sources directly. Hence, we first extract frame images from each video at an interval of 1/3 second. Next, we employ the IncepText model to detect text positions in the images and the OCR model to recognize the text contents. We require that, across nearby frames, only the frame with the maximal amount of text is selected.
Given a sequence of frame images within a time window (denoted as $f_1, f_2, \ldots, f_N$), we wish to determine whether two consecutive frames $f_i$ and $f_j$ ($1 \le i < N$, $j = i + 1$) can be “merged” so that a subset of such frames corresponds to the same subtitle. The audio within the merged time span can then be extracted and treated as the speech for the subtitle.
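We present two merging methods: Heuristics-based and Model-based. For two consecutive frames $f_i$ and $f_j$, denote the detected texts as $t_i$ and $t_j$, respectively. Define the Relative Edit Distance (RED) between $t_i$ and $t_j$ as:
$$\mathrm{RED}(t_i, t_j) = \frac{\mathrm{ED}(t_i, t_j)}{|t_i|},$$
where $\mathrm{ED}(t_i, t_j)$ is the edit distance between $t_i$ and $t_j$, and $|t_i|$ is the length of $t_i$. Heuristics-based Merging combines two frames $f_i$ and $f_j$ if $\mathrm{RED}(t_i, t_j)$ is smaller than a tuned threshold.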
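Heuristics-based Merging can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of the first text's length as the normalizer, the use of the first frame's text as the group's subtitle, and the threshold value of 0.3 are all assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def relative_edit_distance(t_i: str, t_j: str) -> float:
    """Edit distance normalized by the length of the first text (assumed normalizer)."""
    return edit_distance(t_i, t_j) / max(len(t_i), 1)  # guard against empty text


def merge_frames(texts, threshold=0.3):
    """Group consecutive frame texts whose RED stays below the threshold.

    Returns (first_index, last_index, text) tuples with inclusive indices;
    the text of the first frame represents the group (an assumption).
    The threshold of 0.3 is a hypothetical value, not one from the paper.
    """
    groups = []
    start = 0
    for k in range(1, len(texts) + 1):
        at_end = k == len(texts)
        if at_end or relative_edit_distance(texts[k - 1], texts[k]) >= threshold:
            groups.append((start, k - 1, texts[start]))
            start = k
    return groups
```

For instance, `merge_frames(["hello world", "hello worle", "goodbye"])` groups the first two frames, whose OCR outputs differ by one character, and keeps the third frame separate.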
Despite its simple implementation, Heuristics-based Merging ignores the relations between audio and text. If any existing third-party ASR model is available, regardless of whether it is accurate enough, we can use it to refine the merging process. (Footnote 3: We use the model from https://ai.aliyun.com/nls/asr. Its CER is slightly larger than 20% according to its description.) Let $a_i$ be the audio segment w.r.t. the frame $f_i$. Model-based Merging employs an existing ASR model $M$ to predict the transcript of $a_i$, denoted as $M(a_i)$. If $f_i$ and $f_j$ should not be merged, the error rate of model $M$, denoted as $E_{\mathrm{sep}}$, is computed from the Character Error Rate (CER) of $M$'s predictions on $a_i$ and $a_j$ taken separately. If $f_i$ and $f_j$ should be merged, we similarly have the combined error rate $E_{\mathrm{merge}}$, computed from the CER of $M$'s prediction on $a_i \oplus a_j$, the concatenation of the audio segments $a_i$ and $a_j$. Based on the two error rates, we determine that the segments $a_i$ and $a_j$ should be separated if $E_{\mathrm{sep}} < E_{\mathrm{merge}}$, and merged otherwise.
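The decision rule of Model-based Merging can be sketched as follows. This is only an illustration under stated assumptions, since the exact error-rate formulas are not recoverable here: the separate error is taken as the mean CER of the two segments against their own OCR texts, the merged error is the CER of the prediction on the concatenated audio against the longer of the two OCR texts, and `transcribe` is a hypothetical callable standing in for the third-party ASR model.

```python
from typing import Callable, Sequence


def cer(hyp: str, ref: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (h != r)))   # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def should_merge(transcribe: Callable[[list], str],
                 a_i: Sequence, a_j: Sequence,
                 t_i: str, t_j: str) -> bool:
    """Return True if the two frames should be merged into one subtitle."""
    # Error when the frames are kept separate: mean CER of both segments (assumption).
    err_sep = 0.5 * (cer(transcribe(list(a_i)), t_i) + cer(transcribe(list(a_j)), t_j))
    # Error when merged: one subtitle spans both frames, so the longer OCR text
    # serves as the reference for the concatenated audio (assumption).
    merged_ref = t_i if len(t_i) >= len(t_j) else t_j
    err_merge = cer(transcribe(list(a_i) + list(a_j)), merged_ref)
    # Separate only if the separate hypothesis has the strictly lower error.
    return not (err_sep < err_merge)
```

Because the rule only compares two error rates produced by the same model, a systematically inaccurate ASR model can still yield a useful decision, which matches the paper's remark that the third-party model need not be accurate.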
Iterative Pre-training. After the extraction and merging steps, we obtain a large “pseudo-labeled” dataset $\mathcal{D}$, consisting of audio-transcript segment pairs. Because supervised ASR model learning produces better results than unsupervised approaches, we simply pre-train the ASR model over $\mathcal{D}$ in the same way as normal supervised training. However, the extraction process of $\mathcal{D}$ unavoidably injects noise into the dataset due to the lack of human annotation. To alleviate this problem, we apply a self-training strategy to filter out noisy data during pre-training. During each epoch of ASR model training, we filter out the audio-transcript segments from $\mathcal{D}$ that are most likely to have noisy transcripts and use the remaining dataset for the next training epoch. Due to space limitations, we omit the algorithmic details here and refer interested readers to the original paper.
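Since the algorithmic details are omitted, the following Python sketch only illustrates the general shape of such an epoch-wise filtering loop. It assumes that noisiness is scored by the per-sample training loss and that a fixed fraction (the drop ratio) is removed after every epoch; `train_one_epoch` is a hypothetical callback, not part of any real library.

```python
def filter_noisy(samples, losses, drop_ratio):
    """Drop the `drop_ratio` fraction of samples with the highest loss.

    The original order of the kept samples is preserved.
    """
    if not 0.0 <= drop_ratio < 1.0:
        raise ValueError("drop_ratio must be in [0, 1)")
    n_drop = int(len(samples) * drop_ratio)
    # Indices sorted from lowest to highest loss.
    order = sorted(range(len(samples)), key=lambda k: losses[k])
    keep = sorted(order[: len(samples) - n_drop])
    return [samples[k] for k in keep]


def iterative_pretrain(samples, train_one_epoch, num_epochs, drop_ratio):
    """Train for several epochs, shrinking the dataset after each one.

    `train_one_epoch(data)` is a hypothetical callback that trains the model
    on `data` and returns one loss value per sample, in the same order.
    """
    data = list(samples)
    for _ in range(num_epochs):
        losses = train_one_epoch(data)
        data = filter_noisy(data, losses, drop_ratio)
    return data
```

Using the per-sample loss as the noise score is one common choice for noisy-label detection; other scores (for example, the CER between the model's prediction and the pseudo-label) would fit the same loop.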
3.2 Domain-specific Fine-tuning
Based on the pre-training objective, our framework could generate ready-to-use ASR models directly. However, we find that most videos we use are dramas of various topics. The domains of such data may be significantly different from downstream ASR tasks. Hence, given a (small) training set $\mathcal{D}_d$ of domain $d$, we fine-tune the pre-trained model over $\mathcal{D}_d$ to learn domain-adaptive parameter values. One can also leverage transfer learning techniques during fine-tuning using both the pre-training dataset $\mathcal{D}$ and $\mathcal{D}_d$, which is left as future work.
3.3 Choices of Model Architectures
Following common industry practices, we consider two popular end-to-end ASR models: wav2letter and Speech Transformer. Detailed architecture designs are shown in Figure 3. The wav2letter model uses one-dimensional convolutional networks with large kernels as feature encoders, and the CTC loss for training. Although the structure seems simple, its fast inference makes it appealing to various industrial applications. For inference, we employ a 5-gram language model and a beam size of 64 to improve the model accuracy.
Speech Transformer adopts self-attention for acoustic modelling and decoding. Following prior work, the CTC loss is added as an auxiliary loss to achieve faster convergence and better performance. In multi-head attention layers, we set the hidden size to 512 and the number of heads to 8. To improve the robustness of the model, inspired by self-ensemble techniques, we ensemble the last 10 training checkpoints as our final model collection. For inference, we apply beam search of size 16 to all models in parallel, in order to generate the most probable texts.
4 Experiments
In this section, we conduct extensive experiments to evaluate the proposed framework in various aspects.
4.1 Datasets and Experimental Settings
For WSP, we obtain 940 drama series under 16 broad categories from Youku, containing 43,694 video clips. The total duration of these videos is around 8,000 hours. During pre-training, the learning rates of wav2letter and Speech Transformer are set to 0.05 and 1.0, respectively. For both models, we resample the utterances to 16 kHz and generate 80-dimensional log FBank features, with a window size of 20 ms and a stride of 10 ms. We use our in-house IncepText and OCR models for text spotting.
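The feature settings above translate into concrete frame counts. The short Python sketch below works through the arithmetic; it assumes that only windows fully contained in the signal are kept (no padding), which is one common convention and may differ from the toolkit actually used.

```python
SAMPLE_RATE = 16_000   # Hz, as stated in the text
WINDOW_MS = 20         # analysis window length
STRIDE_MS = 10         # hop between consecutive windows
NUM_BINS = 80          # FBank feature dimension


def frame_params(sample_rate: int = SAMPLE_RATE):
    """Window and stride expressed in samples."""
    window = sample_rate * WINDOW_MS // 1000
    stride = sample_rate * STRIDE_MS // 1000
    return window, stride


def num_frames(num_samples: int, window: int, stride: int) -> int:
    """Number of full windows that fit into the signal (no padding assumed)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


def feature_shape(duration_s: float):
    """Shape (frames, bins) of the log FBank matrix for an utterance."""
    window, stride = frame_params()
    samples = round(duration_s * SAMPLE_RATE)
    return num_frames(samples, window, stride), NUM_BINS
```

With these settings, a window covers 320 samples and the stride is 160 samples, so a 15-second utterance yields a 1499 x 80 feature matrix under the no-padding assumption.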
After WSP, we fine-tune and evaluate our models over six public datasets: STCMDS (http://www.openslr.org/38/), AISHELL-1 (http://www.aishelltech.com/kysjcp/), AISHELL-2 (http://www.aishelltech.com/aishell2/), AIDATANG (http://www.openslr.org/62/), MagicData (http://www.openslr.org/68/) and HKUST (https://catalog.ldc.upenn.edu/LDC2005S15/). The statistics are displayed in Table 1. We can see that the datasets vary in domain and style and are relatively short in duration compared to our pre-training dataset. During fine-tuning, we set the learning rates to 0.01 and 0.5 for the two models, respectively. We keep the default training, development and testing splits of all the datasets. All the algorithms are implemented in TensorFlow and run on GPU servers.
| Model (CER, %) | STCMDS | AISHELL-1 | AISHELL-2 | AIDATANG | MagicData | HKUST |
| wav2letter (w/o. WSP) | 4.5 | 11.7 | 12.5 | 12.9 | 7.4 | 35.7 |
| wav2letter (w. WSP) | 2.4 | 7.1 | 10.0 | 9.2 | 6.7 | 29.3 |
| Speech Transformer (w/o. WSP) | 4.4 | 6.7 | 7.4 | 7.8 | 3.6 | 23.5 |
| Speech Transformer (w. WSP) | 2.1 | 5.9 | 5.9 | 4.9 | 3.3 | 20.0 |
4.2 General Performance Comparison
We report the general performance of our models on all the test sets. For baselines, we consider both classical ASR models and recent end-to-end approaches, including TDNN, Chain-Model, MS-Attn, SpeechBERT and SAN-M. For wav2letter and Speech Transformer, we test the model performance under both settings, i) w. WSP and ii) w/o. WSP, based on our own implementations. The results are summarized in Table 2. We have the following findings: i) Speech Transformer consistently outperforms wav2letter across all the datasets. (Footnote 10: Despite its relatively high error rate, the wav2letter model still has wide applications in industry due to its simple architecture and fast inference speed. The applications are beyond the scope of this paper.) ii) The WSP technique effectively boosts the performance of both models on all the datasets. This phenomenon is more significant on small datasets (i.e., AIDATANG and HKUST). iii) The Speech Transformer model w. WSP achieves state-of-the-art performance on all six public datasets, which demonstrates the value of our framework.
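To make the effect of WSP easier to read off, the relative CER reduction can be computed from the four table rows reported above. The Python sketch below does only that arithmetic; the values are copied from the table in column order, and the helper name `relative_reduction` is introduced here for illustration.

```python
# CER (%) values copied from the results table, in the table's column order.
CER_WITHOUT_WSP = {
    "wav2letter": [4.5, 11.7, 12.5, 12.9, 7.4, 35.7],
    "Speech Transformer": [4.4, 6.7, 7.4, 7.8, 3.6, 23.5],
}
CER_WITH_WSP = {
    "wav2letter": [2.4, 7.1, 10.0, 9.2, 6.7, 29.3],
    "Speech Transformer": [2.1, 5.9, 5.9, 4.9, 3.3, 20.0],
}


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction in percent: 100 * (before - after) / before."""
    return 100.0 * (before - after) / before


def reductions_per_model():
    """Relative CER reduction for every model and table column."""
    return {
        model: [relative_reduction(b, a)
                for b, a in zip(CER_WITHOUT_WSP[model], CER_WITH_WSP[model])]
        for model in CER_WITHOUT_WSP
    }


if __name__ == "__main__":
    for model, values in reductions_per_model().items():
        print(model, [round(v, 1) for v in values])
```

Every entry is positive, i.e., WSP lowers the CER for both models in every column; the size of the gain varies considerably from column to column.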
4.3 Detailed Model Analysis
Analysis of WSP. To create the pre-training dataset, we test both merging techniques via a manual check on 0.2% of the generated text-speech pairs. We observe that model-based merging produces better results. The CER is around 6%, close to that of manually labeled datasets. For example, the CERs of human-labeled data in AISHELL-1 and MagicData are close to 5% and 2%, respectively. This shows that, even without human annotation, we are able to generate pre-training datasets with tolerable error rates. After text and audio spotting, we obtain a total of 1,825,927 utterances from all video clips, with durations ranging from 15 to 20 s.
Next, we evaluate the effectiveness of the iterative pre-training technique. In this step, we filter out part of the data (quantified by the drop ratio) and take the rest as the pre-training data for the next iteration. We search for the best value of the drop ratio and also compare our method with a classical data filtering approach (i.e., Liao et al.). For the implementation of Liao et al., we use the third-party Mandarin ASR model from https://ai.aliyun.com/nls/asr instead of their original English ASR model. As an example, Table 3 displays the CER values produced by the pre-trained wav2letter model without fine-tuning, evaluated on the AISHELL-1 development set. It shows that WSP, at its best drop ratio, has the best performance.
| Liao et al. | 17.3 | 16.8 | 16.5 |
Convergence analysis. Next, we investigate how WSP affects the DF performance. The convergence curves of the two models on the dataset HKUST are shown in Figure 5. As seen, wav2letter and Speech Transformer converge within 10 and 3 training epochs, respectively. Compared to the same models without the WSP step, the speed of convergence is much faster for both models, which clearly indicates WSP is able to find better parameter initialization for domain-specific ASR tasks, no matter whether there exist domain differences between the pre-training dataset and public datasets.
Error analysis and case studies. We further present an error analysis for a deeper understanding of WSP. We investigate the percentages of different types of errors occurring in the test sets of AISHELL-1 and HKUST, with results shown in Table 4. The underlying ASR models are the Speech Transformer with and without WSP. As seen, the majority of the errors are substitution errors caused by homophones. The WSP technique helps to reduce such errors because the pre-training dataset is much larger than public Mandarin datasets: its pronunciations and language contexts are more diverse, leading to better generalization of the trained ASR models. Two typical cases can also be found in Figure 4, with Chinese pronunciation (spelled in Mandarin phonetic symbols) and English translation provided. They show WSP’s ability to distinguish words with similar pronunciations.
5 Conclusion and Future Work
In this paper, we present a complete workflow to construct accurate ASR systems based on weak supervision of massive video data. Experiments confirm the effectiveness of the proposed approach. With WSP and our designed Speech Transformer model, we achieve the state-of-the-art results on several datasets. Future work includes i) applying our approach to other languages and ASR models; ii) combining unsupervised and weakly supervised pre-training in our framework; and iii) leveraging transfer learning to improve the fine-tuning process.
[1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos, E. Elsen, J. H. Engel, L. Fan, C. Fougner, A. Y. Hannun, B. Jun, T. Han, P. LeGresley, X. Li, L. Lin, S. Narang, A. Y. Ng, S. Ozair, R. Prenger, S. Qian, J. Raiman, S. Satheesh, D. Seetapun, S. Sengupta, C. Wang, Y. Wang, Z. Wang, B. Xiao, Y. Xie, D. Yogatama, J. Zhan, and Z. Zhu, “Deep speech 2: End-to-end speech recognition in english and mandarin,” in ICML, vol. 48, 2016, pp. 173–182.
[2] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016, pp. 4960–4964.
[3] G. Saon, Z. Tüske, K. Audhkhasi, and B. Kingsbury, “Sequence noise injected training for end-to-end speech recognition,” in ICASSP, 2019, pp. 6261–6265.
[4] X. Chang, W. Zhang, Y. Qian, J. L. Roux, and S. Watanabe, “End-to-end multi-speaker speech recognition with transformer,” in ICASSP, 2020, pp. 6134–6138.
[5] J. Hsu, Y. Chen, and H. Lee, “Meta learning for end-to-end low-resource speech recognition,” in ICASSP, 2020, pp. 7844–7848.
[6] S. Schneider, A. Baevski, R. Collobert, and M. Auli, “wav2vec: Unsupervised pre-training for speech recognition,” in Interspeech, 2019, pp. 3465–3469.
[7] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” arXiv preprint arXiv:2006.11477, 2020.
[8] Y. Chen, Z. Yang, C. Yeh, M. Jain, and M. L. Seltzer, “Aipnet: Generative adversarial pre-training of accent-invariant networks for end-to-end speech recognition,” in ICASSP, 2020, pp. 6979–6983.
[9] N. Rossenbach, A. Zeyer, R. Schlüter, and H. Ney, “Generating synthetic audio data for attention-based speech recognition systems,” in ICASSP, 2020, pp. 7069–7073.
[10] C. Lee and S. Osindero, “Recursive recurrent nets with attention modeling for OCR in the wild,” in CVPR, 2016, pp. 2231–2239.
[11] T. Tanaka, R. Masumura, T. Moriya, T. Oba, and Y. Aono, “A joint end-to-end and DNN-HMM hybrid automatic speech recognition system with transferring sharable knowledge,” in Interspeech, G. Kubin and Z. Kacic, Eds., 2019, pp. 2210–2214.
[12] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Interspeech, 2015, pp. 3214–3218.
[13] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016, pp. 2751–2755.
[14] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
[15] Y. Zhao, J. Li, X. Wang, and Y. Li, “The speechtransformer for large-scale mandarin chinese speech recognition,” in ICASSP, 2019, pp. 7095–7099.
[16] N. Li, Y. Liu, Y. Wu, S. Liu, S. Zhao, and M. Liu, “Robutrans: A robust transformer-based text-to-speech model,” in AAAI, 2020, pp. 8228–8235.
[17] N. Moritz, T. Hori, and J. L. Roux, “Streaming automatic speech recognition with the transformer model,” in ICASSP, 2020, pp. 6074–6078.
[18] K. J. Han, J. Huang, Y. Tang, X. He, and B. Zhou, “Multi-stride self-attention for speech recognition,” in Interspeech, 2019, pp. 2788–2792.
[19] Z. Gao, S. Zhang, M. Lei, and I. McLoughlin, “SAN-M: memory equipped self-attention for end-to-end speech recognition,” arXiv preprint arXiv:2006.01713, 2020.
[20] H. Miao, G. Cheng, C. Gao, P. Zhang, and Y. Yan, “Transformer-based online ctc/attention end-to-end speech recognition architecture,” in ICASSP, 2020, pp. 6084–6088.
[21] Y. Long, Y. Li, S. Wei, Q. Zhang, and C. Yang, “Large-scale semi-supervised training in deep learning acoustic model for ASR,” IEEE Access, vol. 7, pp. 133615–133627, 2019.
[22] S. Karita, S. Watanabe, T. Iwata, M. Delcroix, A. Ogawa, and T. Nakatani, “Semi-supervised end-to-end speech recognition using text-to-speech and autoencoders,” in ICASSP, 2019, pp. 6166–6170.
[23] Y. Chung, W. Hsu, H. Tang, and J. R. Glass, “An unsupervised autoregressive model for speech representation learning,” in Interspeech, 2019, pp. 146–150.
[24] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, 2019, pp. 4171–4186.
[25] P. Lanchantin, M. J. F. Gales, P. Karanasou, X. Liu, Y. Qian, L. Wang, P. C. Woodland, and C. Zhang, “Selection of multi-genre broadcast data for the training of automatic speech recognition systems,” in Interspeech, 2016, pp. 3057–3061.
[26] H. Liao, E. McDermott, and A. W. Senior, “Large scale deep neural network acoustic modeling with semi-supervised training data for youtube video transcription,” in ASRU, 2013, pp. 368–373.
[27] E. Lakomkin, S. Magg, C. Weber, and S. Wermter, “Kt-speech-crawler: Automatic dataset construction for speech recognition from youtube videos,” in EMNLP, 2018, pp. 90–95.
[28] Q. Yang, M. Cheng, W. Zhou, Y. Chen, M. Qiu, and W. Lin, “Inceptext: A new inception-text module with deformable PSROI pooling for multi-oriented scene text detection,” in IJCAI, 2018, pp. 1071–1077.
[29] J. Huang, L. Qu, R. Jia, and B. Zhao, “O2u-net: A simple noisy label detection approach for deep neural networks,” in ICCV, 2019, pp. 3325–3333.
[30] S. Zhang, C. Do, R. Doddipatla, and S. Renals, “Learning noise invariant features through transfer learning for robust end-to-end speech recognition,” in ICASSP, 2020, pp. 7024–7028.
[31] Y. Xu, X. Qiu, L. Zhou, and X. Huang, “Improving BERT fine-tuning via self-ensemble and self-distillation,” arXiv preprint arXiv:2002.10345, 2020.
[32] Y. Chuang, C. Liu, and H. Lee, “Speechbert: Cross-modal pre-trained language model for end-to-end spoken question answering,” arXiv preprint arXiv:1910.11559, 2019.