End-to-end models are profoundly affecting the development of automatic speech recognition (ASR) and have demonstrated their advantages on a wide range of ASR tasks. Since end-to-end ASR models directly transform the speech sequence (usually in the form of feature frames) into the text label sequence with neural networks, one end (the text end or the speech end) must drive the recognition process of the entire model. According to the driving end, current end-to-end models can be categorized into two types: label-synchronous models and frame-synchronous models.
Label-synchronous models are the models driven by the text end: they process one label at each step and stop after the end-of-sentence label is recognized. The representative models of this type were originally proposed for neural machine translation (NMT) [1, 2] and rely on the attention mechanism to extract relevant speech information from the encoded frames, such as the attention-based models [3, 4, 5, 6, 7, 8] and the transformer [9, 10, 11]. Compared with frame-synchronous models, they have two main advantages: 1) since they usually attend to a large range of encoded frames, they can utilize the speech information more comprehensively; 2) since they are label-synchronous, they can jointly extract the relevant information for each label from multiple channels or sources [13, 14]. Since the transformer has recently shown superior performance in comparison with the RNN attention-based model, we focus on the transformer in this work.
Frame-synchronous models are the models driven by the speech end: they process one frame at each step and stop after the last frame is processed. Compared with label-synchronous models, they have two main advantages: 1) since they process in a frame-by-frame manner, they naturally support online speech recognition; 2) since they are aware of the frame position, they can provide time stamps for the recognition result. The representative models of this type can be categorized into two kinds: 1) models that use a hard alignment, including the CTC and CTC-like [17, 18] models; 2) models that use a soft alignment, which first locate the acoustic boundary through frame-by-frame detection and then integrate the acoustic information from the located speech piece in the form of a weighted sum (e.g. [19, 20, 21]). Compared with the hard ones, the soft frame-synchronous models exclude the influence of the blank label, and thus have less decoding computation and a simpler search process. In this work, we focus on the soft frame-synchronous models and experiment with the continuous integrate-and-fire (CIF) based model, which is calculated in a concise manner and uses the same self-attention network (SAN) as the transformer.
The concepts of label-synchronous and frame-synchronous first appear in , where the authors compare the attention-based model with the CTC and RNN-T models in detail. Our work differs in two respects: 1) we argue that label-synchronous and frame-synchronous should not only describe the decoding manner, but can also describe end-to-end models with different behaviours and characteristics. Distinguishing the two types of models not only helps clarify their respective advantages and disadvantages, but also provides guidance for applying the suitable model in specific ASR scenarios; 2) we focus on the comparison of a label-synchronous model (transformer) with a soft frame-synchronous model (the CIF-based model), rather than the hard frame-synchronous models in . Specifically, we make a detailed comparison of the two models on multiple datasets, including three public datasets and a large-scale dataset with 12000 hours of training data. We find that the two models show their respective advantages on accuracy, speed and generalization. The details will be introduced in the following sections.
In this section, we describe the label-synchronous model (transformer) and the frame-synchronous model (the continuous integrate-and-fire (CIF) based model) compared in this work. Both of them follow the encoder-decoder framework, where the encoder transforms the speech feature sequence into the encoded acoustic representations, and the decoder receives the previous label and the relevant acoustic information to predict the current label. The encoder and the decoder are connected by the alignment mechanism, which determines how the relevant acoustic information used for decoding is extracted. Next, we introduce the two models with reference to the diagram of the alignment mechanism shown in figure 2.
The transformer model used in this work has a similar structure to [9, 10], but with two differences: 1) it abandons the sinusoidal positional encoding in [9, 10] and uses the proximity bias in the self-attention network (SAN) to provide relative positional information (since this performs better and more stably in our experiments); 2) it uses the encoder structure in , which consists of a convolutional front-end and a pyramid of SANs.
The running of the transformer is driven by the text end, i.e. the decoder end. At each decoder step, the previous label is input to a stack of identical decoder blocks, each of which is constructed by inserting a multi-head encoder-decoder attention between the two sub-networks (multi-head self-attention and position-wise feed-forward network) of the SAN. Each multi-head encoder-decoder attention receives the encoded representations as the key and value and the outputs of the multi-head self-attention (in the same decoder block) as the query, and then conducts multi-head attention to extract the relevant acoustic information. As shown in figure 2 (a), the transformer runs label by label and stops after the end-of-sentence label is predicted. Assuming the predicted label sequence has length U and the encoded representations have length T, the computational complexity of the alignment mechanism in the transformer is O(U·T), since every decoder step attends to all encoded frames.
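The extraction of acoustic information described above can be sketched as a single-head scaled dot-product attention in numpy. This is a minimal illustration under our own naming (the real model uses multi-head attention with learned projections inside each decoder block); it makes the O(U·T) cost visible as the (U, T) score matrix:

```python
import numpy as np

def encoder_decoder_attention(queries, keys, values):
    """Single-head scaled dot-product attention over the encoded frames.

    queries: (U, d) decoder-side states; keys, values: (T, d) encoded frames.
    Each of the U decoder steps scores all T encoder steps, hence the
    O(U * T) cost of the alignment mechanism.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)          # (U, T) alignment scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over encoder steps
    return weights @ values                         # (U, d) extracted acoustics
```

Since each output row is a convex combination of the value rows, the extracted acoustic vectors always stay inside the range spanned by the encoded representations.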
2.2 CIF-based model
The CIF-based model used in this work follows the model structure in . Its encoder first generates an encoded representation at each encoder step. Then, it predicts a weight for each encoded representation by passing a window centered on it to a 1-dimensional convolutional layer followed by a fully connected layer with one output unit and a sigmoid activation. Next, it passes the weight and the representation to CIF, which accumulates the weights forward along time and integrates the acoustic representations (as a weighted sum) until the accumulated weight reaches a threshold (1.0), which means an acoustic boundary is located. At this point, CIF divides the current weight into two parts: one part fulfills the integration of the current label by building a complete distribution (whose weights sum to 1.0) over the relevant encoder steps, and the other part is carried over to the integration of the next label. After that, CIF fires the integrated acoustic embedding (as well as the context vector) to the decoder to predict the corresponding label.
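The integrate-and-fire procedure above can be sketched as follows. This is a minimal numpy sketch under our own naming: the real model operates on batched tensors and also handles the residual accumulation at the end of the sequence, which this sketch simply drops:

```python
import numpy as np

def cif(encoded, weights, threshold=1.0):
    """Minimal continuous integrate-and-fire (CIF) sketch.

    encoded: (T, d) encoder outputs; weights: (T,) per-frame weights in [0, 1].
    Returns the list of integrated acoustic embeddings fired at each boundary.
    """
    fired = []
    accum_w = 0.0                          # accumulated weight
    accum_h = np.zeros(encoded.shape[1])   # running weighted sum of frames
    for h, w in zip(encoded, weights):
        if accum_w + w < threshold:
            # no boundary yet: keep integrating this frame
            accum_w += w
            accum_h += w * h
        else:
            # boundary located: split the current weight into two parts
            w_cur = threshold - accum_w         # completes the current label
            fired.append(accum_h + w_cur * h)   # fire the integrated embedding
            accum_w = w - w_cur                 # remainder starts the next label
            accum_h = accum_w * h
    # note: any residual accumulation after the last frame is not fired here
    return fired
```

With weights that sum to (roughly) the number of labels, each fired embedding is a weighted sum whose weights add up to exactly the threshold 1.0, matching the "complete distribution" described above.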
The running of the CIF-based model is driven by the speech end, i.e. the encoder end. At each encoder step, the representation and the weight of the encoded frame are input to the CIF module, which determines whether enough acoustic information has been integrated to trigger the calculation of the decoder. As shown in figure 2 (b), the CIF-based model runs frame by frame and stops after the last frame is processed. Assuming the encoded representations have length T, the computational complexity of the CIF alignment mechanism is O(T), i.e. linear in the number of frames.
2.3 Model details
The two models use the same encoder structure, which reduces the frame rate to 1/8. The CIF-based model uses an autoregressive decoder based on the SAN. The decoders of both models cache the computed SAN states of all heads for efficient inference. In training, both models use multi-task learning: the CIF-based model is trained with the cross-entropy loss, a CTC loss on the encoder and a quantity loss on the CIF (each weighted by a coefficient), and the transformer is trained with the cross-entropy loss and a CTC loss on the encoder. In inference, we first perform beam search to get the decoded hypotheses, then use the score of a SAN-based language model (LM), weighted by a coefficient, to rescore the hypotheses predicted by the two models.
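The multi-task objective and the LM rescoring above can be sketched as two small functions. The names lambda_ctc, lambda_qua and lm_weight are ours; the default coefficient values are the ones reported in the experimental setup for the Mandarin datasets:

```python
def cif_training_loss(ce_loss, ctc_loss, quantity_loss,
                      lambda_ctc=0.5, lambda_qua=1.0):
    """Combine the three CIF training objectives: cross-entropy, a CTC loss
    on the encoder and a quantity loss on the CIF, each with a coefficient.
    (For the transformer, the quantity term is simply absent.)"""
    return ce_loss + lambda_ctc * ctc_loss + lambda_qua * quantity_loss

def rescore(asr_score, lm_score, lm_weight):
    """Rescore a beam-search hypothesis by adding the LM score, scaled by
    the rescoring coefficient, to the ASR score."""
    return asr_score + lm_weight * lm_score
```

In practice the hypothesis with the highest rescored value among the beam candidates is selected as the final output.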
3 Experimental Setup
Table 1 test sets: test_clean, test_other (Librispeech); test_ios, test_android, test_mic (AISHELL-2); test_set 1, test_set 2, test_set 3 (CN12000h).
Table 1: Accuracy results for the models compared in this work. The evaluation metric is word error rate (WER) for Librispeech and character error rate (CER) for the three Mandarin datasets; the same metrics apply to tables 3 and 4.
We conduct the comparison on three public ASR datasets and a large Mandarin ASR dataset (CN12000h) that contains about 12000 hours of training data. The three public ASR datasets are the English read-speech dataset Librispeech with 960 hours of training data, the Mandarin read-speech dataset AISHELL-2 with 1000 hours of training data, and the Mandarin telephone ASR benchmark HKUST with 168 hours of training data. Other details about the usage of the three datasets are introduced in . For the CN12000h dataset, the training set covers data from different application scenarios, so we use three representative test sets to evaluate the performance of the models and use one of them (test_set 2) as the development set.
The same input and output settings are applied to the models on the three public datasets. For the CN12000h dataset, the input features are 29-dimensional filterbanks extracted with a 25 ms window shifted every 10 ms, extended with delta and delta-delta, and globally normalized. The output labels cover 4733 classes, including 4701 Chinese characters, 26 uppercase English letters, 3 special markers (noise, etc.), the blank, the end-of-sentence label and the pad. The above settings are kept the same for the two models compared in this work.
For the CIF-based model, the encoder uses the 2-layer convolutional front-end and the pyramid of self-attention networks (SAN) in , where the head number, the hidden size and the inner size of the SAN use one setting for all Mandarin datasets and another for Librispeech, and the layer number of each pyramid block is set to 5 for all datasets, so the number of SAN encoder layers is 15. The 1-dimensional convolutional layer that predicts the weights uses a convolutional width of 3. Besides, layer normalization and a ReLU activation are applied after this convolution. The decoder uses the autoregressive decoder in , and the number of SAN decoder layers is 2. The coefficient on the CTC loss is set to 0.5 for all Mandarin datasets and 0.25 for Librispeech, and the coefficient on the quantity loss is set to 1.0.
The transformer uses the same SAN hyper-parameters as the CIF-based model. Its encoder uses the same structure as the CIF-based model, except that the layer number of each pyramid block is set to 4 for all datasets, so the number of SAN encoder layers is 12. The decoder uses 6 decoder blocks. The layer numbers of the encoder and the decoder follow the best setting in [9, 10], and they also give the two models a similar amount of parameters. The coefficient on the CTC loss is kept the same as in the CIF-based model.
The language model (LM) also uses the SAN structure, and the number of SAN layers is set to 3, 6, 20 and 6 for HKUST, AISHELL-2, Librispeech and CN12000h, respectively. The above models are implemented in TensorFlow.
We use almost the same training and inference strategy for the two models. In training, we batch utterances of similar frame length together, filling about 20000 frames per batch for the three public datasets and 37500 frames (the maximum for a P6000 GPU) per batch for CN12000h. We use 4, 8, 8 and 8 GPUs to train the models on HKUST, AISHELL-2, Librispeech and CN12000h, respectively. We use the optimizer and the varied learning rate formula in , where the warmup step is set to 25000 for HKUST and 36000 for the other datasets, and the global coefficient is set to 4.0. We only apply dropout to the SAN, whose attention dropout and residual dropout are both set to 0.2, except for the models on CN12000h, where they are set to 0.1. We use the uniform label smoothing in and set it to 0.2 for both the ASR models and the language models. Parallel scheduled sampling (PSS) with a constant sampling rate of 0.5 is applied to the models on the three Mandarin datasets. After training, we average the latest 10 checkpoints for inference. In inference, we use beam search with a beam size of 10 for all models. For the CIF-based model, the coefficient for LM rescoring is set to 0.1, 0.2, 0.9 and 0.2 for HKUST, AISHELL-2, Librispeech and CN12000h, respectively. For the transformer, this hyper-parameter is set to 0.6, 0.9, 2.0 and 0.6 for HKUST, AISHELL-2, Librispeech and CN12000h, respectively.
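The varied learning rate formula with warmup and a global coefficient can be sketched as below. This follows the schedule popularized by the original transformer paper (linear warmup, then inverse-square-root decay); the hidden size d_model=512 here is an assumed value, not one stated in the text:

```python
def learning_rate(step, d_model=512, warmup=25000, k=4.0):
    """Warmup-then-decay learning-rate schedule in the style of the
    transformer: rises linearly for `warmup` steps, then decays as the
    inverse square root of the step; k is the global coefficient."""
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The schedule peaks exactly at the warmup step, so a larger warmup value trades a lower peak learning rate for a longer, gentler ramp-up.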
In this section, we present the results of the two compared models on accuracy, speed and generalization. The two models are evaluated in the offline mode, and use almost the same training strategy, similar model structures and similar amounts of parameters to make the comparison as fair as possible; the details are introduced in section 3.
4.1 Comparison on accuracy performance
As shown in table 1, the transformer performs better on three of the four datasets (Librispeech, HKUST and CN12000h), while the CIF-based model only shows better performance on AISHELL-2, a Mandarin read-speech dataset with clear acoustic boundaries between labels. To the best of our knowledge, the result of the CIF-based model on AISHELL-2 and the result of the transformer on HKUST are the best reported accuracy on the respective datasets, so the comparison is conducted between two very strong models with different synchronous modes.
From the perspective of the synchronous mode, the accuracy difference between the two models can be attributed to two aspects: 1) since the transformer is a label-synchronous model and extracts the acoustic information of each label from a global view, it can make comprehensive use of the encoded acoustic information and capture more acoustic details for decoding. In contrast, the CIF-based model works in a frame-by-frame manner and does not cache the processed frames like , so it integrates acoustic information from a limited local view, which may affect its modeling expressiveness to some extent; 2) since the CIF-based model is a soft frame-synchronous model that needs to locate the acoustic boundary during processing, it performs slightly worse on datasets with blurred acoustic boundaries between labels, whereas the transformer is less affected by the clearness of acoustic boundaries. In addition, the multi-layer encoder-decoder attention of the transformer (from its decoder blocks) also benefits its extraction of acoustic information and promotes better accuracy.
4.2 Comparison on speed performance
The accuracy advantage of the transformer is partly due to its global encoder-decoder attention. However, attending to every encoder step inevitably brings many unnecessary calculations on steps that are acoustically irrelevant to the decoded label, which may affect its calculation speed. Thus we also compare the speed performance, including the training speed and the real time factor (RTF), defined as inference time divided by audio time.
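For concreteness, the RTF metric just defined amounts to a one-line computation:

```python
def real_time_factor(inference_seconds, audio_seconds):
    """RTF as defined above: decoding time divided by audio duration.
    An RTF below 1.0 means the system decodes faster than real time."""
    return inference_seconds / audio_seconds
```

For example, decoding 10 seconds of audio in 2 seconds gives an RTF of 0.2.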
As shown in table 2, the transformer has a faster training speed than the CIF-based model. Since both of them use SAN-based encoders and decoders, which are computed in a highly parallel manner, the gap in training speed can be attributed to the frame-by-frame incremental calculation of the CIF. Even so, since the incremental calculation of the CIF is lightweight, the training efficiency of the CIF-based model is still considerable.
Besides, the CIF-based model obtains a much lower RTF than the transformer, corresponding to about 4.4-6.8 times faster inference. Since the two models are compared in the offline mode, both of them perform a one-time encoder calculation and multiple decoder calculations; the heavy decoder of the transformer, which takes O(U·T) cost to extract the acoustic information, inevitably brings a large amount of calculation. In contrast, the CIF-based model, which takes O(T) cost to integrate the acoustic information, has a much lighter inference calculation, which is one of the advantages of this type of frame-synchronous model.
4.3 Comparison on generalization
In addition to accuracy and speed, we also wonder whether the synchronous mode affects the modeling generalization in some special cases.
We first compare the two models on long utterances, which are generated by randomly concatenating multiple utterances of the same speaker in the test set; each duration bucket covers at least 50 utterances. For a fair comparison, we do not use the long-utterance strategies designed for specific models in [33, 34].
As shown in table 3, the CIF-based model performs more stably on the long utterances, and its deletion and insertion errors stay at a comparable level across all durations. In contrast, the deletion errors of the transformer increase rapidly as the utterance length grows and become the main reason for its performance degradation on the long utterances. We also notice that the performance gap between the two models at the same durations differs between the two datasets. We suspect this is because the average length of training utterances in AISHELL-2 (3.55 seconds) is shorter than in Librispeech (12.28 seconds).
Then we compare the two models on repeated utterances and noisy utterances. The repeated utterances are generated by randomly selecting 1/10 of the utterances of each test set and concatenating the same utterance repeatedly; the number of repetitions is 1, 2, 3 and 4 (denoted correspondingly in table 4). The noisy utterances are generated by mixing every utterance in the test set with one of the 115 noise types in , with the signal-to-noise ratio (SNR) ranging from 0 to 20.
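The construction of these two stress-test sets can be sketched as below. This is a simplified numpy sketch under our own naming (the actual pipeline selects noise clips from the cited noise set and works on real recordings); the noise is rescaled so that the mixture hits the target SNR in dB:

```python
import numpy as np

def repeat_utterance(waveform, n_repeats):
    """Concatenate the same waveform n_repeats times (repeated test case)."""
    return np.concatenate([waveform] * n_repeats)

def mix_with_noise(speech, noise, snr_db):
    """Additively mix speech with an equal-length noise clip at a target
    SNR (dB), by rescaling the noise power relative to the speech power."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

At SNR 0 dB the scaled noise carries as much power as the speech, while at 20 dB it carries one hundredth of the speech power.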
As shown in table 4, the transformer encounters a large performance degradation as the number of repetitions increases, especially in the settings with more repetitions. Most of the errors are insertion errors, which come from many cases where too many repetitions are predicted; there are also some cases where too few repetitions are predicted, but not many. We suspect the degradation is due to the decoding confusion brought by the repeated, similar decoder states, which makes it hard for the transformer to predict the end-of-sentence label at the correct step. In contrast, the CIF-based model performs stably on the repeated utterances and only encounters a slight performance degradation. Besides, as the number of repetitions grows, the inference time of the CIF-based model on Librispeech grows roughly linearly, while that of the transformer grows faster, which is basically consistent with the computational complexities of the two models. On the noisy utterances, the two models show similar performance.
The poor generalization of the transformer on the above cases can be explained from two aspects: 1) the transformer uses its decoder state as the query over the encoded representations to extract the relevant acoustic information, so its recognition is affected by both the decoder state and the encoded representations, and changes in either of them on special cases bring great challenges to its recognition. In contrast, the integration of acoustic information in the CIF-based model is affected only by the encoded representations; 2) the transformer mainly relies on the dependency between the decoder states to determine when to stop, which is hard to achieve when it suffers from decoding confusion in some cases. In contrast, the CIF-based model has a clear stop signal: the last frame being processed. Based on the above, we believe frame-synchronous models may be more suitable for ASR application scenarios that need to cover a wider range of speech.
In this work, we make a detailed comparison between a label-synchronous model (transformer) and a soft frame-synchronous model (the CIF-based model). Through experiments on multiple datasets, we find: 1) the label-synchronous model achieves slightly better accuracy on most datasets, which can be attributed to its comprehensive use of acoustic information and its insensitivity to the clearness of acoustic boundaries; 2) the frame-synchronous model achieves 4.4-6.8 times faster inference, which is determined by its frame-by-frame calculation and linear computational complexity; 3) the frame-synchronous model also achieves better generalization on the special cases. Since there must be one driving end in end-to-end ASR models, we hope the comparison of the two types of models can benefit the selection and design of end-to-end models in practical ASR applications.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in International Conference on Learning Representations (ICLR), 2015.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
-  J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 577–585.
-  W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
-  N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, “A neural transducer,” arXiv preprint arXiv:1511.04868, 2015.
-  C. Raffel, M.-T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online and linear-time attention by enforcing monotonic alignments,” in International Conference on Machine Learning (ICML), 2017, pp. 2837–2846.
-  C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” arXiv preprint arXiv:1712.05382, 2017.
-  J. Hou, S. Zhang, and L.-R. Dai, “Gaussian prediction based attention for online end-to-end speech recognition.” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 3692–3696.
-  L. Dong, S. Xu, and B. Xu, “Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5884–5888.
-  S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang et al., “A comparative study on transformer vs rnn in speech applications,” arXiv preprint arXiv:1909.06317, 2019.
-  H. Miao, G. Cheng, C. Gao, P. Zhang, and Y. Yan, “Transformer-based online ctc/attention end-to-end speech recognition architecture,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6084–6088.
-  X. Chang, W. Zhang, Y. Qian, J. L. Roux, and S. Watanabe, “Mimo-speech: End-to-end multi-channel multi-speaker speech recognition,” arXiv preprint arXiv:1910.06522, 2019.
-  G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao, “Deep context: end-to-end contextual speech recognition,” in IEEE Spoken Language Technology Workshop (SLT), 2018, pp. 418–425.
-  A. Bruguier, R. Prabhavalkar, G. Pundak, and T. N. Sainath, “Phoebe: Pronunciation-aware contextualization for end-to-end speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6171–6175.
-  E. Battenberg, J. Chen, R. Child, A. Coates, Y. G. Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducers for end-to-end speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 206–213.
-  A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in International Conference on Machine Learning (ICML), 2006, pp. 369–376.
-  A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
-  H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 1298–1302.
-  N. Moritz, T. Hori, and J. Le Roux, “Triggered attention for end-to-end speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5666–5670.
-  M. Li, M. Liu, and H. Masanori, “End-to-end speech recognition with adaptive computation steps,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6246–6250.
-  L. Dong and B. Xu, “Cif: Continuous integrate-and-fire for end-to-end speech recognition,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6079–6083.
-  R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, “A comparison of sequence-to-sequence models for speech recognition.” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 939–943.
-  L. Dong, F. Wang, and B. Xu, “Self-attention aligner: A latency-control end-to-end model for asr using self-attention network and chunk-hopping,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5656–5660.
-  V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.
-  J. Du, X. Na, X. Liu, and H. Bu, “Aishell-2: transforming mandarin asr research into industrial scale,” arXiv preprint arXiv:1808.10583, 2018.
-  Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff, “Hkust/mts: A very large scale mandarin telephone speech corpus,” in International Symposium on Chinese Spoken Language Processing (ISCSLP), 2006, pp. 724–735.
-  D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “Specaugment: A simple data augmentation method for automatic speech recognition,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2019, pp. 2613–2617.
-  J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
-  J. Chorowski and N. Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” in Annual Conference of the International Speech Communication Association (INTERSPEECH), 2017, pp. 523–527.
-  D. Duckworth, A. Neelakantan, B. Goodrich, L. Kaiser, and S. Bengio, “Parallel scheduled sampling,” arXiv preprint arXiv:1906.04331, 2019.
-  N. Moritz, T. Hori, and J. L. Roux, “Streaming automatic speech recognition with the transformer model,” arXiv preprint arXiv:2001.02674, 2020.
-  C.-C. Chiu, W. Han, Y. Zhang, R. Pang, S. Kishchenko, P. Nguyen, A. Narayanan, H. Liao, S. Zhang, A. Kannan et al., “A comparison of end-to-end models for long-form speech recognition,” arXiv preprint arXiv:1911.02242, 2019.
-  P. Zhou, R. Fan, W. Chen, and J. Jia, “Improving generalization of transformer for speech recognition with parallel schedule sampling and relative positional embedding,” arXiv preprint arXiv:1911.00203, 2019.
-  Y. Xu, J. Du, Z. Huang, L.-R. Dai, and C.-H. Lee, “Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement,” arXiv preprint arXiv:1703.07172, 2017.