The prevailing end-to-end models are driving automatic speech recognition (ASR) systems to become more simplified and practical. These models directly transform the input speech features to the output text with a single neural network, which integrates the functions of the acoustic model, the acoustic-to-linguistic alignment mechanism and the language model into one. Two main branches have gradually formed with the evolution of end-to-end models. One of them uses the connectionist temporal classification (CTC) [Graves et al., 2006] or its extensions (RNN-T [Graves, 2012], RNA [Sak et al., 2017]) as the alignment mechanism, which generates a ‘hard’ and ‘monotonic’ alignment with each frame labeled (possibly with the blank label) and is optimized by maximizing the sum of the probabilities of all alignments that map to the targets. The other branch is under the encoder-decoder framework and uses the attention-based alignment mechanism [Chan et al., 2016, Jaitly et al., 2016], which generates a ‘soft’ but ‘non-monotonic’ alignment by calculating a weight for each position in the globally [Chan et al., 2016] or locally [Jaitly et al., 2016] encoded representations, then extracting the sum of weighted representations for decoding.
Despite the great success, these mainstream end-to-end models neglect the positioning of token (word, word piece, etc.) boundaries in continuous speech, which, however, is considered a crucial step in the language learning of infants [Jusczyk, 1999]. The importance of locating boundaries is also reflected in foreign language learning, where many learners experience difficulty in hearing where one word ends and another begins, leading to error-prone recognition of speech content. Moreover, positioning a token boundary can be regarded as a pre-step for instant speech recognition, which is required by various online ASR scenarios. Based on these findings, we believe it is worthwhile to explore an alignment mechanism that addresses the positioning of token boundaries by forwardly integrating the acoustic information, and fires the integrated information for instant recognition once a boundary is located.
There are strong similarities between the above alignment mechanism and the integrate-and-fire neuron model [Lapicque, 1907, Abbott, 1999], one of the most canonical models for analyzing the behaviour of neural systems [Burkitt, 2006], which works by integrating the stimulations from the input signal during a period and firing an action potential (spike) when its membrane potential reaches a threshold value. However, the discontinuous spike information retards the penetration of the integrate-and-fire idea into end-to-end models that are optimized with back-propagation. Here, we take a small step forward by simulating the integrate-and-fire process with vector information processed by continuous functions.
In this work, we propose Continuous Integrate-and-Fire (CIF), a novel ‘soft’ and ‘monotonic’ alignment mechanism to be utilized in the encoder-decoder framework. As the connection between the encoder and decoder, it first calculates a weight (which represents the amount of acoustic information) for each incoming encoded acoustic representation. Then, it forwardly integrates the information in the acoustic representations until the accumulated weight reaches a threshold, which means a boundary is located. At this point, it divides the information in this boundary frame into two parts: one for completing the integration of the current token and the other for the subsequent integration, which mimics the processing of the integrate-and-fire neuron model when its membrane potential reaches a threshold at some point during the period of an encoded frame. After that, it fires the integrated acoustic information to the decoder to predict the current token. This process is illustrated in Figure 1(c) and loops until the end of the recognition.
In the process of implementing the CIF-based model, inaccurate positioning often occurs and brings difficulties to both training and inference. In training, it may cause unequal lengths between the predicted tokens and the targeted tokens, thus hindering the cross-entropy training. To solve that, we introduce a scaling strategy on the calculated weights to teacher-force the CIF to produce the same number of tokens as the target during training. We also present a loss function to supervise the quantity of the produced tokens to be closer to that of the target for better positioning. In inference, inaccurate positioning causes some useful but insufficient information to be left at the tail, which leads to the appearance of incomplete words at the end of the recognition result. To alleviate that, we present a rounding method to decide whether to make an additional firing based on the residual weight during inference, and introduce an extra token at the tail of the target sequence to mark the end of sentence and provide tolerance during training.
Evaluated on multiple ASR datasets covering different languages and speech types, the CIF-based model shows stable convergence and competitive performance. On the Librispeech dataset, where the text is converted to sequences of word pieces with relatively blurred boundaries between them, the CIF-based model still achieves a word error rate (WER) of 3.70% on test-clean, which matches the results of most end-to-end models while keeping the potential of instant speech recognition and using a lower encoded frame rate. On the Mandarin ASR datasets, the CIF-based model exhibits impressive performance due to the relatively clear boundaries between Mandarin characters: it achieves a new state-of-the-art character error rate (CER) of 6.69% on test-ios of the read dataset AISHELL-2 and a competitive CER of 24.71% on the spontaneous dataset HKUST.
2 Related Work
The soft and monotonic acoustic-to-linguistic alignment mechanism may be a preferable choice for the end-to-end model in ASR. On the one hand, the soft characteristic enables the model to extract information from the relevant acoustic representations based on the calculated weights, thus utilizing the acoustic information more directly and comprehensively. On the other hand, the monotonic characteristic fits the left-to-right nature of the ASR task, thus enabling the model to conduct instant speech recognition and run at lower computational complexity by avoiding calculations on irrelevant positions.
Some works [Hou et al., 2017, Tjandra et al., 2017] assume the alignment to be a forward-moving window that fits a Gaussian distribution, where the center and width of the Gaussian window are predicted by the decoder state. Differing from them, the CIF neither introduces such an assumption nor uses the state of the decoder, thus encouraging more pattern learning from the acoustic data without the assumption restriction. In addition, the CIF provides a concise calculation process by conducting the locating and integrating at the same time, rather than as in [Chiu and Raffel, 2017, Fan et al., 2018], which perform soft attention over small chunks of memory preceding where a hard monotonic attention mechanism decides to stop. Besides, the CIF-based model is trained from scratch and doesn’t need a trained CTC model to conduct pre-partition before decoding as in [Moritz et al., 2019]. In [Li et al., 2019a], Li et al. present the important Adaptive Computation Steps (ACS) algorithm to dynamically decide how many frames should be processed to predict a linguistic output. Their method is like locating ‘hard’ boundaries at the encoded frame level, which causes insufficient usage of the acoustic information in the boundary frame. In contrast, the CIF mimics the integrate-and-fire neuron model and assumes that locating the boundary (firing) occurs at some point within the period of an encoded frame, thus locating ‘soft’ boundaries at a finer time granularity (inside the encoded frame) and integrating the acoustic information more sufficiently. Besides, the ACS does not present constructive solutions to the inaccurately computed frames; these problems are probably the main reason why their model has a large performance gap from a DNN-HMM model. In comparison, our model introduces multiple strategies to alleviate the difficulties brought by the inaccurate positioning of the CIF, thus supporting effective training and showing competitive performance.
3 Model Architecture
Continuous Integrate-and-Fire (CIF) is a ‘soft’ and ‘monotonic’ alignment mechanism employed in the encoder-decoder architecture. It is suitable for many sequence transduction tasks with a left-to-right nature (ASR, scene text recognition, grapheme-to-phoneme, etc.). In this paper, we focus on the ASR task and illustrate the architecture of our CIF-based model in Figure 2.
As shown in Figure 2, the encoder transforms the speech features X = (x_1, ..., x_T) to the high-level acoustic representations H = (h_1, ..., h_U), where U < T due to the temporal down-sampling. Then, the CIF part consumes H in the left-to-right manner to produce the integrated acoustic representations C = (c_1, ..., c_S), where c_i could be regarded as the acoustic embedding of the i-th token y_i in the output sequence Y = (y_1, ..., y_S). When c_i is produced, the decoder takes it and maps it to the probability distribution over the i-th token. Three loss functions are placed on the encoder, the CIF part and the decoder respectively to offer sufficient supervision for the training. Besides, an external language model is incorporated to further improve the model performance. More details are described in the following sections.
3.1 Encoder
The encoder uses a convolutional front-end and a pyramid structure composed of self-attention networks (SANs) that have shown competitiveness in ASR [Salazar et al., 2019, Dong et al., 2019]. The convolutional front-end employs the structure in [Dong et al., 2018], which utilizes a 2-dimensional strided convolutional network to conduct temporal down-sampling by 2, and a multiplicative unit (MU) [Kalchbrenner et al., 2016] to further capture acoustic details. The 2-dimensional outputs are then flattened and projected as the input of the pyramid structure composed of SANs. Two temporal pooling layers with width 2 are uniformly inserted between the stacked SANs to encourage effective encoding at each temporal resolution; they further reduce the temporal sampling rate to 1/8 of the original and bring lighter learning and inference. After the modeling of the pyramid structure, the encoded acoustic representations H are obtained.
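The resulting frame-rate arithmetic can be illustrated with a toy sketch (plain Python; the `downsample` helper is a stand-in for the actual strided convolution and pooling layers, which it does not implement):

```python
# Toy illustration of the encoder's temporal down-sampling path: one strided
# step halves the frame rate, and two pooling layers of width 2 halve it
# twice more, giving 1/8 of the original rate overall.

def downsample(frames, factor):
    """Keep every `factor`-th frame (stand-in for strided conv / pooling)."""
    return frames[::factor]

T = 800                       # e.g. 8 s of speech at a 10 ms shift (100 fps)
frames = list(range(T))

after_conv = downsample(frames, 2)       # strided convolution: 1/2
after_pool1 = downsample(after_conv, 2)  # first temporal pooling: 1/4
after_pool2 = downsample(after_pool1, 2) # second temporal pooling: 1/8

print(len(after_pool2))  # 100 encoded frames, i.e. 12.5 frames per second
```

This 12.5 Hz encoded frame rate is the "lower encoded frame rate" referred to in the experiments.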
3.2 Continuous Integrate-and-Fire
The Continuous Integrate-and-Fire (CIF) part produces the acoustic embeddings of the output sequence by integrating the information in H step by step. Specifically, at step k, it first calculates a weight α_k for the incoming encoded representation h_k, where the weight means the amount of acoustic information carried by h_k, and is calculated by first using a 1-dimensional convolution to capture the local dependencies around h_k, then using a projection layer with sigmoid activation to extract a scalar between 0 and 1.
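This weight predictor can be sketched with NumPy as follows (a simplified illustration under assumed shapes, with randomly initialized parameters; the function and variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def cif_weights(h, conv_w, conv_b, proj_w, proj_b):
    """Hypothetical sketch of the CIF weight predictor: a width-3 1-D
    convolution over the encoded frames, a ReLU, and a sigmoid projection,
    giving one scalar weight in (0, 1) per frame."""
    T, d = h.shape
    width = conv_w.shape[0]
    pad = width // 2
    h_pad = np.pad(h, ((pad, pad), (0, 0)))     # same-length convolution
    conv = np.stack([
        np.tensordot(h_pad[t:t + width], conv_w, axes=([0, 1], [0, 1])) + conv_b
        for t in range(T)
    ])                                          # (T, d) local-context features
    conv = np.maximum(conv, 0.0)                # ReLU after the convolution
    logits = conv @ proj_w + proj_b             # (T,) one scalar per frame
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid -> weights in (0, 1)

T, d = 6, 4
h = rng.standard_normal((T, d))
alpha = cif_weights(h,
                    conv_w=rng.standard_normal((3, d, d)) * 0.1,
                    conv_b=np.zeros(d),
                    proj_w=rng.standard_normal(d) * 0.1,
                    proj_b=0.0)
print(alpha.shape)  # one weight per encoded frame
```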
To determine whether a boundary is located at step k, the weight α_k is added to the previous residual weight to get the current accumulated weight α_k^a. If α_k^a is less than the given threshold value β, no boundary is located: α_k^a is used as the residual weight for the next step, and the current integrated state is updated as h_k^a = h_{k-1}^a + α_k · h_k and used as the residual state for the next step. If α_k^a is not less than β, it means a token boundary is located: the weight α_k is divided into two parts, α_k1 = β − α_{k-1}^a for completing the integration of the current token and α_k2 = α_k − α_k1 for the subsequent integration, and the fired embedding c_i, the residual weight α_k^a and the residual state h_k^a are calculated as follows:

c_i = h_{k-1}^a + α_k1 · h_k,    α_k^a = α_k2,    h_k^a = α_k2 · h_k
where c_i is fired to the decoder as the integrated acoustic information corresponding to the current token y_i. The above calculations are looped over until the end of the utterance, and make the CIF-based model perform under a linear-time complexity of O(U).
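The integrate-and-fire loop can be sketched in NumPy (a minimal illustration with made-up inputs; the variable names are ours, and the threshold is set to 1.0 here purely for readability):

```python
import numpy as np

def cif(h, alpha, beta=1.0):
    """Sketch of continuous integrate-and-fire: integrate weighted frames
    until the accumulated weight reaches the threshold `beta`, then split
    the boundary frame's weight and fire the integrated embedding."""
    fired = []
    acc_w = 0.0                        # residual accumulated weight
    acc_h = np.zeros_like(h[0])        # residual integrated state
    for a_k, h_k in zip(alpha, h):
        if acc_w + a_k < beta:         # no boundary inside this frame
            acc_w += a_k
            acc_h = acc_h + a_k * h_k
        else:                          # boundary located inside this frame
            a_fill = beta - acc_w      # portion that completes the token
            fired.append(acc_h + a_fill * h_k)
            acc_w = a_k - a_fill       # leftover weight starts the next token
            acc_h = acc_w * h_k
    return fired, acc_w, acc_h

h = np.eye(4)                          # 4 frames, one-hot for readability
alpha = np.array([0.6, 0.6, 0.9, 0.4])
fired, res_w, _ = cif(h, alpha, beta=1.0)
print(len(fired))  # 2 boundaries fired; a residual weight of 0.5 remains
```

Note how the second frame's weight is split: 0.4 of it completes the first token and 0.2 carries over, mimicking a firing point inside the frame.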
In the training, the length of the produced C may differ from the length S of the target sequence, thus bringing difficulties for the cross-entropy training. To solve this problem, we introduce a scaling strategy that multiplies the calculated weights α by the scalar S / Σ_k α_k to generate the scaled weights whose sum is equal to S, thus teacher-forcing the CIF to produce C with length S for more effective training.
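A minimal sketch of this scaling strategy (function and variable names are ours):

```python
import numpy as np

def scale_weights(alpha, target_len):
    """Training-time scaling sketch: multiply the predicted weights so they
    sum exactly to the target length, teacher-forcing the number of firings."""
    return alpha * (target_len / alpha.sum())

alpha = np.array([0.3, 0.8, 0.5, 0.9, 0.5])   # raw weights, sum = 3.0
scaled = scale_weights(alpha, target_len=4)   # force exactly 4 firings
print(scaled.sum())  # 4.0
```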
In the inference, some weight that is not enough to trigger a firing but still useful may be left at the tail of the utterance, which can cause the appearance of incomplete words at the end of predictions. To alleviate this tail problem, we present a rounding method that makes an additional firing if the last residual weight is greater than 0.5 during inference. We also introduce a token <EOS> at the tail of the target sequence to mark the end of sentence and provide tolerance during training.
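The rounding method can be sketched as follows (the exact rule and names are assumptions based on the description above):

```python
# Inference-time tail handling sketch: if the residual weight left after the
# last frame exceeds 0.5, make one additional firing from the residual
# integrated state instead of discarding it.
def maybe_fire_tail(fired, residual_weight, residual_state):
    if residual_weight > 0.5:
        fired.append(residual_state)
    return fired

fired = [[1.0, 0.0]]                                  # embeddings fired so far
fired = maybe_fire_tail(fired, residual_weight=0.7,
                        residual_state=[0.2, 0.5])
print(len(fired))  # 2: the tail triggered one extra firing
```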
3.3 Decoder
The decoder also uses SANs to capture the positional dependencies. Two versions of the decoder are investigated in this work. Figure 2 shows our better-performing version: the autoregressive (AR) decoder, which follows the decoder networks in [Dong et al., 2019] and models the distribution of the i-th token conditioned on the previously emitted tokens and the fired acoustic embeddings, i.e. p(y_i | y_1, ..., y_{i-1}, c_1, ..., c_i).
However, such an autoregressive property leads to low parallelization and slow inference. To alleviate that, we introduce a non-autoregressive (NAR) decoder, which just inputs C to the SANs to generate the probability distribution of each token independently and in parallel, i.e. p(y_i | c_1, ..., c_S).
3.4 Loss functions
In addition to the cross-entropy loss L_CE, we introduce two auxiliary loss functions to provide more supervision for better training. Specifically, we place a CTC loss L_CTC on the encoder (similar to [Kim et al., 2017]) to promote the left-to-right acoustic modelling. Besides, we introduce a loss function on the CIF part to supervise the boundary positioning and make the quantity of predicted tokens closer to the quantity in the target. We term it the quantity loss L_QUA, which is defined as L_QUA = |Σ_k α_k − S|, where S is the quantity of the targeted tokens. Thus our model is trained under multi-task learning as follows:

L = L_CE + λ_1 · L_CTC + λ_2 · L_QUA
where λ_1 and λ_2 are tunable hyper-parameters.
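The loss combination can be sketched as follows (the function names and the concrete loss values are ours; the λ defaults mirror the Mandarin-dataset settings reported later):

```python
import numpy as np

def quantity_loss(alpha, target_len):
    """Quantity loss sketch: |sum(alpha) - S| pushes the number of fired
    tokens towards the target length S."""
    return abs(alpha.sum() - target_len)

def total_loss(ce, ctc, qua, lambda1=0.5, lambda2=1.0):
    """Multi-task combination of cross-entropy, CTC and quantity losses."""
    return ce + lambda1 * ctc + lambda2 * qua

alpha = np.array([0.9, 0.8, 0.7, 0.9])       # predicted weights, sum = 3.3
l_qua = quantity_loss(alpha, target_len=3)   # |3.3 - 3| = 0.3
print(round(total_loss(ce=2.0, ctc=1.0, qua=l_qua), 6))  # 2.8
```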
3.5 Incorporating with Language Model
To further boost the performance of our model, we incorporate an SAN-based language model (LM) by performing second-pass rescoring via the log-linear interpolation in [Chiu et al., 2018]. Given the hypotheses Y* produced by beam search, we determine the final transcript as:

Y = argmax_{Y*} [ log P(Y*|X) + λ · log P_LM(Y*) ]
where λ is a hyper-parameter tuned on the development dataset.
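A small sketch of this second-pass rescoring (the hypotheses and scores are invented for the illustration):

```python
# Log-linear second-pass rescoring sketch: among the beam-search hypotheses,
# pick the one maximizing log P(y|x) + lam * log P_LM(y).
def rescore(hypotheses, lam):
    return max(hypotheses,
               key=lambda h: h["log_p_asr"] + lam * h["log_p_lm"])

beam = [
    {"text": "hello word",  "log_p_asr": -1.0, "log_p_lm": -6.0},
    {"text": "hello world", "log_p_asr": -1.2, "log_p_lm": -2.0},
]
best = rescore(beam, lam=0.3)
print(best["text"])  # "hello world": -1.2 + 0.3*(-2.0) = -1.8 beats -2.8
```

The LM weight λ trades acoustic confidence against linguistic plausibility; here it rescues the fluent hypothesis despite its slightly worse acoustic score.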
4 Experiments
4.1 Experimental Setup
We evaluate our approach on three public ASR datasets including two read speech corpora (Librispeech [Panayotov et al., 2015] and AISHELL-2 [Du et al., 2018]) and a spontaneous speech corpus (HKUST [Liu et al., 2006]). On Librispeech, we use all the available train data (960 hours) for training, put the two development subsets together for validation, and leave the two test subsets only for evaluation. Besides, we train our language model on the separately prepared language-model training data, which is available at http://www.openslr.org/11 together with the above speech data. On the Mandarin ASR dataset AISHELL-2, we use all the available train data (1000 hours) for training, put the three development subsets together for validation, and leave the three test subsets only for evaluation. The speech data of AISHELL-2 is available through an application process at http://www.aishelltech.com/aishell_2. The HKUST corpus (LDC2005S15, LDC2005T32) consists of a training set and a development set, which add up to about 178 hours of Mandarin telephone conversation speech. We extract about 5 hours from the original training set for tuning the hyper-parameters, use the remaining training data for training, and use the original development set only for evaluation. In addition, the training of the language model on HKUST and AISHELL-2 only uses the text data from the respective training set.
We extract input features using the Kaldi [Povey et al., 2011] recipe: we extract 40-dimensional mel-filterbanks from a 25ms sliding window with a 10ms shift, extend them with delta and delta-delta features, and apply per-speaker normalization and global normalization for all three datasets. We also apply speed perturbation [Ko et al., 2015] with a fixed 10% for data augmentation. As for the output tokens, we use word pieces for Librispeech and characters for AISHELL-2 and HKUST. Specifically, we use the BPE toolkit (https://github.com/rsennrich/subword-nmt) [Sennrich et al., 2015] to generate 3722 word pieces from the training set of Librispeech by setting the number of merge operations to 7500; the blank label is also included in the output vocabulary.
We implement our model on the TensorFlow [Girija, 2016] platform. The self-attention networks (SANs) in our model leverage the implementation in [Dong et al., 2019] and use the same network hyper-parameters for all datasets. In the encoder, the convolutional front-end uses the same configuration as in [Dong et al., 2019], and the number of SAN layers in the pyramid structure is set to 5 for all datasets. For fair comparison with other results, our encoder calculates bi-directionally. Forward-only encoding is applicable by introducing reasonable masking in the SANs, which is left as our future work. In the CIF part, the number of filters in the 1-dimensional convolutional layer matches the model dimension, and the convolutional width is set to 3 for the two Mandarin datasets and to 5 for Librispeech. Layer normalization [Ba et al., 2016] and a ReLU activation are applied after the convolution. The firing threshold β is set to 0.9. In the decoder, the number of layers is set to 2 for the two Mandarin datasets and to 3 for Librispeech. The multi-task hyper-parameter λ_1 is set to 0.5 for the two Mandarin datasets and to 0.25 for Librispeech (making the CTC loss value about 0.5-0.7 of the CE loss), and λ_2 is set to 1.0 for all datasets. The language models (LM) in our experiments are also constructed using SANs whose network hyper-parameters are kept the same as the encoder-decoder model. The number of SAN layers is set to 3, 6, 15 for HKUST, AISHELL-2 and Librispeech, respectively.
In the training, we batch utterances with similar numbers of frames together and let each batch contain about 20000 frames. We use the optimizer and the varied learning-rate formula in [Vaswani et al., 2017], where the warmup step is set to 25000 for Librispeech and AISHELL-2 and to 16000 for HKUST, and the global coefficient on the varied learning rate is set to 4.0. We use two regularization methods: dropout and label smoothing. We only apply dropout to the self-attention networks (SANs), whose attention dropout and residual dropout are set to 0.1 for Librispeech and AISHELL-2 and to 0.2 for HKUST. We use the uniform label smoothing in [Chorowski and Jaitly, 2016] and set it to 0.2. In the training of the language model, both dropout rates are set to 0.2 and the uniform label smoothing is set to 0.2 for all datasets. Scheduled sampling [Bengio et al., 2015] with a constant sampling probability of 0.5 is applied on the two Mandarin datasets. After training, we average the newest 10 checkpoints for inference.
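The varied learning-rate schedule from [Vaswani et al., 2017] with this paper's warmup and global coefficient can be sketched as follows (the model dimension is not stated in this chunk, so 512 is a placeholder assumption):

```python
# Sketch of the "Attention is All You Need" learning-rate schedule with the
# paper's warmup step (25000 for Librispeech/AISHELL-2) and global
# coefficient 4.0; d_model=512 is an assumed placeholder.
def transformer_lr(step, d_model=512, warmup=25000, coeff=4.0):
    return coeff * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak_step = 25000  # the schedule rises linearly, peaks at the warmup step,
                   # then decays proportionally to 1/sqrt(step)
print(transformer_lr(peak_step) > transformer_lr(1))        # warmed up
print(transformer_lr(peak_step) > transformer_lr(100000))   # then decays
```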
In the inference, we use beam search with beam size 10 for all datasets. The hyper-parameter λ for incorporating the language model is set to 0.2, 0.3, 0.9 for HKUST, AISHELL-2 and Librispeech, respectively. We evaluate the results using word error rate (WER) for Librispeech and character error rate (CER) for the two Mandarin datasets. Each experiment is run 3 times, and all experimental results are presented in the form of mean and standard deviation in the following.
4.2 Experimental Results
4.2.1 Results on Read Speech
Since Mandarin characters are monosyllabic and have relatively clear boundaries, we first evaluate our model on AISHELL-2, the recently released and largest public Mandarin ASR dataset. As shown in Table 1, the CIF-based model performs competitively on all of the test sets and significantly improves on the result achieved by the Chain model [Povey et al., 2016].
| Model | LM | test-android | test-ios | test-mic |
|---|---|---|---|---|
| Chain-TDNN [Povey et al., 2016] | No | 9.59 | 8.81 | 10.87 |
| CIF-based model | Yes | 7.25 ± 0.06 | 6.69 ± 0.02 | 7.47 ± 0.06 |
We further evaluate our CIF-based model on the English ASR dataset Librispeech. Since we use word pieces as the output tokens, the acoustic boundaries between adjacent output tokens may be blurred. Even so, our model still shows competitive results that match most end-to-end models. Specifically, our model achieves a mean WER of 4.48% on test-clean and a mean WER of 12.62% on test-other, which are comparable to the results of 4.1% and 12.5% achieved by the current state-of-the-art LAS model without the powerful data augmentation by SpecAugment [Park et al., 2019] (which was released recently and will be applied to further boost our model performance in future work). Besides, our model still keeps the potential of instant speech recognition and utilizes a lower encoded frame rate (12.5 Hz), and thus may be more practical in various ASR scenarios. Compared with other soft and monotonic alignments that instantly recognize speech, the CIF-based model shows significant performance advantages. In particular, it achieves large absolute WER improvements over the result of Adaptive Computation Steps [Li et al., 2019a], which we reproduce using the same model setting, further demonstrating the superiority of the CIF, which locates and integrates at a finer time granularity.
| Model | Params | test-clean w/o LM | test-clean w/ LM | test-other w/o LM | test-other w/ LM |
|---|---|---|---|---|---|
| LAS + SpecAugment [Park et al., 2019] | - | 2.8 | 2.5 | 6.8 | 5.8 |
| Jasper [Li et al., 2019b] | 333 M | 3.86 | 2.95 | 11.95 | 8.79 |
| wav2letter++ [Vineel Pratap, 2018] | - | - | 3.44 | - | 11.24 |
| LAS + Deep bLSTM [Zeyer et al., 2018] | 150 M | 4.87 | 3.82 | 15.39 | 12.76 |
| ASG + Gated ConvNet [Liptchinsky et al., 2017] | 208 M | 6.7 | 4.8 | 20.8 | 14.5 |
| CTC + policy learning [Zhou et al., 2018a] | 75 M | - | 5.42 | - | 14.70 |
| CTC + i-SRU 1D-Conv [Park et al., 2018] | 36 M | - | 5.73 | - | 15.96 |
| ‘Soft’ and ‘monotonic’: | | | | | |
| ACS [Li et al., 2019a] | 67 M | 16.72 ± 0.07 | 16.11 ± 0.03 | 24.09 ± 0.25 | 22.66 ± 0.30 |
| Triggered Attention [Moritz et al., 2019] | - | 7.4 | 5.7 | 19.2 | 16.1 |
| CIF-based model | 67 M | 4.48 ± 0.09 | 3.70 ± 0.10 | 12.62 ± 0.09 | 10.90 ± 0.16 |
The achieved results on the two read datasets reflect that the CIF-based model has the potential to cover different languages with relatively clear or blurred token boundaries.
4.2.2 Ablation Study
| Model | test-clean | test-other |
|---|---|---|
| without scaling strategy | 6.03 ± 0.18 | 14.98 ± 0.08 |
| without quantity loss | 8.84 ± 0.76 | 15.49 ± 0.44 |
| without handling tail | 6.04 ± 0.02 | 14.11 ± 0.07 |
| without CTC loss | 4.96 ± 0.06 | 13.27 ± 0.16 |
| without autoregressive | 9.27 ± 0.18 | 21.56 ± 0.15 |
| Full Model | 4.48 ± 0.09 | 12.62 ± 0.09 |
In this section, we conduct an ablation study to evaluate the importance of the different methods in our CIF-based model. As shown in Table 3, all of the introduced methods have a positive impact on the modelling of the CIF-based model. The most crucial one is the auto-regression in the decoder, which explicitly captures the language dependency that is required in ASR. The quantity loss used to supervise the boundary positioning also greatly matters, since obvious performance degradation and instability appear after ablating it. The scaling strategy and the method of handling the tail are proposed to alleviate the problems brought by inaccurate positioning. In line with our expectations, they provide significant and stable improvements to the CIF-based model. With the joint action of these methods, the CIF-based model conducts better boundary positioning and shows significant improvements.
4.2.3 Results on Conversational Speech
We further evaluate our model on a telephone conversational speech dataset (HKUST). As shown in Table 4, our CIF-based model achieves a competitive CER on the spontaneous speech, which is not as well structured acoustically and linguistically as read speech and has less clear boundaries. The achieved performance further demonstrates the generalization of our CIF alignment mechanism.
5 Conclusion
In this work, we draw inspiration from the integrate-and-fire neuron model and propose Continuous Integrate-and-Fire (CIF), a soft and monotonic alignment mechanism that supports instant speech recognition by forwardly integrating the acoustic information and firing the integrated information once a token boundary is located. Since it mimics the integrate-and-fire neuron model, it locates and integrates at a finer time granularity (inside the encoded frame, or from another perspective, over the continuous speech underlying the frames), thus enabling the model to sufficiently utilize the acoustic information and perform an effective and concise calculation process with linear-time complexity.
In the future, we will further validate the performance of the CIF-based model on larger-scale ASR datasets and other monotonic sequence transduction tasks. Besides, we will continue to draw inspiration from biologically-inspired neuron models (e.g. the integrate-and-fire neuron family) to further boost the practicality of the CIF-based model. We also hope this work offers some useful ideas for the construction of biologically-plausible ASR systems and serves as a step toward research in this field.
References

- Graves et al.  Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pages 369–376. ACM, 2006.
- Graves  Alex Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
- Sak et al.  Hasim Sak, Matt Shannon, Kanishka Rao, and Françoise Beaufays. Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proc. of Interspeech, 2017.
- Chan et al.  William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4960–4964, 2016.
- Jaitly et al.  Navdeep Jaitly, Quoc V Le, Oriol Vinyals, Ilya Sutskever, David Sussillo, and Samy Bengio. An online sequence-to-sequence model using partial conditioning. In Advances in Neural Information Processing Systems, pages 5067–5075, 2016.
- Jusczyk  Peter W Jusczyk. How infants begin to extract words from speech. Trends in cognitive sciences, 3(9):323–328, 1999.
- Lapicque  Louis Lapicque. Recherches quantitatives sur l’excitation electrique des nerfs traitee comme une polarization. Journal de Physiologie et de Pathologie Generale, 9:620–635, 1907.
- Abbott  Larry F Abbott. Lapicque’s introduction of the integrate-and-fire model neuron (1907). Brain research bulletin, 50(5-6):303–304, 1999.
- Burkitt  Anthony N Burkitt. A review of the integrate-and-fire neuron model: I. homogeneous synaptic input. Biological cybernetics, 95(1):1–19, 2006.
- Hou et al.  Junfeng Hou, Shiliang Zhang, and Li-Rong Dai. Gaussian prediction based attention for online end-to-end speech recognition. In INTERSPEECH, pages 3692–3696, 2017.
- Tjandra et al.  Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Local monotonic attention mechanism for end-to-end speech and language processing. arXiv preprint arXiv:1705.08091, 2017.
- Chiu and Raffel  Chung-Cheng Chiu and Colin Raffel. Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382, 2017.
- Fan et al.  Ruchao Fan, Pan Zhou, Wei Chen, Jia Jia, and Gang Liu. An online attention-based model for speech recognition. arXiv preprint arXiv:1811.05247, 2018.
- Moritz et al.  Niko Moritz, Takaaki Hori, and Jonathan Le Roux. Triggered attention for end-to-end speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5666–5670. IEEE, 2019.
- Li et al. [2019a] Mohan Li, Min Liu, and Hattori Masanori. End-to-end speech recognition with adaptive computation steps. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6246–6250. IEEE, 2019a.
- Salazar et al.  Julian Salazar, Katrin Kirchhoff, and Zhiheng Huang. Self-attention networks for connectionist temporal classification in speech recognition. arXiv preprint arXiv:1901.10055, 2019.
- Dong et al.  Linhao Dong, Feng Wang, and Bo Xu. Self-attention aligner: A latency-control end-to-end model for asr using self-attention network and chunk-hopping. arXiv preprint arXiv:1902.06450, 2019.
- Dong et al.  Linhao Dong, Shiyu Zhou, Wei Chen, and Bo Xu. Extending recurrent neural aligner for streaming end-to-end speech recognition in mandarin. arXiv preprint arXiv:1806.06342, 2018.
- Kalchbrenner et al.  Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint arXiv:1610.00527, 2016.
- Kim et al.  Suyoun Kim, Takaaki Hori, and Shinji Watanabe. Joint ctc-attention based end-to-end speech recognition using multi-task learning. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 4835–4839. IEEE, 2017.
- Chiu et al.  Chung-Cheng Chiu, Tara N Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J Weiss, Kanishka Rao, Ekaterina Gonina, et al. State-of-the-art speech recognition with sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4774–4778. IEEE, 2018.
- Panayotov et al.  Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
- Du et al.  Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv preprint arXiv:1808.10583, 2018.
- Liu et al.  Yi Liu, Pascale Fung, Yongsheng Yang, Christopher Cieri, Shudong Huang, and David Graff. Hkust/mts: A very large scale mandarin telephone speech corpus. In Chinese Spoken Language Processing, pages 724–735. Springer, 2006.
- Povey et al.  Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. The kaldi speech recognition toolkit. Technical report, IEEE Signal Processing Society, 2011.
- Ko et al.  Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- Sennrich et al.  Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
- Girija  Sanjay Surendranath Girija. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. Software available from tensorflow. org, 2016.
- Ba et al.  Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Chorowski and Jaitly  Jan Chorowski and Navdeep Jaitly. Towards better decoding and language model integration in sequence to sequence models. arXiv preprint arXiv:1612.02695, 2016.
- Bengio et al.  Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
- Povey et al.  Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur. Purely sequence-trained neural networks for asr based on lattice-free mmi. In Interspeech, pages 2751–2755, 2016.
- Park et al.  Daniel S Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D Cubuk, and Quoc V Le. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
- Li et al. [2019b] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M Cohen, Huyen Nguyen, and Ravi Teja Gadde. Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288, 2019b.
- Vineel Pratap  Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. wav2letter++: The fastest open-source speech recognition system. CoRR, abs/1812.07625, 2018. URL https://arxiv.org/abs/1812.07625.
- Zeyer et al.  Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney. Improved training of end-to-end attention models for speech recognition. arXiv preprint arXiv:1805.03294, 2018.
- Liptchinsky et al.  Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Letter-based speech recognition with gated convnets. arXiv preprint arXiv:1712.09444, 2017.
- Zhou et al. [2018a] Yingbo Zhou, Caiming Xiong, and Richard Socher. Improving end-to-end speech recognition with policy learning. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5819–5823. IEEE, 2018a.
- Park et al.  Jinhwan Park, Yoonho Boo, Iksoo Choi, Sungho Shin, and Wonyong Sung. Fully neural network based speech recognition on mobile and embedded devices. In Advances in Neural Information Processing Systems, pages 10620–10630, 2018.
- Watanabe et al.  Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, et al. Espnet: End-to-end speech processing toolkit. arXiv preprint arXiv:1804.00015, 2018.
- Zhou et al. [2018b] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu. A comparison of modeling units in sequence-to-sequence speech recognition with the transformer on mandarin chinese. arXiv preprint arXiv:1805.06239, 2018b.