Log In Sign Up

CAT: CRF-based ASR Toolkit

by   Keyu An, et al.
Tsinghua University

In this paper, we present a new open source toolkit for automatic speech recognition (ASR), named CAT (CRF-based ASR Toolkit). A key feature of CAT is discriminative training in the framework of conditional random field (CRF), particularly with connectionist temporal classification (CTC) inspired state topology. CAT contains a full-fledged implementation of CTC-CRF and provides a complete workflow for CRF-based end-to-end speech recognition. Evaluation results on Chinese and English benchmarks such as Switchboard and Aishell show that CAT obtains the state-of-the-art results among existing end-to-end models with less parameters, and is competitive compared with the hybrid DNN-HMM models. Towards flexibility, we show that i-vector based speaker-adapted recognition and latency control mechanism can be explored easily and effectively in CAT. We hope CAT, especially the CRF-based framework and software, will be of broad interest to the community, and can be further explored and improved.


page 1

page 2

page 3

page 4


ESPnet: End-to-End Speech Processing Toolkit

This paper introduces a new open source platform for end-to-end speech p...

Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers

Automatic speech recognition systems have been largely improved in the p...

PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR

We present PyChain, a fully parallelized PyTorch implementation of end-t...

Deformable TDNN with adaptive receptive fields for speech recognition

Time Delay Neural Networks (TDNNs) are widely used in both DNN-HMM based...

NCRF++: An Open-source Neural Sequence Labeling Toolkit

This paper describes NCRF++, a toolkit for neural sequence labeling. NCR...

Code Repositories

1 Introduction

In addition to theories and algorithms, open source toolkits make substantial contributions to automatic speech recognition (ASR) technologies. A good ASR toolkit is an integration of good algorithms, efficient implementations and manageable code bases. In recent years, significant advancement has been made by the application of deep neural networks (DNNs). Recent toolkits using DNNs such as Kaldi

[1], Eesen [2], Wav2letter++ [3] and ESPnet [4] promote the development and application of ASR technologies, with varying pipelines and concerns. Toolkits based on DNN-HMM hybrid systems like Kaldi [1] and RASR [5] achieve the state-of-the-art performance in terms of recognition accuracy, usually measured by word error rate (WER) or character error rate (CER). In contrast, end-to-end systems111We follow the definition of end-to-end in [6]

: “flat-start training of a single DNN in one stage without using any previously trained models, forced alignments, or building state-tying decision trees.”

put simplicity of the training pipeline at a higher priority and usually are data-hungry.

In this paper we present CAT (CRF-based ASR Toolkit)222, which aims at combining the advantages of the two kinds of systems. CAT advocates discriminative training in the framework of conditional random field (CRF), particularly with but not limited to connectionist temporal classification (CTC) [7] inspired state topology. The recently developed CTC-CRF (namely CRF with CTC topology) [8] has achieved superior benchmarking performance with training data ranging from 100 to

1000 hours, while being end-to-end with simplified pipeline and being data-efficient in the sense that cheaply available language models (LMs) can be leveraged effectively with or without a pronunciation lexicon.

Major features of CAT are as follows.

1. CAT contains a full-fledged implementation of CTC-CRF. A non-trivial issue is that the gradient in training CRFs is the difference between empirical expectation and model expectation, which both can be efficiently calculated by the forward-backward algorithm 333Calculating the two expectations is similar to calculations over the numerator graph and denominator graph in LF-MMI respectively [9].. CAT modifies warp-ctc [10] for fast parallel calculation of the empirical expectation, which resembles the CTC forward-backward calculation [7]. CAT calculates the model expectation using CUDA C/C++ interface, drawing inspiration from Kaldi’s implementation of denominator forward-backward calculation.


CAT adopts PyTorch

[11] to build DNNs and do automatic gradient computation, and so inherits the power of PyTorch in handling DNNs.

3. CAT provides a complete workflow for CRF-based end-to-end speech recognition. CAT provides complete training and testing scripts for a number of Chinese and English benchmarks and all the experimental results reported in this paper can be readily reproduced. Detailed documentation and code comments are also provided in CAT, making it easy to get start and obtain state-of-the-art baseline results even for beginners of ASR.

4. Evaluation results on major benchmarks such as Switchboard and Aishell show that CAT obtains the state-of-the-art results among existing end-to-end models with less parameters, and is competitive compared with the hybrid DNN-HMM models.

5. Towards flexibility, we show that i-vector based speaker-adapted recognition and latency control mechanism can be explored easily and effectively in CAT.

2 Related work

Currently, speech recognition models can be roughly divided into two categories: DNN-HMM hybrid models and end-to-end models. One of the feature of Kaldi is its efficient implementation of the lattice-free maximum-mutual-information (LF-MMI) model, which is a typical hybrid DNN-HMM model. The pipeline consists of initial GMM-HMM training, and iterative context tree building and forced alignment. End-to-end LF-MMI [6] (EE-LF-MMI) has been developed, with two versions of using mono-phones or tree-free bi-phones. The differences between EE-LF-MMI and CTC-CRF are detailed in [8]. It is shown in our experiments that CTC-CRF mono-phone system matches EE-LF-MMI bi-phone system.

End-to-end models aim to directly map the speech sequence (raw audio, spectrum features, etc.) to the label sequence (words, phonemes, etc.) with minimum intermediate components. Three main classes of end-to-end models are based on CTC [7], attention based Seq2Seq [12] and RNN-transducer (RNN-T) [13]

respectively. As representative toolkits, Eesen is based on regular CTC, and ESPnet relies on attention and adopts hybrid CTC/attention. Wav2letter++ is known for its efficiency with fast tensor operations and the use of pure convolutional neural networks in both acoustic and language models.

End-to-end models have received increasing interests and achieved performances close to the hybrid models on a few benchmarks, but still faces a number of issues. First, there is still a pronounced gap between attention end-to-end models and hybrid models in terms of recognition accuracy [14]. Second, the recognition accuracy of the hybrid models can be further boosted with classical speech recognition techniques such as i-vector based speaker-adapted recognition [15], while the application of these techniques to end-to-end systems have not been thoroughly explored. A third issue sometimes overlooked for end-to-end ASR toolkits is the demand for low latency recognition which is crucial for streaming ASR applications. CTC-based Eesen uses bidirectional models by default. In attention based end-to-end systems (e.g. ESPnet), bidirectional encoder and global soft attention present inherent difficulty for low latency. There are recent efforts such as using monotonic chunkwise attention (MoChA) [16], Latency-controlled BLSTM [17]. Online recognition with ESPnet has been recently studied [18]. Wav2letter++ is based solely on convolutional neural networks, which use restricted future context and realize low latency. However, in order to model long-range dependencies, the neural network in [19] is extremely deep and big (with 100 million parameters). Remarkably, time-delay neural networks (TDNNs) with interleaving LSTM layers (TDNN-LSTM) [20] is used in Kaldi to limit the latency while maintain the recognition accuracy.

Finally, for programming languages used in toolkits, Kaldi core primarily uses C++ which is efficient but not flexible in supporting various rapid developments in DNNs. PyTorch-Kaldi [21] builds the neural networks with PyTorch, and PyKaldi [22] allows users to interact with Kaldi and OpenFst via Python language.

3 CRF-based ASR

CAT conducts discriminative training of acoustic model (AM) based on conditional maximum likelihood [8]:


where is the speech feature sequence and is the label sequence. The label could be phone, character, word-piece and etc. Note that and are in different lengths and usually not aligned in training data. The alignment could be handled implicitly by attention or explicitly by introducing a hidden state sequence . CRF-based ASR takes the later approach, which is also taken in DNN-HMM hybrid, CTC and RNN-T. When introducing , state topology refers to the state transition structure in , which basically defines a mapping that maps a state sequence to a unique label sequence . Here denote the set of all sequences over the alphabet of labels, and similarly for the alphabet of states. It can be seen that HMM, CTC, and RNN-T implement different topologies444 These topologies can all be used in CRF-based ASR, which would be our future work. CAT toolkit currently supports CRF with CTC topology.. CTC topology defines a mapping that removes consecutive repetitive labels and blanks, with defined by adding a special blank symbol blk to . CTC topology is appealing, since it allows a minimum size of and avoids the inclusion of silence symbol, as discussed in [8].

Then the posteriori of can be defined through the posteriori of as follows:


And the posteriori of can be further defined by a CRF:


Here denotes the potential function of the CRF, defined as:

where . defines the node potential, calculated from the bottom DNN.

defines the edge potential, often realized by an n-gram LM of labels and, for reasons to be clear in the following, referred to as the denominator n-gram LM. Remarkably, regular CTC suffers from the conditional independence between the states in

. In contrast, by incorporating into the potential function in CRF-based ASR, this drawback is naturally avoided. Combing Eq. (1)-(3) yields the CRF objective function specifically as:


The gradient of the above objective function involves two gradients calculated from the numerator and denominator respectively, which essentially correspond to the two terms of empirical expectation and model expectation as commonly found in estimating CRFs. Similarly to LF-MMI, both terms can be efficiently obtained via the forward-backward algorithm, as further detailed in Section

4.3. Especially, the denominator calculation involves running the forward-backward algorithm over the denominator graph , represented as a weighted finite sate transducer (WFST). is an composition of the CTC topology WFST and the WFST representation of the n-gram LM of labels, which is called the denominator n-gram LM, to be differentiated from the word-level LM in decoding.

4 Implementation

CAT consists of separable AM and LM, which meets our rationale to be data-efficient. The AM training workflow in CAT is shown in the Fig 1. CAT uses SRILM for LM training, and some code from Eesen for decoding graph compiling and WFST based decoding.

4.1 Kaldi style feature extraction

CAT integrates Kaldi into the data preparation and feature extraction steps. Kaldi serves as a good reference, with complete modules for building ASR systems, covering steps from feature extraction to decoding and evaluation. Besides, the code organization of CAT also follows the Kaldi manner, which means 1) high-level workflows are expressed in shell scripts, and 2) the source code and the examples are separated, which facilitates the reuse of source code between exemplar systems over different speech datasets.

Figure 1: The AM training pipeline in CAT. The blue block stands for Kaldi style data preparation, the orange for neural network training with PyTorch, and the green for our C++ implementation of CRF objective function calculation.

4.2 PyTorch in acoustic modeling

In acoustic modeling, we use PyTorch to build various neural networks with Python language. PyTorch supports dynamic neural network building, facilitating the exploration of complex neural network architectures. We provide the implementation of commonly used neural networks in ASR, e.g. LSTM with its variants and TDNN-LSTM. A wide variety of new elements in deep learning applications such as dropout

[23] is available in PyTorch, with which we can easily incorporate those new elements in building ASR systems. Besides, PyTorch provides support for multiprocess parallelism on one or more machines, enabling us to conduct experiments with large-scale training data.

4.3 Numerator and Denominator calculation

The numerator calculation is similar to the regular CTC computation. We use warp-ctc [10] to calculate the numerator objective function in Eq. (4). The input of the warp-ctc is modified to be the log-softmax output of the bottom neural network, rather than the linear layer output. Note that for label sequence , also appears in the numerator but behaves like an constant. So is pre-calculated based on the denominator n-gram LM and saved, and then applied in training.

The denominator calculation is implemented via CUDA C/C++ on GPUs, drawing inspiration from LF-MMI in Kaldi. The main differences are that 1) our implementation is in the log domain, which is more stable than in the original numeric domain as in Kaldi. 2) Varying lengths of utterances are supported, rather than modifying the utterance length to one of 30 lengths; and we do not use the leaky-HMM trick.

4.4 WFST-based decoding

We adopt the WFST-based decoding in Eesen [2]. Notably, as [8] pointed out, the CTC topology WFST in Eesen (denoted by T.fst) is not correct with the inappropriate use of blk symbol, making the decoding graph mistakenly larger. The mistake is corrected in CAT.

As mentioned before, CAT adopts CTC topology instead of HMM topology, which benefits the decoding process in two aspects. First, the size of the symbol alphabet of CTC topology, which equals to the size of plus one (for blk), is far less than the alphabet size of HMM topology used in [6]. Second, as figured out in [7], the output of CTC is typically a series of spikes, separated by lots of ‘blanks’. Ignoring these frames with high blank scores does not affect beam decoding much, but the time cost will be reduced significantly. It is found in our experiments that ignoring frames with blank scores greater than 0.7 leads to 0.1% reduction in recognition accuracy on Switchboard, but the decoding time is 30% less than not ignoring. We also explore the application of the RNN language model in lattice rescoring.

5 Experiment Settings

The experiment consists of two parts. In the first part, we introduce the results on major benchmarks using state-of-the-art bidirectional recurrent networks. The second part presents our exploration on low-latency unidirectional models for streaming ASR.

5.1 Setup for benchmarking experiment

We compare the performance of CAT with state-of-the-art ASR systems on three open source benchmarks: 80hr WSJ, 170hr Aishell and 300hr Switchboard. Speed perturbation is used on all datasets to augment the training data. Unless otherwise stated, 40 dimension filter bank with delta and delta-delta features are used as input. The features are normalized via mean subtraction and variance normalization per utterance, and sampled by a factor of 3.

Unless otherwise stated, the acoustic model is two blocks of VGG layers followed by a 6-layer bidirectional LSTM (BLSTM) similar to [24]

. We apply 1D max-pooling to the feature maps produced by VGG blocks on the frequency dimension only, as the input features have been sampled in time-domain and we found that max-pooling along time dimension will deteriorate the recognition accuracy. The first VGG block has 3 input channels corresponding to spectral features, delta, and delta delta features. The BLSTM has 320 hidden units per direction for each layer and the total number of parameters is 16M, much smaller than most end-to-end models. For the training, a dropout

[23]probability of 50% is applied to the LSTM layer to prevent overfitting. Following [8], a CTC loss with a weight is combined with the CRF loss to help convergence. We set = 0.1 by default and we found in practice that the smaller is, the better the final result will be, but the convergence time will be a bit longer.

On Switchboard we also evaluate our model with speaker adaptation [15] and RNNLM lattice rescoring [25]. Speaker adaptation is an important component in state-of-the-art hybrid DNN-HMM recognizers, but end-to-end models tend to exclude speaker adaptation techniques to maintain the end-to-end training manner. Here we offer promising results of improvements from speaker adaptation techniques on end-to-end acoustic model. Specifically, we extract 100-dimensional i-vectors from MFCC features as in [26]. Thus in Switchboard experiments, the 120-dimensional MFCCs++ appended with the i-vector are used as the input for neural networks. We use MFCC features here to stay consistent with the i-vector extraction. The acoustic model is a 6-layer BLSTM with 220 input units. The VGG blocks are not used here.

5.2 Setup for latency control experiment

The key to achieving high-accuracy online speech recognition is to exploit limited but sufficient future information with appropriate mechanisms. For example, the model in [27] enables online-recognition by moving a window over the input stream of the bidirectional RNN, while [10] applies a modification to unidirectional RNN by employing a special layer named row convolution to provide a small portion of future information. We explore two methods for low latency. The first method is the use of VGG blocks mentioned previously, as the convolution layer can provide certain amount of future information. The second method is to use an structure similar to TDNN-LSTM proposed in [20]

. Results for standard unidirectional and bidirectional LSTM with the same number of layers and hidden units per layer (as in VGG-LSTM) are also provided for comparison. Offline cepstral mean and variance normalization (CMVN) is not used in online experiment because it requires a full sentence input to extract the mean and variance information. Instead, we use batch normalization to speed up convergence.

6 Experimental results

6.1 Results on major benchmarks

The WER results on WSJ are shown in Table 1. The evaluation dataset contains the dev93 and eval92. On eval92, our performance is comparable with hybrid LF-MMI and flat-start EE-LF-MMI, and better than all other end-to-end models. On the more difficult dev93 dataset, CAT is only slightly worse ( 7%) than hybrid LF-MMI which uses bi-phone context dependency modeling.

Platforms Unit LM dev93 eval92
EE-LF-MMI [6] mono-phone 4-gram 6.3% 3.1%
EE-LF-MMI [6] bi-phone 4-gram 6.0% 3.0%
LF-MMI [6] mono-phone 4-gram 6.0% 3.0%
LF-MMI [6] bi-phone 4-gram 5.3% 2.7%
Eesen [2] mono-phone 3-gram 10.87% 7.28%
ESPnet [4] mono-char LSTM 12.4% 8.9%
Wav2letter++ [19] mono-char 4-gram 9.5% 5.6%
Wav2letter++ [19] mono-char ConvLM 6.8% 3.5%
CAT mono-phone 4-gram 5.7% 3.2%
Table 1: WSJ results

The results on Aishell are shown in Table 2. We use CER to evaluate the performance on mandarin benchmark by convention. We do not use pitch feature as in [1] [4] [28], because the pitch feature is not suitable for composing a 3-channel feature map together with the fbank feature. The result shows that CAT obtains state-of-the-art performance on Aishell dataset, the CER is much better than other end-to-end models and even the hybrid Kaldi-chain model.

Model Unit LM Test
Kaldi (chain) [1] tri-phone 3-gram 7.43%
Kaldi (nnet3) [1] tri-phone 3-gram 8.64%
ESPnet [4] mono-char RNNLM 8.0%
attention [28] mono-char RNNLM 18.7%
attention [29] mono-char RNNLM 8.71%
CAT mono-phone 3-gram 6.34%
  • Using speaker adaptation with i-vectors

Table 2: Aishell results

The WER results on Switchboard are shown in Table 3. Eval2000 consisting of Swichboard evaluation dataset (SW) and Callhome evaluation dataset (CH) is used for evaluation. On Switchboard, our mono-phone system achieves a close performance to bi-phone EE-LF-MMI system and the results are significantly better than other end-to-end models and mono-phone hybrid LF-MMI system. Compared with bi-phone hybrid LF-MMI, our model is only slightly worse (less than 5%). The experiment also shows that the performance of our model in CAT can be further enhanced by techniques such as speaker adaptation and RNNLM rescoring.

Model Unit LM SW CH
EE-LF-MMI [6] mono-phone 4-gram 11.0% 20.7%
EE-LF-MMI [6] bi-phone 4-gram 9.8% 19.3%
LF-MMI [6] mono-phone 4-gram 10.7% 20.3%
LF-MMI [6] bi-phone 4-gram 9.5% 18.6%
Eesen [2] mono-phone 3-gram 14.8% 26.0%
Attention [30] subword No LM 13.5% 27.1%
Seq2Seq [12] subword LSTM 11.8% 25.7%
CAT mono-phone 4-gram 10.0% 19.2%
CAT mono-phone 4-gram 9.7% 18.8%
CAT mono-phone RNNLM 8.8% 17.4%
  • The experiment uses speaker adaptation with i-vectors so the training is no longer end-to-end.

Table 3: Switchboard results

We observe that the performance of our model can be further improved simply by increasing the number of the layers (6 to 7, e.g.) or the hidden units per layer (320 to 512, e.g.). Here we only list the results of the neural networks with manageable parameter sizes.

Dataset Model Future context Result
WSJ LSTM 0 6.78
BLSTM unlimited 5.9
VGG-LSTM 4 6.27
TDNN-LSTM 7 6.42
Aishell LSTM 0 8.99
BLSTM unlimited 7.30
VGG-LSTM 4 7.43
TDNN-LSTM 7 8.16
LSTM 0 17.7
BLSTM unlimited 14.6
VGG-LSTM 4 17.0
TDNN-LSTM 13 16.4
Table 4: Latency control experiment results.The experiment is evaluated on dev93 (WER) for WSJ, Aishell test set (CER) and Eval2000 (WER) for Switchboard.

6.2 Experiment for latency control models

The results for latency control models are shown in Table 4. The future context is measured by the number of future frames (after sampling) used at current frame. As we use 3*3 convolution in the VGG block and -1,0,1 layer-wise context in the TDNN layer, the future context actually depends on how many convolution or TDNN layers we use. We find that the use of VGG blocks and TDNN can alleviate the performance degradation caused by changing the bidirectional model into unidirectional models, without significant increase of the model parameters. Specifically, our unidirectional model is comparable with EE-LF-MMI on WSJ and Kaldi-chain model on Aishell. On Switchboard, our unidirectional model (16.4% WER) is weaker than EE-LF-MMI, but still far better than other end-to-end models (with offline birectional networks) and the online bidirectional network proposed in [27] (17.3% WER on Eval2000).

7 Conclusion

This paper introduces an end-to-end ASR toolkit - CAT, with the main features of CRF based discriminative training (especially with CTC topology), integrating PyTorch for DNN development, a complete workflow with reproducible examples, and superior results. CAT obtains the the state-of-the-art results among existing end-to-end models on several major benchmarks, and is comparable with the state-of-the-art hybrid systems. To show flexibility, we explore latency control models with various DNN architectures, which enable CAT to be used for streaming ASR. We hope CAT, especially the CRF-based framework and software, will be of broad interest to the community, and can be further explored and improved, e.g. the implementation of a more general CRF training framework with different topologies, and the application in more ASR tasks.


  • [1] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Vesel, “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011, pp. 1764–1772.
  • [2] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep rnn models and wfst-based decoding,” in Proc. ASRU, 2015, pp. 167–174.
  • [3] Qiantong Xu Jeff Cai Jacob Kahn Gabriel Synnaeve Vitaliy Liptchinsky Ronan Collobert Vineel Pratap, Awni Hannun, “wav2letter++: The fastest open-source speech recognition system,” arXiv preprint arXiv:1609.03193, 2018.
  • [4] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
  • [5] S. Wiesler, A. Richard, P. Golik, R. Schlüter, and H. Ney, “RASR/NN: The RWTH neural network toolkit for speech recognition,” in Proc. ICASSP, 2014, pp. 3281–3285.
  • [6] Hossein Hadian, Hossein Sameti, Daniel Povey, and Sanjeev Khudanpur, “Flat-start single-stage discriminatively trained HMM-based models for ASR,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 1949–1961, 2018.
  • [7] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber,

    “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,”

    in Proc. ICML, 2006, pp. 369–376.
  • [8] Hongyu Xiang and Zhijian Ou, “CRF-based single-stage acoustic modeling with CTC topology,” in Proc. ICASSP, 2019, pp. 5676–5680.
  • [9] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Proc. Interspeech, 2016, pp. 2751–2755.
  • [10] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang Bai, Eric Battenberg, and et al., “Deep speech 2 : End-to-end speech recognition in english and mandarin,” in ICML, 2015.
  • [11] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in PyTorch,” in NIPS Autodiff Workshop, 2017.
  • [12] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney,

    “Improved training of end-to-end attention models for speech recognition,”

    in Proc. Interspeech, 2018, pp. 7–11.
  • [13] Alex Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
  • [14] Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Rwth asr systems for librispeech: Hybrid vs attention,” in Proc. Interspeech, 2019, pp. 231–235.
  • [15] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,” in Proc. ASRU, 2013, pp. 55–59.
  • [16] Chung-Cheng Chiu and Colin Raffel, “Monotonic chunkwise attention,” 2018.
  • [17] Chen Kai and Huo Qiang, “Training deep bidirectional lstm acoustic model for lvcsr by a context-sensitive-chunk bptt approach,” IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 24, no. 7, pp. 1185–1193, 2016.
  • [18] Haoran Miao, Gaofeng Cheng, Pengyuan Zhang, Ta Li, and Yonghong Yan, “Online hybrid ctc/attention architecture for end-to-end speech recognition,” in Proc. Interspeech, 2019, pp. 2623–2627.
  • [19] Neil Zeghidour, Qiantong Xu, Vitaliy Liptchinsky, Nicolas Usunier, Gabriel Synnaeve, and Ronan Collobert, “Fully convolutional speech recognition,” arXiv preprint arXiv:1812.06864, 2018.
  • [20] V. Peddinti, Y. Wang, D. Povey, and S. Khudanpur, “Low latency acoustic modeling using temporal convolution and LSTMs,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 373–377, 2018.
  • [21] M. Ravanelli, T. Parcollet, and Y. Bengio, “The Pytorch-kaldi speech recognition toolkit,” in Proc. ICASSP, 2019, pp. 6465–6469.
  • [22] Dogan Can, Victor R. Martinez, Pavlos Papadopoulos, and Shrikanth S. Narayanan, “Pykaldi: A Python wrapper for Kaldi,” in Proc. ICASSP, 2018, pp. 5889–5893.
  • [23] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,”

    Journal of Machine Learning Research

    , vol. 15, pp. 1929–1958, 2014.
  • [24] Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Proc. Interspeech, 2017, pp. 949–953.
  • [25] H. Xu, T. Chen, D. Gao, Y. Wang, K. Li, N. Goel, Y. Carmiel, D. Povey, and S. Khudanpur, “A pruned rnnlm lattice-rescoring algorithm for automatic speech recognition,” in Proc. ICASSP, 2018, pp. 5929–5933.
  • [26] Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” in Proc. Interspeech, 2015, pp. 3214–3218.
  • [27] Albert Zeyer, Ralf Schlüter, and Hermann Ney, “Towards online-recognition with deep bidirectional LSTM acoustic models,” in Proc. Interspeech, 2016, pp. 3424–3428.
  • [28] M. Li, M. Liu, and H. Masanori, “End-to-end speech recognition with adaptive computation steps,” in Proc. ICASSP, 2019, pp. 6246–6250.
  • [29] Changhao Shan, Chao Weng, Guangsen Wang, Dan Su, Min Luo, Dong Yu, and Lei Xie, “Component fusion: Learning replaceable language model component for end-to-end speech recognition system,” in Proc. ICASSP, 2019, pp. 5361–5635.
  • [30] Albert Zeyer, Tamer Alkhouli, and Hermann Ney, “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” in Proceedings of ACL 2018, System Demonstrations, 2018, pp. 128–133.