Multiple-hypothesis CTC-based semi-supervised adaptation of end-to-end speech recognition

03/29/2021 · by Cong-Thanh Do, et al.

This paper proposes an adaptation method for end-to-end speech recognition. In this method, multiple automatic speech recognition (ASR) 1-best hypotheses are integrated in the computation of the connectionist temporal classification (CTC) loss function. The integration of multiple ASR hypotheses helps alleviate the impact of errors in those hypotheses on the computation of the CTC loss. When the method is applied in semi-supervised adaptation scenarios where part of the adaptation data does not have labels, the CTC loss is computed from different ASR 1-best hypotheses obtained by decoding the unlabeled adaptation data. Experiments are performed in clean and multi-condition training scenarios, where the CTC-based end-to-end ASR systems are trained on Wall Street Journal (WSJ) clean training data and CHiME-4 multi-condition training data, respectively, and tested on Aurora-4 test data. The proposed adaptation method yields 6.6% and 5.8% relative word error rate (WER) reductions in the clean and multi-condition training scenarios, respectively, compared to a baseline system which is adapted with back-propagation fine-tuning using only the part of the adaptation data that has manual transcriptions.


1 Introduction

Mismatch between training and test data is common when using automatic speech recognition (ASR) systems in realistic conditions. Among other robustness methods, adaptation algorithms developed for ASR aim at alleviating this mismatch. Adapting large and complex models, especially deep neural network (DNN)-based models, is challenging, typically with a small amount of adaptation (target) data and without explicit supervision [1].

Adaptation algorithms use adaptation data, which should be matched to the target test data, to adapt the trained ASR system and close the gap between training and test. The transcriptions, or labels, of the adaptation data are required in supervised adaptation. However, manual transcriptions are not always available because obtaining them for a large amount of data is costly. When manual transcriptions are not available, ASR hypotheses, or "pseudo-labels", can be used in place of manual transcriptions. The ASR hypotheses are obtained by decoding the adaptation data using the trained (non-adapted) system. When ASR hypotheses are used, inaccurate supervision is introduced because automatic transcriptions are typically not error-free.

End-to-end speech recognition uses a single neural network architecture within the deep learning framework to perform the speech-to-text task. In the training of end-to-end speech recognition systems, the need for prior alignments between acoustic frames and output symbols is eliminated thanks to the use of training criteria such as the attention mechanism [2] or the connectionist temporal classification (CTC) loss function [11].

Connectionist temporal classification (CTC) is the process of automatically labeling unsegmented data sequences using a neural network [10]. Training a neural network with the CTC loss function thus does not require prior alignments between the input and target sequences. When a neural network is trained using the CTC loss with characters as output symbols, for a given transcription of the input sequence there are as many possible alignments as there are different ways of separating the characters with blanks. Because the exact character sequence corresponding to the input sequence, derived from the transcription, is not known, the sum over all possible character sequences is performed [11]. In semi-supervised or unsupervised adaptation where ASR hypotheses are used, the computation of the CTC loss can be adversely affected because the transcriptions, which are in essence ASR hypotheses, contain errors.

In this paper, we propose an adaptation method for CTC-based end-to-end speech recognition in which the impact of transcription errors on the CTC loss computation is alleviated by combining CTC losses computed from different ASR 1-best hypotheses. In the present paper, the ASR 1-best hypotheses are obtained by using ASR systems with different acoustic features to decode the unlabeled adaptation data. We show the effectiveness of the proposed adaptation method in semi-supervised adaptation scenarios where the CTC-based end-to-end speech recognition systems are trained either on clean training data from the Wall Street Journal (WSJ) corpus [19] or on multi-condition training data of the CHiME-4 corpus [28], while being evaluated on the test data of the Aurora-4 corpus [18].

The paper is organized as follows. Section 2 presents related works. The proposed adaptation method, which uses multiple ASR hypotheses and a combination of CTC losses, is introduced in Section 3. Sections 4 and 5 describe the ASR systems and data, and the adaptation experiments, respectively. Results are presented in Section 6. Finally, Section 7 concludes the paper.

2 Related works

Adaptation of end-to-end speech recognition has been investigated in a number of studies [23, 14, 17, 3, 15, 27, 24, 6, 4]. In [14], adaptation of the end-to-end model was achieved by introducing Kullback-Leibler divergence (KLD) regularization and a multi-task learning (MTL) criterion into the CTC loss function; the training criteria are linear combinations of the standard CTC loss and the KLD or MTL criterion. Multiple hypotheses were previously used in cross-system acoustic model adaptation, where the transcriptions for adaptation were generated by several systems built with various phoneme sets or acoustic front-ends [9, 26].

In the present work, a new loss function created by combining the CTC losses computed from different ASR 1-best hypotheses is used during adaptation. The ASR 1-best hypotheses are obtained by decoding the unlabeled adaptation data with ASR systems using different acoustic features.

3 Proposed adaptation method

3.1 Training of CTC-based end-to-end speech recognition

Given a $T$-length acoustic feature vector sequence $X = (\mathbf{x}_t \mid t = 1, \dots, T)$, where $\mathbf{x}_t \in \mathbb{R}^D$ is a $D$-dimensional feature vector at frame $t$, and a transcription $C = (c_l \mid l = 1, \dots, L)$ which consists of $L$ characters $c_l \in \mathcal{U}$, where $\mathcal{U}$ is a set of distinct characters, the standard CTC loss function used during the training of the neural network is defined as follows:

$$\mathcal{L}_{\mathrm{CTC}}(\theta) = -\ln P(C \mid X; \theta) \qquad (1)$$

where $\theta$ denotes the network parameters. The network is trained to minimize $\mathcal{L}_{\mathrm{CTC}}$. In equation (1), $C$ is the transcription of $X$, which can be either a manual transcription or an ASR hypothesis. In the present work, the ASR systems are trained using manual transcriptions in supervised training mode. The convolutional neural network (CNN) [13] - bidirectional long short-term memory (BLSTM) [12] architecture is used.

The CTC loss function in equation (1) can be computed thanks to the introduction of the CTC path $Z = (z_t \mid t = 1, \dots, T)$, which forces the output character sequence to have the same length as the input feature sequence by adding a blank symbol as an additional label and allowing repetition of labels [11]. The CTC loss is thus computed by summing over all possible CTC paths $\mathcal{Z}(C)$ expanded from $C$:

$$P(C \mid X; \theta) = \sum_{Z \in \mathcal{Z}(C)} P(Z \mid X; \theta) \qquad (2)$$
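For concreteness, the following minimal sketch (an illustration, not the authors' ESPnet implementation; the shapes and vocabulary size are assumptions) shows how a loss of the form of equation (1) can be computed with PyTorch's torch.nn.CTCLoss, which performs the sum over CTC paths of equation (2) internally.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 200 frames, batch of 4, 30 characters plus a blank at index 0.
T, batch, num_classes = 200, 4, 31
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# log_probs: (T, batch, num_classes) log-softmax outputs of the acoustic network.
log_probs = torch.randn(T, batch, num_classes, requires_grad=True).log_softmax(dim=-1)

# targets: character indices (1..30) of the transcriptions C, one row per utterance.
targets = torch.randint(1, num_classes, (batch, 50), dtype=torch.long)
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), 50, dtype=torch.long)

# L_CTC = -ln P(C | X; theta); the sum over CTC paths (equation (2)) is done internally.
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # gradients for back-propagation fine-tuning
```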

3.2 Multiple-hypothesis CTC-based adaptation

Given adaptation data, among other adaptation methods mentioned in section 2, the CTC-based end-to-end speech recognition system can be adapted by using the back-propagation algorithm [22] to fine-tune the trained neural network [6]. During the minimization of the CTC loss function using stochastic gradient descent [21], the parameters of the neural network are updated. When the manual transcriptions of the adaptation data are not available, ASR 1-best hypotheses obtained by using the trained neural network to decode the adaptation data can be used in the adaptation process.

In this paper, we propose to integrate multiple ASR 1-best hypotheses in the computation of the CTC loss function during adaptation, when the manual transcriptions are not available, as follows:

$$\mathcal{L}_{\mathrm{MH\text{-}CTC}}(\theta) = -\sum_{m=1}^{M} \ln P(\hat{C}_m \mid X; \theta) \qquad (3)$$

where $\hat{C}_1, \dots, \hat{C}_M$ are the 1-best hypotheses obtained by decoding the unlabeled adaptation data using $M$ different trained neural networks. By combining multiple 1-best hypotheses in the computation of the CTC loss, the impact of the errors in the ASR hypotheses on the computation of the CTC loss function could be alleviated. Using a property of the logarithm, equation (3) can be rewritten as follows:

$$\mathcal{L}_{\mathrm{MH\text{-}CTC}}(\theta) = -\ln \prod_{m=1}^{M} P(\hat{C}_m \mid X; \theta) = -\ln \prod_{m=1}^{M} \sum_{Z_m \in \mathcal{Z}(\hat{C}_m)} P(Z_m \mid X; \theta) \qquad (4)$$

where $Z_m$ is a CTC path linking the 1-best hypothesis $\hat{C}_m$ and the acoustic feature sequence $X$.

In the computation of the new CTC loss in the present paper, the different ASR 1-best hypotheses are obtained by decoding the adaptation data with different ASR systems. Different ASR hypotheses could be obtained by other means, for instance by using the N-best hypotheses from one decoding pass; this possibility is not explored in the present paper. Also, no confidence-based filtering [1] is applied to the ASR hypotheses. In the experiments of the present paper, the use of two systems ($M = 2$) is explored.
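Since the loss in equation (3) is a sum of per-hypothesis CTC losses, it can be sketched by summing standard CTC losses computed against each set of 1-best hypotheses for the same acoustic input. The helper below is a sketch under that reading; the function name, the data layout, and the hyp_fbank/hyp_ste placeholders are assumptions, not the paper's code.

```python
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def multi_hypothesis_ctc_loss(log_probs, input_lengths, hypotheses):
    """Equation (3): L_MH-CTC = -sum_m ln P(C_m | X), i.e. the sum of standard
    CTC losses, one per 1-best hypothesis set C_1, ..., C_M.

    log_probs:     (T, batch, num_classes) log-softmax network outputs
    input_lengths: (batch,) number of valid frames per utterance
    hypotheses:    list of (targets, target_lengths) pairs, one per system
    """
    loss = 0.0
    for targets, target_lengths in hypotheses:
        loss = loss + ctc(log_probs, targets, input_lengths, target_lengths)
    return loss

# Usage (M = 2, as in the paper): hyp_fbank and hyp_ste would hold the 1-best
# hypotheses decoded by the FBANK-based and STE-based systems, respectively.
# loss = multi_hypothesis_ctc_loss(log_probs, input_lengths, [hyp_fbank, hyp_ste])
```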

3.3 Analysis

We analyze the new loss function for the simplified case where two 1-best hypotheses are used ($M = 2$). Equation (4) becomes:

$$\mathcal{L}_{\mathrm{MH\text{-}CTC}}(\theta) = -\ln \left( \sum_{Z_1 \in \mathcal{Z}(\hat{C}_1)} P(Z_1 \mid X; \theta) \times \sum_{Z_2 \in \mathcal{Z}(\hat{C}_2)} P(Z_2 \mid X; \theta) \right) = -\ln \sum_{Z_1 \in \mathcal{Z}(\hat{C}_1)} \sum_{Z_2 \in \mathcal{Z}(\hat{C}_2)} P(Z_1 \mid X; \theta)\, P(Z_2 \mid X; \theta) \qquad (5)$$

where $Z_1$ and $Z_2$ are among the CTC paths linking the 1-best hypotheses $\hat{C}_1$ and $\hat{C}_2$, respectively, with the acoustic feature sequence $X$. From equation (5), it can be seen that a probability $P(Z_1 \mid X; \theta)$, computed by using the CTC path $Z_1$, is multiplied with all the probabilities $P(Z_2 \mid X; \theta)$, $Z_2 \in \mathcal{Z}(\hat{C}_2)$. This weighting, based on the probabilities computed from the different CTC paths in $\mathcal{Z}(\hat{C}_2)$, could alleviate the impact of uncertainty in the CTC paths $Z_1$, caused by transcription errors in $\hat{C}_1$, on the computation of the CTC loss $\mathcal{L}_{\mathrm{MH\text{-}CTC}}$.

4 Speech recognition system and data

The effectiveness of the proposed adaptation method is evaluated in semi-supervised adaptation scenarios where only part of the adaptation data has manual transcriptions. This scenario is common when, to reduce cost, manual transcriptions are obtained only for a small portion of the adaptation data rather than for all of it. The end-to-end ASR systems are trained using the standard CTC loss function (see equation (1)). The proposed CTC loss function is used only in the adaptation with the proposed multiple-hypothesis CTC-based adaptation method; all other adaptations use the standard CTC loss function of equation (1).

4.1 CTC-based end-to-end speech recognition systems

4.1.1 Acoustic features

A CNN-BLSTM neural network is trained with the CTC loss to map acoustic feature sequences to character sequences. A baseline system is trained using 40-dimensional log-Mel filter-bank (FBANK) features [16] as acoustic features. The FBANK features are augmented with 3-dimensional pitch features [29, 20]. Delta and acceleration features are appended to the static features. The feature extraction of the baseline system was performed using the standard feature extraction recipe of the Kaldi toolkit [20].

To obtain additional ASR hypotheses, another system is trained to decode the unlabeled adaptation data. This system is trained using 40-dimensional subband temporal envelope (STE) features [5] together with 3-dimensional pitch features. As in the system trained with FBANK features, delta and acceleration features are included. STE features track energy peaks in perceptual frequency bands, which reflect the resonant properties of the vocal tract. These features have been shown to be on par with the standard FBANK features in various speech recognition scenarios [8, 7]. FBANK and STE features are also complementary, and combining systems using these features yielded significant word error rate (WER) reductions compared to a single system [5, 8, 7].
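As a rough illustration of such a front-end (a sketch using librosa rather than the Kaldi recipe actually used in the paper, with 25 ms frames, a 10 ms shift, and the 3-dimensional pitch stream omitted):

```python
import numpy as np
import librosa

def extract_fbank_features(wav_path, sr=16000, n_mels=40):
    """Compute 40-dim log-Mel filter-bank (FBANK) features with delta and
    acceleration, loosely mirroring the Kaldi-style front-end described
    in the paper (pitch features are not included in this sketch)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)  # 25 ms / 10 ms frames
    logmel = np.log(mel + 1e-10)                      # static FBANK features
    delta = librosa.feature.delta(logmel)             # first-order derivatives
    ddelta = librosa.feature.delta(logmel, order=2)   # acceleration
    feats = np.concatenate([logmel, delta, ddelta], axis=0)
    return feats.T  # (frames, 120)
```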

4.1.2 Neural network architecture

The neural network architecture for the end-to-end ASR systems is made up of the initial layers of the VGG net architecture (deep CNN) [25] followed by a 6-layer pyramid BLSTM (BLSTM with subsampling [29]). We use a 6-layer CNN architecture which consists of two consecutive 2D convolutional layers followed by one 2D max-pooling layer, then another two 2D convolutional layers followed by one 2D max-pooling layer. The 2D filters used in the convolutional layers all have size 3×3. The max-pooling layers have a patch size of 3×3 and a stride of 2×2. The 6-layer BLSTM has 1024 memory blocks in each layer and direction, and each BLSTM layer is followed by a linear projection. The subsampling factor performed by the BLSTM is 4 [29]. During decoding, the CTC score is used in a one-pass beam search algorithm [29]. The beam width is set to 20. Training and decoding are performed using the ESPnet toolkit [29].
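A hedged PyTorch sketch of such an encoder is shown below. It follows the description above (two conv blocks with 3×3 filters and 3×3/stride-2 max-pooling, six BLSTM layers with 1024 units per direction, each followed by a linear projection), but the channel counts, the projection size, the pooling padding, and the omission of the pyramid subsampling are illustrative assumptions rather than the ESPnet configuration actually used.

```python
import torch
import torch.nn as nn

class VGGBLSTMEncoder(nn.Module):
    """Sketch of a VGG-style CNN front-end followed by a BLSTM stack with
    per-layer linear projections, ending in a CTC output layer."""

    def __init__(self, feat_dim=120, num_chars=50,
                 num_blstm_layers=6, blstm_units=1024):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Frequency axis is roughly quartered by the two stride-2 poolings (padding=1).
        cnn_out_dim = 128 * ((feat_dim + 3) // 4)
        self.blstms = nn.ModuleList()
        self.projections = nn.ModuleList()
        in_dim = cnn_out_dim
        for _ in range(num_blstm_layers):
            self.blstms.append(nn.LSTM(in_dim, blstm_units,
                                       batch_first=True, bidirectional=True))
            self.projections.append(nn.Linear(2 * blstm_units, blstm_units))
            in_dim = blstm_units
        self.output = nn.Linear(blstm_units, num_chars + 1)  # +1 blank for CTC

    def forward(self, feats):
        # feats: (batch, frames, feat_dim)
        x = self.cnn(feats.unsqueeze(1))             # (batch, 128, T', F')
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)   # (batch, T', 128 * F')
        for blstm, proj in zip(self.blstms, self.projections):
            x, _ = blstm(x)                          # (batch, T', 2 * blstm_units)
            x = torch.tanh(proj(x))                  # linear projection per layer
        return self.output(x).log_softmax(dim=-1)    # (batch, T', num_chars + 1)
```

For CTC training, the output would be transposed to (T', batch, classes) before being passed to torch.nn.CTCLoss, as in the earlier sketch.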

4.2 Data

4.2.1 Clean training data

WSJ is a corpus of read speech [19]. All the speech utterances are sampled at 16 kHz and are fairly clean. The WSJ’s standard training set train_si284 consists of around 81 hours of speech. During training, the standard development set test_dev93, which consists of around 1 hour of speech, is used for cross-validation.

4.2.2 Multi-condition training data

The multi-condition training data of the CHiME-4 corpus [28] consists of around 189 hours of speech in total. It comprises the clean speech utterances from the WSJ training corpus as well as simulated and real noisy data. The real data consists of 6-channel recordings of utterances from the WSJ corpus spoken in four environments: café, street junction, public transport (bus), and pedestrian area. The simulated data was constructed by mixing WSJ clean utterances with background recordings from the four mentioned environments. All the data were sampled at 16 kHz. Audio recorded from all the microphone channels is included in the CHiME-4 multi-condition training data, named tr05_multi_noisy_si284 in the ESPnet CHiME-4 recipe. The dt05_multi_isolated_1ch_track set was used for cross-validation during training.

4.2.3 Test and adaptation data

Test and adaptation sets are created from the test sets of the Aurora-4 corpus [18]. The Aurora-4 corpus has 14 test sets which were created by corrupting two clean test sets, recorded by a primary Sennheiser microphone and a secondary microphone, with six types of noises: airport, babble, car, restaurant, street, and train, at 5-15 dB SNRs. The two clean test sets were also included in the 14 test sets. There are 330 utterances in each test set. The noises in Aurora-4 are different from those in the CHiME-4 multi-condition training data. In this work, the .wv1 data [18] from 7 test sets created from the clean test set recorded by the primary Sennheiser microphone are used to create test and adaptation sets.

From the 2310 utterances taken from the 7 test sets of .wv1 data, a test set of 1400 utterances (approx. 2.8 hours of speech), a labeled adaptation set of 300 utterances (approx. 36 minutes), and an unlabeled adaptation set of 610 utterances (approx. 1.2 hours) are created. The utterances in the three sets are selected at random and do not overlap. These sets are used for testing and adaptation in both the clean training and multi-condition training scenarios.
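A non-overlapping random partition of this kind can be produced as in the sketch below (a hypothetical illustration, not the authors' exact selection or seed):

```python
import random

def split_utterances(utt_ids, seed=0):
    """Randomly partition 2310 utterance IDs into non-overlapping test (1400),
    labeled adaptation (300) and unlabeled adaptation (610) sets."""
    assert len(utt_ids) == 2310
    rng = random.Random(seed)
    shuffled = list(utt_ids)
    rng.shuffle(shuffled)
    test_set = shuffled[:1400]
    labeled_adapt = shuffled[1400:1700]
    unlabeled_adapt = shuffled[1700:]
    return test_set, labeled_adapt, unlabeled_adapt
```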

5 Adaptation experiments

Let the FBANK-based and STE-based models be the end-to-end models trained with FBANK and STE features, respectively, on the clean or multi-condition training data. The semi-supervised adaptation experiment is performed as follows (for the sake of clarity, the clean and multi-condition training cases are not distinguished in the notation):

  • First, the back-propagation algorithm is used to fine-tune the initial FBANK-based and STE-based models in supervised mode using the labeled adaptation set of 300 utterances, yielding the corresponding supervised-adapted models (see Figure 1). This step makes use of the available labeled adaptation data and further reduces the WERs of the ASR systems.

  • The two supervised-adapted models are subsequently used to decode the unlabeled adaptation set of 610 utterances, producing one set of 1-best hypotheses per model. Together with the manual transcriptions available for the 300-utterance set, the 300-utterance and 610-utterance sets are grouped to create an adaptation set of 910 utterances whose labels are either manual transcriptions or 1-best hypotheses.

  • Finally, the 910-utterance set is used to adapt the initial FBANK-based model with the back-propagation algorithm to obtain the semi-supervised adapted model.

Figure 1: Supervised adaptation of the initial FBANK-based and STE-based models using the 300-utterance set with manual transcriptions. The models can be trained either on clean or multi-condition training data.
Figure 2: Semi-supervised adaptations using the 910-utterance adaptation set, whose labels include the manual transcriptions and one of the two sets of 1-best hypotheses (FBANK-based or STE-based), or both.

The 910-utterance adaptation set, in which 610 utterances do not have manual transcriptions, is used to adapt the initial FBANK-based system in semi-supervised mode, since only 300 utterances have manual transcriptions. Conventional semi-supervised adaptation using the 910-utterance adaptation set can be performed with the manual transcriptions plus either the FBANK-based or the STE-based set of 1-best hypotheses; this adaptation uses the standard CTC loss (equation (1)). The proposed multiple-hypothesis CTC-based adaptation method, denoted MH-CTC, uses the manual transcriptions together with both sets of 1-best hypotheses; this adaptation uses the proposed loss (equation (3)). These semi-supervised adaptation experiments are depicted in Figure 2.

The reference performance, which can be considered an upper bound for all the mentioned adaptation methods, is obtained with supervised adaptation in which all 910 utterances have manual transcriptions. During adaptation, the learning rate is kept the same as that used during training, because this configuration yields better performance than using different learning rates for training and adaptation. The 1-best hypotheses are obtained after a single decoding pass.
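Putting the pieces together, the following sketch outlines one adaptation epoch for the MH-CTC method. The model, optimizer, data loader, and field names are hypothetical placeholders, and multi_hypothesis_ctc_loss refers to the helper sketched in section 3.2; this is an illustration of the procedure rather than the authors' training code.

```python
import torch

def adapt_mh_ctc(model_fbank, optimizer, loader_910, ctc_loss, mh_ctc_loss):
    """One epoch of MH-CTC semi-supervised adaptation. Batches are assumed to
    contain either only labeled utterances (the 300-utterance subset, with
    manual transcriptions) or only pseudo-labeled utterances (the 610-utterance
    subset, with two sets of 1-best hypotheses from the adapted FBANK-based
    and STE-based systems)."""
    model_fbank.train()
    for batch in loader_910:
        log_probs = model_fbank(batch["fbank_feats"])   # (batch, T, classes)
        log_probs = log_probs.transpose(0, 1)           # (T, batch, classes) for CTC
        if batch["has_manual_transcription"]:
            # Standard CTC loss against manual transcriptions (equation (1)).
            loss = ctc_loss(log_probs, batch["targets"],
                            batch["input_lengths"], batch["target_lengths"])
        else:
            # Proposed loss against both 1-best hypothesis sets (equation (3)).
            loss = mh_ctc_loss(log_probs, batch["input_lengths"],
                               [batch["hyp_fbank"], batch["hyp_ste"]])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```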

6 Results

6.1 Clean training

In the scenario where the systems are trained on the WSJ clean training data and tested on the test set of 1400 Aurora-4 utterances, the initial FBANK-based and STE-based systems have WERs of 55.2% and 60.3%, respectively. The results of applying different adaptation methods to the FBANK-based system are shown in Table 1. Adapting the initial FBANK-based and STE-based systems with the labeled adaptation set of 300 utterances reduces their WERs, measured on the 1400-utterance test set, to 27.2% and 24.5%, respectively. The corresponding WERs measured on the 610-utterance unlabeled adaptation set are 29.1% and 25.6%, respectively.

Supervised adaptation using the 300-utterance adaptation set with manual transcriptions is used as the baseline. It can be observed from Table 1 that the proposed multiple-hypothesis CTC-based adaptation method yields a 6.6% relative WER reduction compared to the baseline. In contrast, the two conventional semi-supervised adaptations, which use the manual transcriptions together with only one of the two sets of 1-best hypotheses, do not yield a WER reduction compared to the FBANK-based baseline system.

6.2 Multi-condition training

The experiments in the clean training scenario are repeated for the multi-condition training scenario. When trained on the multi-condition training data of CHiME-4 and tested on the 1400-utterance test set from Aurora-4, the initial CTC-based end-to-end ASR systems using FBANK and STE features have WERs of 31.0% and 33.8%, respectively. Adapting the initial FBANK-based and STE-based systems with the labeled adaptation set of 300 utterances reduces their WERs, measured on the 1400-utterance test set, to 17.2% and 17.3%, respectively. The corresponding WERs measured on the 610-utterance unlabeled adaptation set are 18.3% and 18.9%, respectively. Results of the adaptation experiments in this scenario are shown in Table 2. As in the clean training scenario, the proposed adaptation method (MH-CTC) yields a 5.8% relative WER reduction compared to the baseline. The semi-supervised adaptations using a single set of 1-best hypotheses together with the manual transcriptions do not yield a WER reduction compared to the baseline.

Table 1: Adaptation of the FBANK-based ASR system trained on the WSJ clean training set with different adaptation methods. The 1-best hypotheses are obtained by decoding with models trained on the clean training data.
Table 2: Adaptation of the FBANK-based ASR system trained on the CHiME-4 multi-condition training set with different adaptation methods. The 1-best hypotheses are obtained by decoding with models trained on the multi-condition training data.

In both clean and multi-condition training scenarios, the supervised adaptations which use manual transcriptions for all 910 utterances have the lowest WERs.

7 Conclusion

This paper has proposed an adaptation method for end-to-end speech recognition. Multiple ASR 1-best hypotheses were used in the computation of the CTC loss function to alleviate the impact of errors in the ASR hypotheses on the computation of the CTC loss when the 1-best hypotheses are used as labels instead of manual transcriptions. The 1-best hypotheses were obtained by using a main ASR system and an additional ASR system, which use FBANK and STE features, respectively, to decode the unlabeled adaptation data. In the clean and multi-condition training scenarios, the proposed adaptation method yielded 6.6% and 5.8% relative WER reductions, respectively, compared to the baseline system, which was adapted with back-propagation fine-tuning using an adaptation subset having manual transcriptions. In contrast, conventional semi-supervised back-propagation fine-tuning did not yield a WER reduction compared to the baseline system. To our knowledge, this is the first time the integration of multiple ASR hypotheses in the CTC loss function has been shown to be consistently effective in reducing WER, which makes it promising for future work.

References

  • [1] P. Bell, J. Fainberg, O. Klejch, J. Li, S. Renals, and P. Swietojanski (2020) Adaptation algorithms for speech recognition: an overview. In arXiv preprint arXiv: 2008.06580, Cited by: §1, §3.2.
  • [2] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. In Proc. Advances in Neural Information Processing Systems (NIPS), pp. 577–585. Cited by: §1.
  • [3] M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani (2018-09) Auxiliary feature based adaptation of end-to-end ASR systems. In Proc. INTERSPEECH, Hyderabad, India, pp. 2444–2448. Cited by: §2.
  • [4] F. Ding, W. Guo, B. Gu, Z. Ling, and J. Du (2020-10) Adaptive speaker normalization for CTC-based speech recognition. In Proc. INTERSPEECH, Shanghai, China, pp. 1266–1270. Cited by: §2.
  • [5] C.-T. Do and Y. Stylianou (2017-08) Improved automatic speech recognition using subband temporal envelope features and time-delay neural network denoising autoencoder. In Proc. INTERSPEECH, Stockholm, Sweden, pp. 3832–3836. Cited by: §4.1.1.
  • [6] C.-T. Do, S. Zhang, and T. Hain (2020-08) Selective adaptation of end-to-end speech recognition using hybrid CTC/attention architecture for noise robustness. In Proc. of the 28th European Signal Processing Conference (EUSIPCO), Amsterdam, The Netherlands, pp. 321–325. Cited by: §2, §3.2.
  • [7] C.-T. Do (2019-05) Subband temporal envelope features and data augmentation for end-to-end recognition of distant conversational speech. In Proc. IEEE ICASSP, Brighton, UK, pp. 6251–6255. Cited by: §4.1.1.
  • [8] R. Doddipatla, T. Kagoshima, C.-T. Do, P.N. Petkov, C. Zorila, E. Kim, D. Hayakawa, H. Fujimura, and Y. Stylianou (2018-09) The Toshiba entry to the CHiME 2018 challenge. In Proc. CHiME 2018 Workshop on Speech Processing in Everyday Environments, Hyderabad, India, pp. 41–45. Cited by: §4.1.1.
  • [9] D. Giuliani and F. Brugnara (2007-Dec.) Experiments on cross-system acoustic model adaptation. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Kyoto, Japan, pp. 117–122. Cited by: §2.
  • [10] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber (2006-06) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. of the 23rd International Conference on Machine Learning, Pittsburgh, USA, pp. 369–376. Cited by: §1.
  • [11] A. Graves and N. Jaitly (2014-06) Towards end-to-end speech recognition with recurrent neural networks. In Proc. of the 31st International Conference on Machine Learning, Beijing, China, pp. 1764–1772. Cited by: §1, §1, §3.1.
  • [12] S. Hochreiter and J. Schmidhuber (1997-11) Long short-term memory. Neural Computation 9, pp. 1735–1780. Cited by: §3.1.
  • [13] Y. LeCun and Y. Bengio (1995) Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks, Cited by: §3.1.
  • [14] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong (2018-12) Speaker adaptation for end-to-end CTC models. In Proc. IEEE Spoken Language Technology Workshop, Athens, Greece, pp. 542–549. Cited by: §2.
  • [15] Z. Meng, Y. Gaur, J. Li, and Y. Gong (2019-09) Speaker adaptation for attention-based end-to-end speech recognition. In Proc. INTERSPEECH, Graz, Austria, pp. 241–245. Cited by: §2.
  • [16] A.-R. Mohamed, G. Hinton, and G. Penn (2012-03) Understanding how deep belief networks perform acoustic modelling. In Proc. IEEE ICASSP, Kyoto, Japan, pp. 4273–4276. Cited by: §4.1.1.
  • [17] T. Ochiai, S. Watanabe, S. Katagiri, T. Hori, and J. Hershey (2018-04) Speaker adaptation for multichannel end-to-end speech recognition. In Proc. IEEE ICASSP, Calgary, Canada, pp. 6707–6711. Cited by: §2.
  • [18] N. Parihar and J. Picone (2002) Aurora working group: DSR front end LVCSR evaluation: AU/384/02. Institute for Signal and Information Processing Technical Report. Cited by: §1, §4.2.3.
  • [19] D. B. Paul and J. M. Baker (1992-02) The design for the Wall Street Journal-based CSR corpus. In HLT '91 Proceedings of the workshop on Speech and Natural Language, New York, USA, pp. 357–362. Cited by: §1, §4.2.1.
  • [20] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely (2011-12) The Kaldi speech recognition toolkit. In Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Hawaii, USA. Cited by: §4.1.1.
  • [21] S. Ruder (2016) An overview of gradient descent optimization algorithms. In arXiv preprint arXiv: 1609.04747, Cited by: §3.2.
  • [22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams (1986) Learning representations by back-propagating errors. Nature 323 (9), pp. 533–536. Cited by: §3.2.
  • [23] L. Samarakoon, B. Mak, and A. Y. S. Lam (2018-12) Domain adaptation of end-to-end speech recognition in low-resource settings. In Proc. IEEE Spoken Language Technology Workshop, Athens, Greece, pp. 382–388. Cited by: §2.
  • [24] L. Sari, N. Moritz, T. Hori, and J. Le Roux (2020-05) Unsupervised speaker adaptation using attention-based speaker memory for end-to-end ASR. In Proc. IEEE ICASSP, Barcelona, Spain, pp. 7384–7388. Cited by: §2.
  • [25] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In Proc. International Conference on Learning Representations, Cited by: §4.1.2.
  • [26] S. Stueker, C. Fuegen, S. Burger, and M. Woelfel (2006-09) Cross-system adaptation and combination for continuous speech recognition: the influence of phoneme set and acoustic front-end. In Proc. INTERSPEECH, Pittsburgh, USA, pp. 521–524. Cited by: §2.
  • [27] E. Tsunoo, Y. Kashiwagi, S. Asakawa, and T. Kumakura (2019-09) End-to-end adaptation with backpropagation through WFST for on-device speech recognition system. In Proc. INTERSPEECH, Graz, Austria, pp. 764–768. Cited by: §2.
  • [28] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer (2017-09) An analysis of environment, microphone and data simulation mismatches in robust speech recognition. Computer Speech and Language 46, pp. 535–557. Cited by: §1, §4.2.2.
  • [29] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai (2018-09) ESPnet: end-to-end speech processing toolkit. In Proc. INTERSPEECH, Hyderabad, India, pp. 2207–2211. Cited by: §4.1.1, §4.1.2.