Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units

10/08/2021, by Yosuke Higuchi, et al.

In end-to-end automatic speech recognition (ASR), a model is expected to implicitly learn representations suitable for recognizing a word-level sequence. However, the huge abstraction gap between input acoustic signals and output linguistic tokens makes it challenging for a model to learn such representations. In this work, to promote word-level representation learning in end-to-end ASR, we propose a hierarchical conditional model based on connectionist temporal classification (CTC). Our model is trained with auxiliary CTC losses applied to intermediate layers, where the vocabulary size of each target subword sequence is gradually increased as the layer becomes closer to the word-level output. Here, we make each level of sequence prediction explicitly conditioned on the previous sequences predicted at lower levels. With the proposed approach, we expect the model to learn word-level representations effectively by exploiting a hierarchy of linguistic structures. Experimental results on LibriSpeech-{100h, 960h} and TEDLIUM2 demonstrate that the proposed model improves over a standard CTC-based model and other competitive models from prior work. We further analyze the results to confirm the effectiveness of the intended representation learning with our model.




1 Introduction

End-to-end automatic speech recognition (ASR) aims to model direct speech-to-text conversion [8, 5, 2], which substantially simplifies the training and inference processes without external knowledge (e.g., a pronunciation lexicon). With well-established sequence-to-sequence modeling techniques [9, 10, 39, 1] and more sophisticated neural network architectures [6, 11, 26], end-to-end ASR models have shown promising performance on various benchmarks [4, 25, 17].

Contrary to carefully designed feature extraction in the traditional pipeline framework, end-to-end models are generally expected to implicitly learn representations suitable for solving a specific task. For example, the learned representations have been shown to represent shape features for image classification [45] and syntactic structures for language modeling [32]. However, in ASR, it can be more challenging for an end-to-end model to learn representations automatically. Having no access to segmentation or alignment information, end-to-end ASR models are required to predict word-level linguistic tokens from frame-level acoustic signals. This input-output gap in the level of abstraction makes it difficult to optimize end-to-end ASR, unless a large amount of data or a strong language model is accessible during training or inference [46, 14].

To promote word-level representation learning in end-to-end ASR, we believe that a model should be trained to gradually increase the abstraction level of linguistic information, as it has long been considered reasonable for recognizing speech (i.e., speech → phonemes → words → text) [15]. By exploiting lower levels of abstraction to conditionally compose the higher-level linguistic information, an end-to-end ASR model should be able to handle the sparsity problem of words [38] and extract effective representations.

Figure 1: Proposed hierarchical conditional model of end-to-end ASR.

To achieve such progressive representation learning for ASR, we propose hierarchical conditional modeling of end-to-end ASR (Figure 1). Our model consists of multiple connectionist temporal classification (CTC) [9] losses hierarchically applied to the intermediate and last layers, inspired by previous studies [7, 34, 41, 36, 20, 40, 23]. Each loss calculation targets sequences with a different granularity of linguistic information: sequences with lower abstraction levels are predicted from the intermediate layers, and a word-level sequence is predicted from the last layer. Specifically, we focus on subwords (n-gram characters) and increase the vocabulary size toward word-level as the model layer becomes closer to the output (e.g., 256 → 2k → 16k). In addition to this hierarchical structure, we design the model to predict the sequence at each abstraction level by explicitly conditioning on the previously predicted sequences at lower levels, which is crucial for retaining the subwords that compose the higher-level sequence. The proposed model should thus capture a hierarchy of linguistic structures and yield representations suitable for modeling words.

The key contributions of this work are summarized as follows. 1) We show that the proposed approach enables a CTC-based system to learn accurate word-level ASR, mitigating the data-sparsity issue by gradually increasing the abstraction level of intermediate predictions. 2) Based on experiments conducted on LibriSpeech and TEDLIUM2, we demonstrate the effectiveness of our model across variations in data size and speaking style. All of our implementations are made publicly available on our ESPnet fork (https://github.com/YosukeHiguchi/espnet/tree/hierctc). 3) We carefully compare our model with other CTC-based models and further analyze the results, providing in-depth insights into the advantages of the proposed modeling.

2 Hierarchical Conditional End-to-End ASR

2.1 Baseline architecture of end-to-end ASR

End-to-end ASR is formulated as a sequence-mapping problem between a $T$-length input sequence $X = (\mathbf{x}_t \in \mathbb{R}^D \mid t = 1, \dots, T)$ and an $L$-length output sequence $W = (w_l \in \mathcal{V} \mid l = 1, \dots, L)$. Here, $\mathbf{x}_t$ is a $D$-dimensional acoustic feature at frame $t$, $w_l$ is an output token at position $l$, and $\mathcal{V}$ is a vocabulary. As a baseline, we focus on a Transformer-based model [42] optimized by CTC [9] with intermediate loss calculation [40, 23].

Transformer encoder: For encoding an audio sequence into latent representations, we construct the Transformer encoder [42] consisting of a stack of $N$ self-attention layers. The $n$-th layer outputs a sequence of $d$-dimensional latent representations $X^{(n)}$ as

\[ \bar{X}^{(n)} = X^{(n-1)} + \mathrm{SelfAttention}(\mathrm{LN}(X^{(n-1)})), \tag{1} \]
\[ X^{(n)} = \bar{X}^{(n)} + \mathrm{FFN}(\mathrm{LN}(\bar{X}^{(n)})), \tag{2} \]

where $n = 1, \dots, N$, and $X^{(0)}$ is obtained by adding positional encodings to the input $X$. In Eqs. (1) and (2), layer normalization $\mathrm{LN}(\cdot)$ is applied to each input of the self-attention mechanism $\mathrm{SelfAttention}(\cdot)$ and feedforward network $\mathrm{FFN}(\cdot)$. We also train a model with the Conformer encoder [11], which introduces a convolution neural network (CNN) into the Transformer encoder, i.e., a convolution module is added between Eqs. (1) and (2).
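The pre-LN residual structure in Eqs. (1) and (2) can be sketched compactly. Below is a minimal plain-Python illustration; `self_attn` and `ffn` are hypothetical stand-in callables for the real sublayers, and the vectors are toy values, not model activations.

```python
# Sketch of one pre-LN encoder layer: LN is applied to the *input* of each
# sublayer, and the sublayer output is added back residually.

def layer_norm(v, eps=1e-5):
    """Normalize a single frame vector to zero mean and unit variance."""
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / (var + eps) ** 0.5 for x in v]

def encoder_layer(x, self_attn, ffn):
    # Eq. (1): x_bar = x + SelfAttention(LN(x))
    x_bar = [a + b for a, b in zip(x, self_attn(layer_norm(x)))]
    # Eq. (2): out = x_bar + FFN(LN(x_bar))
    return [a + b for a, b in zip(x_bar, ffn(layer_norm(x_bar)))]

# With zero-output sublayers the layer reduces to the identity map, which is
# exactly the property the residual formulation guarantees.
zero = lambda v: [0.0] * len(v)
assert encoder_layer([1.0, 2.0, 3.0], zero, zero) == [1.0, 2.0, 3.0]
```

The residual paths are what let intermediate predictions (Sec. 2.1) be injected without disturbing the rest of the stack.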

Connectionist temporal classification: CTC [9] optimizes the model to predict a monotonic alignment between the encoded input $X^{(N)}$ and output $W$. To align the sequences at the frame level, the output sequence is augmented with a unique blank token $\epsilon$, which results in a latent token sequence $A = (a_t \in \mathcal{V} \cup \{\epsilon\} \mid t = 1, \dots, T)$. On the basis of the conditional independence assumption per token-frame prediction, CTC models the conditional probability $P(W \mid X)$ by marginalizing over latent token sequences as

\[ P(W \mid X) = \sum_{A \in \beta^{-1}(W)} \prod_{t=1}^{T} P(a_t \mid X), \tag{3} \]

where $\beta^{-1}(W)$ returns all possible latent sequences compatible with $W$. The CTC loss is defined as the negative log-likelihood of Eq. (3):

\[ \mathcal{L}_{\mathrm{ctc}} = -\log P(W \mid X). \tag{4} \]
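The marginalization over $\beta^{-1}(W)$ is computed efficiently with the standard CTC forward (dynamic-programming) recursion over the blank-augmented target. A minimal sketch in plain Python; the frame posteriors below are hypothetical toy inputs, not outputs of the paper's model.

```python
# Forward (alpha) recursion for CTC: sums the probabilities of all
# blank-augmented latent sequences A that collapse to the target W.

BLANK = 0  # index of the blank token

def ctc_forward_prob(posteriors, target):
    """P(W|X) for per-frame posteriors (T x |V|+1) and a target label list."""
    # Interleave blanks: (w_1, ..., w_L) -> (b, w_1, b, w_2, ..., w_L, b)
    ext = [BLANK]
    for w in target:
        ext += [w, BLANK]
    S = len(ext)
    # alpha[s]: probability of having emitted ext[:s+1] so far
    alpha = [0.0] * S
    alpha[0] = posteriors[0][ext[0]]
    if S > 1:
        alpha[1] = posteriors[0][ext[1]]
    for t in range(1, len(posteriors)):
        new = [0.0] * S
        for s in range(S):
            p = alpha[s]
            if s > 0:
                p += alpha[s - 1]
            # Skip transition allowed unless current token is blank or
            # repeats the previous (non-blank) label.
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                p += alpha[s - 2]
            new[s] = p * posteriors[t][ext[s]]
        alpha = new
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
```

For two frames of uniform posteriors over {blank, 1} and target [1], the three compatible alignments (b,1), (1,b), (1,1) each have probability 0.25, so the routine returns 0.75.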
Intermediate CTC: In addition to the standard CTC loss calculated from the model output, auxiliary CTC losses can be iteratively applied to intermediate layers [40, 23]. Such intermediate losses effectively regularize the model training and lead to improved ASR performance. We consider training the model with a total of $K$ CTC losses applied to the output and intermediate layers:

\[ \mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}_{\mathrm{ctc}}^{(k)}, \tag{5} \]

where $\mathcal{L}_{\mathrm{ctc}}^{(k)}$ is the CTC loss calculated at the $k$-th loss layer, and we equally distribute the weight $1/K$ across the losses [22]. We make each loss calculation conditioned on the CTC predictions obtained from previous layers, as it has shown notable improvement over CTC-based models [29]. For an intermediate layer from which a CTC loss is calculated, we modify Eq. (2) as

\[ X^{(n)} = \tilde{X}^{(n)} + \mathrm{LinearIn}(Z^{(n)}), \tag{6} \]
\[ Z^{(n)} = \mathrm{Softmax}(\mathrm{LinearOut}(\tilde{X}^{(n)})), \tag{7} \]

where $\tilde{X}^{(n)}$ denotes the original output of Eq. (2), and $Z^{(n)}$ is a sequence of the posterior distributions w.r.t. latent tokens computed by CTC.
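The conditioning mechanism described above can be sketched numerically for a single frame: frame-wise CTC posteriors are obtained with a softmax over an output projection and mapped back to the encoder width by a second linear layer, then added residually. The projection names, weights, and dimensions below are all hypothetical toy values, not the paper's parameters.

```python
# Toy single-frame sketch of conditioning a layer on CTC posteriors.
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

d_model, vocab = 4, 3  # toy encoder width and latent-token vocabulary size
W_out = [[0.1 * (i + j) for j in range(d_model)] for i in range(vocab)]  # d -> |V|
W_in = [[0.1 * (i - j) for j in range(vocab)] for i in range(d_model)]   # |V| -> d

x = [1.0, -0.5, 0.3, 0.2]                 # one frame of the layer output
z = softmax(matvec(W_out, x))             # frame posterior over latent tokens
x_cond = [a + b for a, b in zip(x, matvec(W_in, z))]  # residual conditioning
assert len(x_cond) == d_model and abs(sum(z) - 1.0) < 1e-9
```

The cost of the softmax grows with the vocabulary size, which is why keeping the intermediate vocabularies small (Sec. 2.3) also speeds up training and inference.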

2.2 Subword segmentation

For tokenizing target sequences, subword segmentation is a widely used approach for alleviating the out-of-vocabulary problem [37], where words in a sentence are split into subword units (or n-gram characters). In the general algorithm for building a subword vocabulary, pairs of subword units are repeatedly merged on the basis of the frequency appearing in a text corpus. The iteration stops when the vocabulary reaches an arbitrary size.

We adopt subwords for tokenizing ASR transcriptions. As opposed to characters, subwords provide the model with shorter output sequences, thus reducing the difficulty of modeling the dependency between outputs. This can be especially important for CTC-based modeling with its conditional independence assumption. However, it should be noted that increasing the subword vocabulary size makes a sequence closer to word-level and can lead to the data-sparsity problem [38].
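The merge-based vocabulary construction described above can be written compactly: the most frequent adjacent pair of units is merged repeatedly until the vocabulary reaches the requested size. The corpus and target size below are toy values; real systems use a toolkit such as SentencePiece [21].

```python
# Toy frequency-based subword-vocabulary construction (BPE-style merges).
from collections import Counter

def build_subword_vocab(corpus, vocab_size):
    """corpus: list of words; returns the learned set of subword units."""
    words = [list(w) for w in corpus]           # start from characters
    vocab = {c for w in words for c in w}
    while len(vocab) < vocab_size:
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]     # most frequent adjacent pair
        merged = a + b
        vocab.add(merged)
        for w in words:                          # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return vocab
```

Stopping the loop at different sizes is exactly how the multi-granular vocabularies (e.g., 256, 2k, 16k) of the next section would be obtained.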

Model                     LibriSpeech-100h           LibriSpeech-960h           TEDLIUM2
                          Dev WER      Test WER      Dev WER      Test WER      Dev WER  Test WER
                          clean other  clean other   clean other  clean other
Transformer  CTC           11.5  24.8   11.8  25.5    4.2  10.0    4.5   9.9     11.8     10.7
             InterCTC       8.9  21.0    9.1  21.7    3.2   8.2    3.5   8.2      9.4      8.6
             HC-CTC         8.2  19.9    8.4  20.6    3.1   8.0    3.4   8.0      9.1      8.6
             ParaCTC       10.4  24.0   10.9  24.3    4.6  10.3    4.8  10.3     10.9     10.2
Conformer    InterCTC       7.1  17.7    7.7  18.3    2.8   6.7    3.0   6.9      8.5      7.8
             HC-CTC         6.9  17.1    7.1  17.8    2.8   6.9    3.0   6.8      8.0      7.6
Table 1: Word error rate (WER) [%] on LibriSpeech-{100h, 960h} and TEDLIUM2. The output subword vocabulary size was set to 16k for LibriSpeech-100h and TEDLIUM2, and 32k for LibriSpeech-960h. We did not use a language model or beam search during decoding.

2.3 Proposed hierarchical conditional model

Figure 1 shows an overview of the proposed hierarchical conditional model of end-to-end ASR. It is similar to the intermediate CTC training, but the granularity of subword units is gradually increased to word-level as the sequence transduction proceeds through the self-attention layers. Let $W_k = (w_{k,l} \in \mathcal{V}_k \mid l = 1, \dots, L_k)$ be an $L_k$-length target subword sequence for the $k$-th CTC loss, which is generated by the corresponding subword segmenter with a vocabulary $\mathcal{V}_k$. We hierarchically increase the vocabulary size as the position of the CTC loss becomes closer to the output layer (i.e., $|\mathcal{V}_1| < \dots < |\mathcal{V}_K|$). Given the target sequences with different units, the objective of the proposed model is defined by modifying Eq. (5) as follows:

\[ \mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}_{\mathrm{ctc}}^{(k)}(W_k). \tag{8} \]

If the vocabulary size of each target sequence is the same, Eq. (8) is equal to Eq. (5). With the conditioning mechanism realized by Eq. (7), each CTC loss calculation in Eq. (8) is conditioned on the previously predicted sequences with lower levels of subword units:

\[ P(W_k \mid \hat{W}_1, \dots, \hat{W}_{k-1}, X), \tag{9} \]

where $\hat{W}_{k'}$ denotes a sequence predicted by the $k'$-th CTC, which is implicitly represented by the posterior distributions of latent tokens.

In the proposed hierarchical conditional model, we break down the word-level recognition into a process of progressively integrating subwords in a fine-to-coarse manner. By making the shallower layers predict frequent subwords with small units and the deeper layers predict sparse subwords with large units, we expect the model to use a hierarchy of linguistic structures and yield word-level representations effectively.
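One way to picture this fine-to-coarse decomposition is to segment the same transcript with nested vocabularies of growing size: shallower layers target longer sequences of frequent small units, and deeper layers target shorter sequences of sparser large units. The greedy longest-match segmenter, transcript, and vocabularies below are hypothetical toy stand-ins for the real SentencePiece segmenters.

```python
# Toy multi-granular segmentation of one transcript with nested vocabularies.

def segment(text, vocab):
    """Greedy longest-match segmentation of text into units from vocab."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest unit first
            if text[i:j] in vocab or j == i + 1:  # fall back to a character
                out.append(text[i:j])
                i = j
                break
    return out

transcript = "thestreets"
# V_1 subset of V_2 subset of V_3: vocabulary grows toward the output layer
v1 = set("thestreets")                 # characters
v2 = v1 | {"th", "st", "re", "et"}     # small subwords
v3 = v2 | {"the", "streets"}           # near word-level

targets = [segment(transcript, v) for v in (v1, v2, v3)]
# Target sequences shrink as the granularity coarsens toward word-level.
assert len(targets[0]) >= len(targets[1]) >= len(targets[2])
```

Here `targets[2]` is ["the", "streets"], while `targets[0]` is the ten characters: exactly the kind of nested supervision the hierarchical losses receive.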

2.4 Applying CTC losses in parallel

To verify the effectiveness of the proposed model with the hierarchical structure, we also consider training a model with CTC losses applied in parallel to the final layer, which has been shown effective in several studies [24, 36, 19, 12]. The objective for the parallel CTC losses is defined by modifying Eq. (8) as

\[ \mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}_{\mathrm{ctc}}^{(k)}(W_k \mid \mathrm{Linear}_k(X^{(N)})). \tag{10} \]

We apply a single linear layer $\mathrm{Linear}_k(\cdot)$ to the final encoder output $X^{(N)}$ for adapting features to each CTC loss with a different granularity of subword units.

The parallel CTC training treats the predictions of multi-granular sequences equally, where finer subword predictions provide an inductive bias to promote coarse word-level modeling [19].

3 Relationship to Prior Work

Several studies have explored introducing auxiliary CTC losses to intermediate model layers and demonstrated their effectiveness for improving various end-to-end ASR systems, based on attention-based sequence-to-sequence models [18, 27], recurrent neural network transducers [16], and CTC [48, 40, 3, 23]. For CTC-based systems, hierarchically applying low-level supervision (e.g., phonemes) to the intermediate CTC losses has been shown to improve a primary CTC loss with higher-level recognition [7, 41, 34, 20, 36]. The proposed model can be considered an extension of these hierarchical CTC-based models. However, our work differs from prior work in the following respects. 1) Each CTC loss is explicitly conditioned on the sequences predicted previously at lower abstraction levels. We expect the model to maintain subwords that contribute to composing a word-level sequence and to promote the CTC training with conditional dependencies [29]. 2) Given that, in recent studies [40, 23], the intermediate CTC losses are effective even without the hierarchical supervision, we carefully conduct a comparative experiment and further analyze the effectiveness of hierarchical modeling. 3) We only use subwords for target sequences, which requires no additional labeling effort and makes it easy to control the granularity of target sequences. 4) We evaluate models using recent state-of-the-art architectures (i.e., Transformer [42] and Conformer [11]).

4 Experiments

4.1 Experimental setup

Data: The experiments were carried out using the LibriSpeech (LS) [30] and TEDLIUM2 (TED2) [35] datasets. LS consists of utterances from read English audio books. We trained the models using the 100-hour subset (LS-100) or the 960-hour full set (LS-960). TED2 consists of utterances from English Ted Talks and contains 210 hours of training data. For each dataset, we used the standard development and test sets. As input speech features, we extracted 80 mel-scale filterbank coefficients with three-dimensional pitch features using Kaldi [33], which were augmented by speed perturbation and SpecAugment [31]. We used SentencePiece [21] to construct subword vocabularies for each dataset.

Evaluated models: CTC denotes a standard CTC-based model trained with the loss from Eq. (4) [8]. InterCTC is a conventional model trained with the intermediate CTC losses defined by Eq. (5) [40, 23, 29]. HC-CTC is the proposed hierarchical conditional model trained with the objective from Eq. (8). ParaCTC is a conventional model trained with the parallel CTC losses defined by Eq. (10) [24, 36, 19].

Training and decoding configurations: All experiments were conducted using ESPnet [43]. We used the Transformer [42] architecture to train the above models, which consisted of two CNN layers followed by a stack of 18 self-attention layers. The number of heads, dimension of a self-attention layer, and dimension of a feed-forward network were set to 4, 256, and 2048, respectively. We also trained the models using the Conformer architecture [11], which had a kernel size of 15 and the same configurations as the Transformer-based models, except that the feed-forward dimension was set to 1024. The models were trained up to 100 epochs. For models with multiple CTC losses (i.e., InterCTC, HC-CTC, and ParaCTC), we set the total number of losses to 3 ($K = 3$). The output vocabulary sizes for LS-100, LS-960, and TED2 were set to 16384, 32768, and 16384, respectively. Each vocabulary size was determined by the maximum number we could set using SentencePiece, which is large enough to be considered word-level. InterCTC had intermediate losses with the same vocabulary size as the output's. For HC-CTC and ParaCTC, we set $(|\mathcal{V}_1|, |\mathcal{V}_2|, |\mathcal{V}_3|)$ to (256, 2048, 16384) for LS-100 and TED2, and (512, 4096, 32768) for LS-960. After training, a final model was obtained by averaging model parameters over 10 to 20 checkpoints with the best validation performance. During decoding, we did not use any language model and carried out the best path decoding of CTC [9]. Our implementations are publicly available to ensure reproducibility (see Sec. 1).

4.2 Main results

Table 1 lists the results on LS-100, LS-960, and TED2 in terms of the word error rate (WER). Looking at the Transformer results, all the models trained with multiple CTC losses improved over the standard CTC-based model. In particular, InterCTC and HC-CTC significantly reduced the WER on all of the tasks. On LS-100, HC-CTC showed a clear improvement over InterCTC, indicating the effectiveness of hierarchically increasing the subword units. In contrast, on LS-960 and TED2 with more data, the performance gap was reduced, and HC-CTC performed only slightly better than InterCTC. Therefore, it can be concluded that our model is particularly effective for smaller-scale data, where the word-level units are likely to become sparser. InterCTC was capable of handling word-level units when a sufficient amount of data was available. However, the large vocabulary-sized softmax calculation (in Eq. (7)) led to a severe slow-down of the InterCTC training and inference processes. HC-CTC, on the other hand, was able to perform faster training and inference, using finer units for the losses from intermediate layers. For the same reason regarding the softmax calculation, the model size of HC-CTC was much smaller than that of InterCTC (e.g., 36.4M vs. 67.6M parameters on LS-960). Comparing HC-CTC with ParaCTC, HC-CTC achieved much lower WERs on all tasks, demonstrating the effectiveness of applying CTC losses to intermediate layers as well as gradually increasing the subword units in a hierarchical manner.

Using Conformer further improved the performance of InterCTC and HC-CTC, and HC-CTC again achieved more favorable performance than InterCTC with faster training and inference. Our Conformer results are comparable with other strong CTC-based models of the same size [28, 26], even without exhaustive tuning.

4.3 Analysis on subword vocabulary size

Model      |V_1| - |V_2| - |V_3|   dev-clean  dev-other
InterCTC    256  -  256  -  256       8.4       22.8
InterCTC    2k   -  2k   -  2k        8.5       22.0
InterCTC    16k  -  16k  -  16k       8.9       21.0
HC-CTC      256  -  256  -  16k       8.2       20.2
HC-CTC      2k   -  2k   -  16k       8.4       20.2
HC-CTC      256  -  2k   -  16k       8.2       19.9
Table 2: WER [%] on LS-100 dev. sets for Transformer-based models trained with different combinations of subword vocabulary sizes.

While using sparse word-level units can make training of an ASR model challenging [38], we observed that the standard CTC-based model, with the Transformer-based architecture, benefits from training with a large subword vocabulary size. By increasing the output vocabulary size from 256 to 16k, the WERs on the dev. sets changed from 11.1/28.1% to 11.5/24.8% on LS-100, and from 12.3% to 11.8% on TED2. Similarly, the performance on LS-960 changed from 4.6/12.1% to 4.4/10.5% by increasing the vocabulary size from 2k to 32k. These decent improvements from increasing the subword vocabulary size can be attributed to compensating for CTC's inability to model output dependencies (cf. Eq. (3)).

Considering the above observation, we evaluated InterCTC and HC-CTC with different combinations of vocabulary sizes, focusing on Transformer-based models trained on LS-100. From the InterCTC results in Table 2, the performance on the dev-other set improved as the vocabulary size increased, benefiting from the CTC training with large subword units. HC-CTC performed better than the 16k result of InterCTC, indicating that HC-CTC was more effective at modeling word-level recognition beyond the advantage of CTC training with a large vocabulary size. While the InterCTC performance on the dev-clean set degraded as the vocabulary size increased, HC-CTC succeeded in learning robust word-level representations and achieved the lowest WER with the 16k vocabulary size. Comparing the HC-CTC results, hierarchically increasing the subword units resulted in better performance than using the same vocabulary size for the intermediate losses, suggesting the importance of gradually increasing the abstraction level for learning word-level representations effectively.

4.4 Importance of conditioning

We studied the effectiveness of the conditioning mechanism, which is one of the important components of the proposed model (cf. Eq. (9)). The Transformer-based HC-CTC was trained on LS-100 without conditioning each CTC loss (i.e., Eqs. (1) and (2) were used for all the intermediate layers). Note that this model is similar to those from previous studies [7, 41, 34, 20, 36]. Without the conditioning mechanism, HC-CTC achieved WERs of 8.7/20.7% and 9.0/21.3% on dev. sets and test sets, respectively. While these results are better than those obtained from CTC, InterCTC, and ParaCTC in Table 1, HC-CTC with the conditioning mechanism achieved much lower WERs. Overall, we can conclude that 1) hierarchical modeling based on multi-granular subword units as well as 2) the conditioning mechanism for explicitly maintaining lower levels of predictions are effective for learning word-level representations.

4.5 Attention visualization

(a) CTC

(b) HC-CTC

Figure 2: Attention visualization of (a) CTC and (b) HC-CTC trained on LS-100 from Table 1. We manually chose a partial utterance from the dev-other set (116-288045-0000), the transcription of which is "STREETS ASTIR WITH THRONGS OF WELL DRESSED".

Figure 2 visualizes the attention weights between the source (x-axis) and target (y-axis) sequences, comparing Transformer-based (a) CTC and (b) HC-CTC trained on LS-100 from Table 1. We focused on the weights that seemed to contribute to predicting a 16k-subword sequence in the final CTC (from the 18th layer). For HC-CTC, we also show the CTC posteriors (from the 12th layer) used for predicting a 2k-subword sequence, to see their relationship to the 16k prediction. Comparing the overall weights, HC-CTC learned sharper and more confident weights than CTC. HC-CTC seemed to exploit the lower-level 2k predictions to detect the frames important for predicting each token, effectively composing complex word-level tokens from the lower-level tokens. For example, HC-CTC successfully recognized the words "THRONGS" and "DRESSED" with proper conjunctions, while CTC failed to handle these infrequent words.

5 Conclusions

We proposed a hierarchical conditional model for CTC-based end-to-end ASR. We trained the model by gradually increasing the subword units for the CTC losses applied to intermediate layers. Each CTC loss was conditioned on the sequences at lower abstraction levels to compose higher-level predictions. Experimental results and in-depth analysis demonstrated that our model effectively learned word-level representations that improve ASR performance. Future work includes introducing an additional decoder network [13] and using acoustic-based subword units for the lower-level predictions [44, 47].

6 Acknowledgement

This work was supported in part by JST ACT-X (JPMJAX210J).


  • [1] D. Bahdanau et al. (2014) Neural machine translation by jointly learning to align and translate. In Proc. ICLR, Cited by: §1.
  • [2] W. Chan et al. (2016) Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pp. 4960–4964. Cited by: §1.
  • [3] E. A. Chi et al. (2021) Align-Refine: Non-autoregressive speech recognition via iterative realignment. In Proc. NAACL-HLT, pp. 1920–1927. Cited by: §3.
  • [4] C. Chiu et al. (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. ICASSP, pp. 4774–4778. Cited by: §1.
  • [5] J. K. Chorowski et al. (2015) Attention-based models for speech recognition. In Proc. NeurIPS, pp. 577–585. Cited by: §1.
  • [6] L. Dong et al. (2018) Speech-Transformer: A no-recurrence sequence-to-sequence model for speech recognition. In Proc. ICASSP, pp. 5884–5888. Cited by: §1.
  • [7] S. Fernández et al. (2007) Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proc. IJCAI, pp. 774–779. Cited by: §1, §3, §4.4.
  • [8] A. Graves and N. Jaitly (2014) Towards end-to-end speech recognition with recurrent neural networks. In Proc. ICML, pp. 1764–1772. Cited by: §1, §4.1.
  • [9] A. Graves et al. (2006) Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, pp. 369–376. Cited by: §1, §1, §2.1, §2.1, §4.1.
  • [10] A. Graves (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §1.
  • [11] A. Gulati et al. (2020) Conformer: Convolution-augmented Transformer for speech recognition. In Proc. Interspeech, pp. 5036–5040. Cited by: §1, §2.1, §3, §4.1.
  • [12] A. Heba et al. (2019) Char+CV-CTC: Combining graphemes and consonant/vowel units for CTC-based ASR using multitask learning. In Proc. Interspeech, pp. 1611–1615. Cited by: §2.4.
  • [13] Y. Higuchi et al. (2020) Mask CTC: non-autoregressive end-to-end ASR with CTC and mask predict. In Proc. Interspeech, pp. 3655–3659. Cited by: §5.
  • [14] K. Irie et al. (2019) Language modeling with deep Transformers. In Proc. Interspeech, pp. 3905–3909. Cited by: §1.
  • [15] F. Jelinek (1976) Continuous speech recognition by statistical methods. Proc. IEEE 64 (4), pp. 532–556. Cited by: §1.
  • [16] J. Jeon and E. Kim (2021) Multitask learning and joint optimization for Transformer-RNN-Transducer speech recognition. In Proc. ICASSP, pp. 6793–6797. Cited by: §3.
  • [17] S. Karita et al. (2019) A comparative study on Transformer vs RNN in speech applications. In Proc. ASRU, pp. 449–456. Cited by: §1.
  • [18] S. Kim et al. (2017) Joint CTC-attention based end-to-end speech recognition using multi-task learning. In Proc. ICASSP, pp. 4835–4839. Cited by: §3.
  • [19] J. Kremer et al. (2018) On the inductive bias of word-character-level multi-task learning for speech recognition. arXiv preprint arXiv:1812.02308. Cited by: §2.4, §2.4, §4.1.
  • [20] K. Krishna et al. (2018) Hierarchical multitask learning for CTC-based speech recognition. arXiv preprint arXiv:1807.06234. Cited by: §1, §3, §4.4.
  • [21] T. Kudo (2018) Subword regularization: Improving neural network translation models with multiple subword candidates. In Proc. ACL, pp. 66–75. Cited by: §4.1.
  • [22] J. Lee et al. (2021) Layer pruning on demand with intermediate CTC. In Proc. Interspeech, pp. 3745–3749. Cited by: §2.1.
  • [23] J. Lee and S. Watanabe (2021) Intermediate loss regularization for CTC-based speech recognition. In Proc. ICASSP, pp. 6224–6228. Cited by: §1, §2.1, §2.1, §3, §4.1.
  • [24] J. Li et al. (2017) Acoustic-to-word model without OOV. In Proc. ASRU, pp. 111–117. Cited by: §2.4, §4.1.
  • [25] C. Lüscher et al. (2019) RWTH ASR systems for LibriSpeech: Hybrid vs attention. In Proc. Interspeech, pp. 231–235. Cited by: §1.
  • [26] S. Majumdar et al. (2021) Citrinet: Closing the gap between non-autoregressive and autoregressive end-to-end models for automatic speech recognition. arXiv preprint arXiv:2104.01721. Cited by: §1, §4.2.
  • [27] T. Moriya et al. (2018) Multi-task learning with augmentation strategy for acoustic-to-word attention-based encoder-decoder speech recognition.. In Proc. Interspeech, pp. 2399–2403. Cited by: §3.
  • [28] E. G. Ng et al. (2021) Pushing the limits of non-autoregressive speech recognition. In Proc. Interspeech, pp. 3725–3729. Cited by: §4.2.
  • [29] J. Nozaki and T. Komatsu (2021) Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions. In Proc. Interspeech, pp. 3735–3739. Cited by: §2.1, §3, §4.1.
  • [30] V. Panayotov et al. (2015) Librispeech: an ASR corpus based on public domain audio books. In Proc. ICASSP, pp. 5206–5210. Cited by: §4.1.
  • [31] D. S. Park et al. (2019) SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech, pp. 2613–2617. Cited by: §4.1.
  • [32] M. E. Peters et al. (2018) Deep contextualized word representations. In Proc. NAACL-HLT, pp. 2227–2237. Cited by: §1.
  • [33] D. Povey et al. (2011) The Kaldi speech recognition toolkit. In Proc. ASRU, Cited by: §4.1.
  • [34] K. Rao and H. Sak (2017) Multi-accent speech recognition with hierarchical grapheme based models. In Proc. ICASSP, pp. 4815–4819. Cited by: §1, §3, §4.4.
  • [35] A. Rousseau et al. (2014) Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In Proc. LREC, pp. 3935–3939. Cited by: §4.1.
  • [36] R. Sanabria and F. Metze (2018) Hierarchical multitask learning with CTC. In Proc. SLT, pp. 485–490. Cited by: §1, §2.4, §3, §4.1, §4.4.
  • [37] R. Sennrich et al. (2016) Neural machine translation of rare words with subword units. In Proc. ACL, pp. 1715–1725. Cited by: §2.2.
  • [38] H. Soltau et al. (2016) Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition. arXiv preprint arXiv:1610.09975. Cited by: §1, §2.2, §4.3.
  • [39] I. Sutskever et al. (2014) Sequence to sequence learning with neural networks. In Proc. NeurIPS, pp. 3104–3112. Cited by: §1.
  • [40] A. Tjandra et al. (2020) Deja-vu: Double feature presentation and iterated loss in deep Transformer networks. In Proc. ICASSP, pp. 6899–6903. Cited by: §1, §2.1, §3, §4.1.
  • [41] S. Toshniwal et al. (2017) Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition. arXiv preprint arXiv:1704.01631. Cited by: §1, §3, §4.4.
  • [42] A. Vaswani et al. (2017) Attention is all you need. In Proc. NeurIPS, pp. 5998–6008. Cited by: §2.1, §2.1, §3, §4.1.
  • [43] S. Watanabe et al. (2018) ESPnet: End-to-end speech processing toolkit. In Proc. Interspeech, pp. 2207–2211. Cited by: §4.1.
  • [44] H. Xu et al. (2019) Improving end-to-end speech recognition with pronunciation-assisted sub-word modeling. In Proc. ICASSP, pp. 7110–7114. Cited by: §5.
  • [45] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In Proc. ECCV, pp. 818–833. Cited by: §1.
  • [46] Y. Zhang et al. (2020) Pushing the limits of semi-supervised learning for automatic speech recognition. arXiv preprint arXiv:2010.10504. Cited by: §1.
  • [47] W. Zhou et al. (2021) Acoustic data-driven subword modeling for end-to-end speech recognition. In Proc. Interspeech, pp. 2886–2890. Cited by: §5.
  • [48] G. Zweig et al. (2017) Advances in all-neural speech recognition. In Proc. ICASSP, pp. 4805–4809. Cited by: §3.