End-to-End (E2E) architecture has been a promising strategy for ASR systems. In this strategy, a single network is employed to directly map acoustic features into a sequence of characters or words, without the need for the pronunciation dictionary required by conventional Hidden Markov Model based systems. Furthermore, the components of an E2E network can be jointly trained for a common objective criterion to achieve overall optimization. The main approaches for E2E ASR are attention-based encoder-decoder [1, 2, 3, 4, 5, 6], Connectionist Temporal Classification (CTC) [7, 8] and the hybrid CTC/attention architecture [9, 10].
The training of an E2E system requires a large amount of transcribed speech data, here denoted as labelled data, which is unavailable for low-resource languages. However, we note that large external text data can easily be collected. In this work, we focus on the use of external text data to improve the language model (LM) of E2E ASR systems.
In a vanilla E2E architecture, the decoder sub-network (subnet) incorporates the role of the LM. Unlike traditional ASR systems where the LM is separated and hence can easily be trained with text-only data, the decoder subnet is conditioned on the encoder output. As a result, it is not straightforward to update the LM component of the vanilla E2E architecture with the text data.
To address this problem, in this work, we introduce a new architecture which separates the decoder subnet from the encoder output, making the subnet an explicit LM. In this way, the subnet can easily be updated using text data. A potential issue, known as catastrophic forgetting [11], might occur when using external text to update the E2E network: the network forgets what it has learnt from the labelled data. We therefore study strategies that use both labelled and external text data to update the E2E network (since both labelled and external text data are used, we actually allow the entire E2E network to be updated).
The paper is organized as follows. Section 2 describes a vanilla architecture of E2E ASR systems. In Section 3, we first describe the proposed architecture, then present strategies to update the proposed architecture using external text data. Section 4 relates our proposed approach to prior work. Experimental setup and results are presented in Sections 5 and 6 respectively. Section 7 concludes our work.
2 A vanilla E2E ASR architecture
In this section, we describe a vanilla attention-based E2E ASR architecture, which is widely used in prior work [9, 10]. (In the actual implementation, we use the hybrid CTC/attention architecture [9, 10]; however, since the CTC module is untouched, we do not mention it in the rest of this paper for simplicity.) Let $(X, Y)$ be a training utterance from the labelled data, where $X$ is a sequence of acoustic features and $Y = (y_1, \ldots, y_L)$ is a sequence of output units.
The E2E architecture consists of an encoder and an attention-based decoder, as shown in Fig. 1(a). The encoder acts as an acoustic model which maps the acoustic features into an intermediate representation $h$. Then, the decoder subnet, which consists of an embedding layer, a Long Short-Term Memory (LSTM) and a projection layer, generates one output unit at each decoding step $t$ as follows:

$c_t = \mathrm{Attention}(s_{t-1}, h)$ (1)

$s_t = \mathrm{LSTM}(s_{t-1}, \mathrm{Embed}(y_{t-1}), c_t)$ (2)

$y_t \sim \mathrm{Proj}(s_t, c_t)$ (3)

where $c_t$ is the context vector, $s_{t-1}$ and $s_t$ are the decoder hidden states at time steps $t-1$ and $t$ respectively, and $\mathrm{Embed}$ and $\mathrm{Proj}$ are the embedding and projection layers respectively. The E2E network is normally trained in batch mode with the following loss function:

$\mathcal{L}_{E2E}(\mathcal{B}; \theta) = -\sum_{(X,Y) \in \mathcal{B}} \sum_{t} \log p(y_t \mid y_{<t}, X; \theta)$ (4)

where $y_{<t}$, $\mathcal{B}$ and $\theta$ denote the decoding output history, a batch of data and the model parameters respectively.
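As an illustration, one vanilla decoding step can be sketched in numpy as follows. This is a toy sketch only: the weights are random stand-ins for trained parameters, a tanh cell stands in for the LSTM, and plain dot-product attention stands in for the actual attention mechanism. The point it makes is that every quantity in the step depends on the encoder output $h$.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, T = 8, 5, 4  # hidden size, vocabulary size, number of encoder frames

# random stand-ins for trained parameters
W_embed = rng.normal(size=(V, D))
W_cell = 0.1 * rng.normal(size=(3 * D, D))   # acts on [s_prev; embedding; context]
W_proj = 0.1 * rng.normal(size=(2 * D, V))   # acts on [state; context]

def attention(s_prev, h):
    """Context vector: softmax-weighted sum of encoder frames."""
    scores = h @ s_prev
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ h

def decode_step(s_prev, y_prev, h):
    c_t = attention(s_prev, h)                     # context from encoder output
    x = np.concatenate([s_prev, W_embed[y_prev], c_t])
    s_t = np.tanh(x @ W_cell)                      # recurrent update uses c_t
    logits = np.concatenate([s_t, c_t]) @ W_proj   # projection also uses c_t
    return s_t, logits

h = rng.normal(size=(T, D))   # encoder output for one utterance
s_t, logits = decode_step(np.zeros(D), 0, h)
```

Without $h$ there is no way to form the context vector, which is exactly why this architecture cannot be updated with text-only data.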
According to Equations (1) and (2), the LSTM is conditioned on the context vector $c_t$, which depends on the encoder output $h$. In the absence of the acoustic features $X$, and thus of $h$, it is not possible to update the E2E architecture appropriately using only text data. One way to alleviate this problem is to set $c_t$ to an all-zero vector. Unfortunately, this method introduces a mismatch between the training phase and the updating phase (with external text data), since $c_t$ is generally not the all-zero vector during training.
3 Independently trainable LM subnet
To allow updating of the LM with external text data, we first introduce (Section 3.1) a novel architecture that separates the decoder subnet from the encoder output. The updating algorithm is described in Section 3.2.
3.1 Decoupling LM subnet
Inspired by the idea of spatial attention for image captioning [13], we propose to decouple the LM subnet from the encoder output as shown in Fig. 1(b). In this architecture, the decoding process is formally described as follows:

$s_t = \mathrm{LSTM}(s_{t-1}, \mathrm{Embed}(y_{t-1}))$ (5)

$c_t = \mathrm{Attention}(s_t, h)$ (6)

$y_t \sim \mathrm{Proj}(s_t, c_t)$ (7)
From Equation (5), the LSTM is conditioned only on the previous decoding hidden state and the previous decoding output. In other words, the decoder subnet is a standard LM, hereafter denoted as the LM subnet. As such, this subnet can be independently updated with external text data.
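The decoupled step can be sketched in the same toy style as before (random weights standing in for trained parameters, a tanh cell standing in for the LSTM, plain dot-product attention). The key property is that the hidden state is computed with no acoustic input at all, so the embedding/cell path can be trained on text-only data exactly like a standard LM.

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, T = 8, 5, 4  # hidden size, vocabulary size, number of encoder frames

W_embed = rng.normal(size=(V, D))
W_cell = 0.1 * rng.normal(size=(2 * D, D))   # acts on [s_prev; embedding] only
W_proj = 0.1 * rng.normal(size=(2 * D, V))   # acts on [state; context]

def attention(query, h):
    scores = h @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ h

def lm_step(s_prev, y_prev):
    """Eq. (5): the hidden state needs no encoder output."""
    x = np.concatenate([s_prev, W_embed[y_prev]])
    return np.tanh(x @ W_cell)

def decode_step(s_prev, y_prev, h):
    s_t = lm_step(s_prev, y_prev)                  # text-only recurrence
    c_t = attention(s_t, h)                        # context applied afterwards
    logits = np.concatenate([s_t, c_t]) @ W_proj
    return s_t, logits

s_text_only = lm_step(np.zeros(D), 0)  # runs with no acoustics at all
```

In ASR mode the same hidden state is simply combined with the attention context before projection, so the LM subnet is shared between the two modes.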
3.2 Updating the LM subnet with external text data
One issue when using external text to improve the LM is catastrophic forgetting: when the external text is used, the network forgets what it has learnt from the labelled data. To address this issue, we use both the labelled data and the external text to update the entire E2E network. Another issue is when the external text should be used, i.e. before or after the entire E2E ASR network is trained. We study two strategies to update the network, as presented in Fig. 2.
In Strategy 1, the entire E2E network is first trained using the labelled data. Then, in the second step, the network is fine-tuned with both the labelled data and the external text. Finally, the network is further fine-tuned with the labelled data alone. We empirically found that this last step improves system performance.
In Strategy 2, the LM subnet is pre-trained with the external text first. Then, the entire E2E network is trained using both the labelled data and the external text.
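The two schedules can be summarized as a schematic sketch. The `train` stub below only records which phase runs; in the real system each call would run optimization with the stated loss on the stated data.

```python
def train(model, phase, log):
    # stub: record the training phase; a real system would run SGD here
    log.append(phase)
    return model

def strategy_1(model, log):
    model = train(model, "E2E loss on labelled data", log)
    model = train(model, "joint loss on labelled data + external text", log)
    model = train(model, "E2E loss on labelled data (final fine-tune)", log)
    return model

def strategy_2(model, log):
    model = train(model, "LM loss on external text (pre-train LM subnet)", log)
    model = train(model, "joint loss on labelled data + external text", log)
    return model
```

Both strategies share the middle joint-training phase; they differ only in whether the external text is consumed before or after the network has seen the labelled data.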
In the second step of both strategies, i.e. when both the labelled data and the external text are used to update the E2E network, the following loss function is used:

$\mathcal{L}(\mathcal{B}, \mathcal{B}_T; \theta) = \mathcal{L}_{E2E}(\mathcal{B}; \theta) + \lambda \mathcal{L}_{LM}(\mathcal{B}_T; \theta_{LM})$ (8)

where $\lambda$ denotes an interpolation factor, $\theta_{LM}$ denotes all LM subnet parameters, and $\mathcal{L}_{LM}$ denotes the LM loss obtained when the external text data is used, i.e.:

$\mathcal{L}_{LM}(\mathcal{B}_T; \theta_{LM}) = -\sum_{Y \in \mathcal{B}_T} \sum_{t} \log p(y_t \mid y_{<t}; \theta_{LM})$ (9)

where $\mathcal{B}_T$ is a batch of external text data.
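The interpolated loss is simple to compute once the per-token log-probabilities are available; the sketch below takes them as given, and all helper names are illustrative.

```python
import math

def nll(batch_token_logprobs):
    """Sum of per-token negative log-probabilities over a batch of sequences."""
    return -sum(lp for seq in batch_token_logprobs for lp in seq)

def joint_loss(e2e_logprobs, lm_logprobs, lam):
    # E2E loss on the labelled batch plus lam times the LM loss
    # on the external-text batch
    return nll(e2e_logprobs) + lam * nll(lm_logprobs)

# toy example: two labelled sequences and one text-only sequence
e2e = [[math.log(0.5), math.log(0.25)], [math.log(0.5)]]
txt = [[math.log(0.5), math.log(0.5)]]
loss = joint_loss(e2e, txt, lam=0.5)
```

Setting the interpolation factor to zero recovers the labelled-only loss, which makes it easy to sweep the factor as a hyperparameter.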
4 Comparison with related work
There have been several studies on how to use external text data for E2E ASR. One idea is to use the external text data to build an external LM, then incorporate it into the inference process [7, 14] or employ it to re-score n-best output hypotheses. Our proposed approach, however, improves the language modeling capability of an E2E system without using any external LM. Another idea is data synthesis. Specifically, [15] used a text-to-speech system while [16] used a pronunciation dictionary and duration information to generate additional inputs given the external text, which are then used to train a correction model [15] or another encoder [16]. Our proposed approach uses external text data without involving external systems such as text-to-speech.
In natural language processing, exploiting text corpora to improve E2E systems is also widely used. A popular approach is to use a text corpus to pre-train the entire E2E network [17, 18]. Such techniques are only applicable to tasks where both input and output are in text format. Another approach is to pre-train only the decoder by simply removing the encoder [17, 19]. This is equivalent to zeroing out the context vector, which introduces a mismatch as discussed in Section 2.
The idea of separating the decoder from the encoder output has been introduced in the image captioning research community [13]. To the best of our knowledge, this work is the first attempt to apply it to the ASR task.
5 Experimental setup
5.1 Data

The HKUST corpus [20] consists of 171.1 hours of Mandarin Chinese conversational telephone speech from Mandarin speakers in mainland China. It is divided into a training set of 166.3 hours and a test set of 4.8 hours. We split the training data into 3 subsets: the first two serve as the labelled data and the external text in this paper, while the remaining subset is used for validation. The detailed information of these data sets is presented in Table 1. For the labelled data, we perform speed-perturbation based data augmentation [22]. We report Mandarin character error rate (CER) on the test set.
The NSC corpus [21] consists of 2,172.6 hours of English read microphone speech from 1,382 Singaporean speakers. We extract the data of 6 speakers as test data. Similar to the HKUST corpus, we split the remaining data into 3 subsets for the labelled data, the external text and validation. The detailed data division is shown in Table 1. We also perform data augmentation on the labelled data. We report word error rate (WER) on the test set.
Table 1: Details of the data sets.

|               | HKUST       |       | NSC         |       |
|               | #utterances | hours | #utterances | hours |
| Labelled data | 22,500      | 20.2  | 15,000      | 20.6  |
| External text | 158,605     | -     | 1,547,399   | -     |
5.2 E2E configuration
We use the ESPnet toolkit [23] to develop our E2E models. We use 80 mel-scale filterbank coefficients with pitch as input features. The encoder consists of 6 VGG layers and 6 BLSTM layers, each with 320 units. We use the location-aware attention mechanism [4]. Characters and Byte-Pair Encoding (BPE) units (500 units) are used as output units for the HKUST and NSC corpora respectively.
Since the external text has many more utterances than the labelled data, we set the batch size for the text data to 150 and 300 for HKUST and NSC respectively, larger than the batch size used for the labelled training data. The optimizer is the AdaDelta algorithm [25] with gradient clipping [26]. The same interpolation factor is used for both corpora. During decoding, we use a beam width of 30 for all conditions.
6 Experimental results
6.1 Independent LM architecture and updating strategies
In this section, we compare the vanilla architecture to the proposed architecture when they are trained using only the labelled data. We then compare the two updating strategies described in Section 3.2 when the external text data is used. To update the vanilla architecture with the external text data, we set the context vector to an all-zero vector as mentioned in Section 2. The results are presented in Fig. 3. We make the following observations.
The proposed architecture consistently outperforms the vanilla architecture on both the HKUST and NSC corpora. In particular, it achieves a 1.8% relative CER reduction (from 49.2% to 48.3%) on HKUST and a 4.3% relative WER reduction (from 39.9% to 38.2%) on NSC.
With external text data, Strategy 1 leads to significant error rate reductions for both architectures. For example, on the proposed architecture, we observe a 14.4% relative WER reduction (from 38.2% to 32.7%) on the NSC corpus. We also observe that the proposed architecture outperforms the vanilla one in all cases, which indicates that it benefits more from the external text.
Strategy 2 generally outperforms Strategy 1 when applied to the proposed architecture. Strategy 2 achieves the best results on the two corpora, i.e. 43.8% CER on HKUST and 29.5% WER on NSC (9.3% relative CER and 22.8% relative WER reductions over the proposed architecture trained on labelled data only). We will use this best-performing system for the experiments in the next section.
6.2 Interaction with external LM
In this section, we examine whether the proposed approach still benefits from an external LM. Specifically, we train a Recurrent Neural Network LM (RNN-LM), a 1-layer LSTM with 1000 cells, for both corpora, then integrate the RNN-LM into the inference process of the vanilla architecture and of the best system from Section 6.1. We also examine the effect of varying the amount of labelled data from 20 hours to 60 hours. We conduct this experiment only on the NSC corpus since HKUST is relatively small. Results are reported in Fig. 4.
We observe that the external RNN-LM improves the best system by 0.2% absolute CER and 4.5% absolute WER on 20 hours of HKUST and NSC data respectively. These results indicate that our proposed approach benefits from the external LM. Additionally, we observe consistent improvements for different amounts of labelled data on NSC, which demonstrates that our proposed architecture works well under varying amounts of labelled data.
7 Conclusions

We introduced a new architecture that separates the decoder subnet from the encoder output so that the subnet can easily be updated using external text data. Experimental results showed that the new architecture not only outperforms the vanilla architecture when only labelled data is used, but also benefits from the external text data. We studied two strategies to update the E2E network and found that pre-training the LM subnet with the text data, then fine-tuning the entire E2E network using both labelled and text data, achieves the best results. Further analyses showed that the proposed architecture can be augmented with an external LM for further improvement and generalizes across different amounts of labelled data.
Acknowledgements

This work is supported by the project of Alibaba-NTU Singapore Joint Research Institute.
References

[1] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell: A neural network for large vocabulary conversational speech recognition," in Proc. of ICASSP, 2016, pp. 4960–4964.
[2] Dzmitry Bahdanau et al., "End-to-end attention-based large vocabulary speech recognition," in Proc. of ICASSP, 2016, pp. 4945–4949.
[3] Chung-Cheng Chiu et al., "State-of-the-art speech recognition with sequence-to-sequence models," in Proc. of ICASSP, 2018, pp. 4774–4778.
[4] Jan Chorowski et al., "Attention-based models for speech recognition," in Proc. of NIPS, 2015, pp. 577–585.
[5] Jan Chorowski and Navdeep Jaitly, "Towards better decoding and language model integration in sequence to sequence models," in Proc. of INTERSPEECH, 2017, pp. 523–527.
[6] Rohit Prabhavalkar et al., "A comparison of sequence-to-sequence models for speech recognition," in Proc. of INTERSPEECH, 2017, pp. 939–943.
[7] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. of ICML, 2014, pp. 1764–1772.
[8] Dario Amodei et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in Proc. of ICML, 2015, pp. 173–182.
[9] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, "Joint CTC-attention based end-to-end speech recognition using multi-task learning," in Proc. of ICASSP, 2017, pp. 4835–4839.
[10] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[11] Benedikt Pfülb, Alexander Gepperth, S. Abdullah, and A. Kilian, "Catastrophic forgetting: still a problem for DNNs," CoRR, vol. abs/1905.08077, 2019.
[12] Yu-An Chung, Yuxuan Wang, Wei-Ning Hsu, Yu Zhang, and R. J. Skerry-Ryan, "Semi-supervised training for improving data efficiency in end-to-end speech synthesis," CoRR, vol. abs/1808.10128, 2018.
[13] J. Lu, C. Xiong, D. Parikh, and R. Socher, "Knowing when to look: Adaptive attention via a visual sentinel for image captioning," in Proc. of CVPR, 2017, pp. 3242–3250.
[14] Takaaki Hori, Jaejin Cho, and Shinji Watanabe, "End-to-end speech recognition with word-based RNN language models," in Proc. of SLT, 2018, pp. 389–396.
[15] Jinxi Guo, Tara N. Sainath, and Ron J. Weiss, "A spelling correction model for end-to-end speech recognition," in Proc. of ICASSP, 2019, pp. 5651–5655.
[16] Adithya Renduchintala, Shuoyang Ding, Matthew Wiesner, and Shinji Watanabe, "Multi-modal data augmentation for end-to-end ASR," in Proc. of INTERSPEECH, 2018, pp. 2394–2398.
[17] Andrew M. Dai and Quoc V. Le, "Semi-supervised sequence learning," in Proc. of NIPS, 2015, pp. 3079–3087.
[18] Kaitao Song et al., "MASS: Masked sequence to sequence pre-training for language generation," in Proc. of ICML, 2019, pp. 5926–5936.
[19] Prajit Ramachandran, Peter J. Liu, and Quoc V. Le, "Unsupervised pretraining for sequence to sequence learning," in Proc. of EMNLP, 2017, pp. 383–391.
[20] Yi Liu, Pascale Fung, Yongsheng Yang, Christopher Cieri, Shudong Huang, and David Graff, "HKUST/MTS: A very large scale Mandarin telephone speech corpus," in Proc. of ISCSLP, 2006, pp. 724–735.
[21] Jia Xin Koh et al., "Building the Singapore English national speech corpus," in Proc. of INTERSPEECH, 2019, pp. 321–325.
[22] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur, "Audio augmentation for speech recognition," in Proc. of INTERSPEECH, 2015, pp. 3586–3589.
[23] Shinji Watanabe et al., "ESPnet: End-to-end speech processing toolkit," in Proc. of INTERSPEECH, 2018, pp. 2207–2211.
[24] Takaaki Hori, Shinji Watanabe, Yu Zhang, and William Chan, "Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM," in Proc. of INTERSPEECH, 2017, pp. 949–953.
[25] Matthew D. Zeiler, "ADADELTA: An adaptive learning rate method," CoRR, vol. abs/1212.5701, 2012.
[26] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, "Understanding the exploding gradient problem," CoRR, vol. abs/1211.5063, 2012.