End-to-End Speech Processing Toolkit
This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains the major architecture of this software platform, several important functionalities that differentiate ESPnet from other open source ASR toolkits, and experimental results on major ASR benchmarks.
Automatic speech recognition (ASR) has become a mature technology through extensive research and development efforts, mainly in the speech processing community. In particular, these efforts have been driven by popular products including Google voice search, Amazon Alexa, and Apple Siri, and by open source activities including Kaldi, HTK, Sphinx, Julius, and RASR, in addition to general research activities. These open source toolkits include feature extraction, acoustic modeling based on hidden Markov models (HMMs), Gaussian mixture models, and deep neural networks (DNNs), and decoding (language modeling is often performed by external language model toolkits, for example SRILM), and they enable us to use a full set of state-of-the-art ASR research and development achievements.
This paper describes a new open source toolkit named ESPnet (End-to-end speech processing toolkit), which aims to provide a neural end-to-end platform for ASR and other speech processing. Unlike the above open source tools based on hybrid DNN/HMM architectures, ESPnet provides a single neural network architecture to perform speech recognition in an end-to-end manner. ESPnet adopts widely used dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines. ESPnet also follows the style of the Kaldi ASR toolkit for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.
ESPnet fully utilizes the benefits of the two major end-to-end ASR implementations based on connectionist temporal classification (CTC) [10, 11, 12] and attention-based encoder-decoder networks [13, 14, 15, 16]. Attention-based methods use an attention mechanism to perform alignment between acoustic frames and recognized symbols, while CTC uses Markov assumptions to efficiently solve sequential problems by dynamic programming. ESPnet adopts hybrid CTC/attention end-to-end ASR, which effectively utilizes the advantages of both architectures in training and decoding. During training, we employ a multi-objective learning framework to improve robustness against irregular alignments and achieve fast convergence. During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments.
In addition to the above basic architecture, ESPnet supports a number of end-to-end ASR techniques, including fusion with a recurrent neural network language model (RNNLM), fast CTC computation using the warp-ctc library, and many variations of attention methods. With these state-of-the-art end-to-end ASR techniques, ESPnet also provides a number of recipes for major ASR benchmarks including Wall Street Journal (WSJ), Librispeech, TED-LIUM, the Corpus of Spontaneous Japanese (CSJ), AMI, HKUST Mandarin CTS, VoxForge, CHiME-4/5 [25, 26], etc. Thus, ESPnet provides publicly available state-of-the-art end-to-end ASR setups, which aim to accelerate the development of this emerging field. This paper describes its basic architecture, functionalities, and benchmark results. Note that on several benchmarks including HKUST and CSJ, the performance is comparable or superior to state-of-the-art hybrid DNN/HMM systems based on lattice-free maximum mutual information training.
This section mainly focuses on comparing ESPnet with publicly available toolkits within an end-to-end ASR framework. These toolkits can be categorized into two types: those based on CTC and those based on attention architectures.
Note that most end-to-end ASR toolkits are based on CTC, while ESPnet is based on an attention-based encoder-decoder network. Compared with Attention-LVCSR and OpenNMT, ESPnet has more ASR-specific functions, including hybrid CTC/attention to deal with monotonic attention, the use of an RNNLM during decoding, and a number of Kaldi-style ASR recipes, which make ESPnet unique among these toolkits.
Figure 1 shows the software architecture of ESPnet. In ESPnet, the main neural network training and recognition parts are written in Python, which calls Chainer or PyTorch by switching a backend option. We also provide complete recipes to perform ASR experiments, which are written as bash scripts in the Kaldi manner. The following sections describe several functions that distinguish ESPnet from other existing toolkits.
ESPnet tightly integrates its data preprocessing with Kaldi so that 1) we can fairly compare the performance of Kaldi hybrid systems with ESPnet end-to-end systems and 2) we can make use of the data preprocessing developed in Kaldi recipes. ESPnet also uses Kaldi feature extraction for most recipes, although multichannel end-to-end ASR includes speech enhancement and feature extraction within its network.
The default encoder network is a bidirectional long short-term memory (BLSTM) network with subsampling (called pyramid BLSTM), which, given a $T$-length speech feature sequence $O = (o_1, \dots, o_T)$, extracts a high-level feature sequence $H = (h_1, \dots, h_{T'})$ as
$$H = \mathrm{BLSTM}(O),$$
where $T' < T$ in general due to the subsampling. The Chainer backend also supports a convolutional neural network based on the initial two blocks of the VGG network (VGG2) followed by BLSTM layers, inspired by [33, 34], that is,
$$H = \mathrm{BLSTM}(\mathrm{VGG2}(O)).$$
This yields better performance than the pyramid BLSTM in many cases.
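The effect of subsampling on the sequence length can be sketched in a few lines. The following is a minimal illustration (not ESPnet's implementation), assuming simple frame skipping between encoder layers; keeping every second frame halves the sequence length, which is why $T' < T$:

```python
import numpy as np

# Minimal sketch of pyramid-style frame subsampling between encoder layers:
# keeping every `factor`-th frame shortens the sequence and reduces the
# computational cost of the upper BLSTM layers.
def subsample(H, factor=2):
    # H: (T, d) feature sequence -> roughly (T / factor, d)
    return H[::factor]
```

Applying this twice to a 100-frame input leaves 25 frames, a four-fold reduction in the length the attention mechanism must align over.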
ESPnet uses a location-aware attention mechanism as the default attention. A dot-product attention is also supported. While the location-aware attention yields better performance, the dot-product attention is much faster in terms of computational cost. In addition to the above attention mechanisms, the PyTorch backend supports more than 11 types of attention functions, including additive attention, a coverage mechanism, and multi-head attention.
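The dot-product variant can be written compactly. The following is a hedged numpy sketch, not ESPnet's code: the decoder state queries the encoder outputs, and the softmax-normalized scores weight a sum over encoder frames:

```python
import numpy as np

# Minimal sketch of dot-product attention: the decoder state `query` is
# matched against each encoder frame in H, and the softmax weights form
# a context vector as a weighted sum of encoder frames.
def dot_product_attention(query, H):
    # query: (d,) decoder state; H: (T, d) encoder outputs
    scores = H @ query                  # (T,) alignment energies
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()                        # attention weights, sum to 1
    return w @ H, w                     # context (d,), weights (T,)
```

Location-aware attention additionally convolves the previous attention weights into the scoring function, which encourages the monotonic alignments that speech exhibits; that extra machinery is what makes it slower.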
ESPnet adopts the hybrid CTC/attention end-to-end ASR architecture, which effectively utilizes the advantages of both approaches in training and decoding.
During training, we employ a multi-objective learning framework by combining the CTC and attention-based cross-entropy objectives to improve robustness and achieve fast convergence, as follows:
$$\mathcal{L} = \lambda \mathcal{L}_{\mathrm{CTC}} + (1 - \lambda) \mathcal{L}_{\mathrm{att}}.$$
This training method shares the same encoder between the CTC and attention decoder networks. We have one tuning parameter $\lambda \in [0, 1]$ to linearly interpolate both objective functions, usually set as $\lambda = 0.5$ (equal contributions).
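The interpolation above can be sketched directly. The code below is an illustration under stated assumptions, not ESPnet's API: `attention_xent` is a toy per-label cross entropy over log-softmax outputs, and the CTC loss is taken as a given scalar (computing it requires the full forward algorithm, omitted here):

```python
import numpy as np

# Illustrative attention-branch cross entropy: average negative
# log-probability of the reference labels.
def attention_xent(log_probs, targets):
    # log_probs: (L, V) log-softmax outputs; targets: (L,) label ids
    return -np.mean(log_probs[np.arange(len(targets)), targets])

# Multi-objective training loss: linear interpolation of the CTC and
# attention losses with weight lam (lambda in the text).
def hybrid_objective(loss_ctc, loss_att, lam=0.5):
    assert 0.0 <= lam <= 1.0
    return lam * loss_ctc + (1.0 - lam) * loss_att
```

Because both branches share the encoder, a single backward pass through this combined scalar trains the encoder with gradients from both objectives.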
To alleviate overfitting, label smoothing techniques are available during training; they smooth the target distribution by dividing the probability mass between the correct label and the remaining labels in a certain ratio. We implemented unigram smoothing, where the distribution over the remaining labels is set to be proportional to the unigram distribution of the labels.
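As a sketch (illustrative names, not ESPnet's code), unigram smoothing mixes each one-hot target with the unigram label distribution:

```python
import numpy as np

# Unigram label smoothing sketch: a fraction `alpha` of each one-hot
# target's probability mass is redistributed in proportion to the unigram
# distribution of the labels estimated from the training transcriptions.
def unigram_smooth(one_hot, unigram, alpha=0.05):
    # one_hot: (L, V) targets; unigram: (V,) distribution summing to 1
    return (1.0 - alpha) * one_hot + alpha * unigram
```

Each smoothed row still sums to one, but the model is no longer pushed toward assigning probability exactly 1 to the reference label, which reduces over-confidence.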
CTC computation is one of the dominant parts of the whole training time. We use the warp-ctc library for both the Chainer and PyTorch backends, which yields a 5-10% improvement in total training time compared with the built-in CTC of the Chainer backend.
During decoding, we perform joint decoding by combining both attention-based and CTC scores in a one-pass beam search algorithm to further eliminate irregular alignments. Let $c_l$ be a hypothesis of the output label at position $l$ given the history $c_{1:l-1}$ and the encoder output $H$. The following combination of the attention and CTC log probabilities is performed during the beam search:
$$\log p(c_l | c_{1:l-1}, H) = \alpha \log p_{\mathrm{ctc}}(c_l | c_{1:l-1}, H) + (1 - \alpha) \log p_{\mathrm{att}}(c_l | c_{1:l-1}, H),$$
where $\alpha$ is a tunable interpolation weight.
This hybrid CTC/attention architecture (multi-objective learning during training and joint decoding during recognition) was proposed in our earlier work, and is a unique function compared with other end-to-end ASR systems.
One of the most demanded functions in attention-based end-to-end ASR is the use of a language model trained on large text corpora. ESPnet can combine the log probability of an RNNLM during decoding as follows:
$$\log p(c_l | c_{1:l-1}, H) = \alpha \log p_{\mathrm{ctc}}(c_l | c_{1:l-1}, H) + (1 - \alpha) \log p_{\mathrm{att}}(c_l | c_{1:l-1}, H) + \beta \log p_{\mathrm{lm}}(c_l | c_{1:l-1}),$$
where $\beta$ is an additional scaling parameter. This method corresponds to a shallow fusion of the decoder network and the RNNLM, originally proposed in neural machine translation and later applied to end-to-end speech recognition.
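The per-label score combination and one expansion step of the beam search can be sketched as follows. This is a hedged illustration, not ESPnet's decoder: `score_fn` stands in for the real model (attention decoder, CTC prefix scorer, and RNNLM evaluated on a hypothesis prefix), and the function names are assumptions:

```python
import numpy as np

# One-pass score combination: attention and CTC log-probabilities are
# interpolated with weight `alpha`; the RNNLM is shallow-fused with
# scale `beta`.
def combined_score(logp_att, logp_ctc, logp_lm, alpha=0.3, beta=0.3):
    # each argument: (V,) log-probabilities over the output vocabulary
    return alpha * logp_ctc + (1.0 - alpha) * logp_att + beta * logp_lm

# One expansion step of an output-synchronous beam search.
def beam_step(hyps, score_fn, beam=5):
    # hyps: list of (prefix, score); score_fn(prefix) -> (V,) combined scores
    cands = []
    for prefix, score in hyps:
        scores = score_fn(prefix)
        for v in np.argsort(scores)[::-1][:beam]:
            cands.append((prefix + [int(v)], score + float(scores[v])))
    cands.sort(key=lambda h: h[1], reverse=True)
    return cands[:beam]
```

Because all three scores are combined inside each expansion step, irregular (non-monotonic) attention hypotheses are penalized by the CTC term before they can dominate the beam.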
Although most of the ASR recipes supported in ESPnet are standard English tasks, the current ESPnet recipes also deal with other languages, including Japanese (CSJ), Mandarin Chinese (HKUST CTS), and several European languages through VoxForge. With these various recipes, ESPnet can also realize a multilingual end-to-end ASR system (e.g., covering 10 languages) by following our previous study. In addition, the ESPnet recipes include noise-robust/far-field speech recognition tasks such as AMI, CHiME-4, and CHiME-5. In particular, ESPnet is an official end-to-end ASR baseline for the CHiME-5 challenge.
The flow of the standard recipes in ESPnet is significantly simplified thanks to the benefits of end-to-end ASR: a recipe does not have to include lexicon preparation, finite state transducer (FST) compilation, training/alignment based on HMM and Gaussian mixture modeling, or lattice generation for sequence-discriminative training.
The standard recipe includes the following six stages in run.sh (several recipes, including AMI, Librispeech, TED-LIUM, and VoxForge, have an additional data downloading stage, stage -1):
Data preparation: We adopt the Kaldi data directory format, so we can simply reuse the Kaldi data preparation scripts.
Feature extraction: Again, we use the Kaldi feature extraction. Most recipes use an 80-dimensional log Mel feature together with pitch features (83 dimensions in total).
Data preparation for ESPnet: This stage converts all the information included in the Kaldi data directory (transcriptions, speaker and language IDs, and input and output lengths) into one JSON file (data.json), except for the input features.
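The shape of such a conversion can be sketched in a few lines. The key names below are illustrative assumptions, not ESPnet's exact schema: the point is that all per-utterance metadata ends up in one JSON object keyed by utterance id:

```python
import json

# Illustrative (hypothetical schema) collection of Kaldi-style metadata
# into a single JSON blob keyed by utterance id; input features stay in
# their Kaldi archives and are referenced separately.
utts = {
    "utt001": {
        "text": "HELLO WORLD",    # transcription
        "spk": "speaker1",        # speaker id
        "lang": "en",             # language id
        "ilen": 438,              # input (feature) length
        "olen": 11,               # output (label) length
    }
}
blob = json.dumps({"utts": utts}, indent=2)
```

In the actual recipe this file is written once per data set and then read by the training and recognition scripts, so all backends see a uniform view of the data.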
Language model training: A character-based RNNLM is trained using either the Chainer or PyTorch backend. This stage is optional, and several recipes do not include it.
End-to-end ASR training: A hybrid CTC/attention-based encoder-decoder network is trained using either the Chainer or PyTorch backend.
Recognition: Speech recognition is performed using the RNNLM and the end-to-end ASR model obtained in stages 3 and 4, respectively.
In addition to simplifying the experimental stages, ESPnet also keeps its code base compact. Table 1 compares the main source code sizes of Kaldi, Julius, and ESPnet. ESPnet realizes speech recognition, including trainer and recognizer functions, in only about 5K lines of Python code, thanks to the simplifications of end-to-end ASR and the use of Chainer or PyTorch as neural network backends and Kaldi for data preparation and feature extraction (note that since Kaldi and Julius have various functions that ESPnet lacks, including online real-time modes and Windows interfaces, the source code sizes cannot be compared directly).
One of the most simplified modules is the model representation part, since it does not have to explicitly represent the complicated speech recognition hierarchy from speech features, HMM states, context-dependent phonemes, and lexicons to words. This hierarchy is represented by a single neural network with at most a thousand lines of Python code. This also simplifies the recognition module to at most five hundred lines, as it is realized by a simple output-synchronous beam search.
This section discusses the experimental results on our three main tasks: WSJ, CSJ, and HKUST. The first experiment shows the effectiveness of ESPnet on the well-known WSJ task with several experimental configurations, and also compares it with published results on the same task within an end-to-end ASR framework. The other experiments compare the performance of ESPnet with state-of-the-art ASR systems on the CSJ and HKUST tasks. The main reason for choosing these two languages is that these ideogram languages have relatively shorter letter-sequence lengths than alphabetic languages, which greatly reduces the computational complexity and makes it easy to handle context information in a decoder network. Indeed, our prior investigation shows that Japanese and Mandarin Chinese end-to-end ASR can be easily scaled up, and achieves reasonable performance without the various tricks developed for large-scale English tasks.
Method                      Metric   dev93   eval92
ESPnet with VGG-BLSTM       CER      10.1    7.6
+ BLSTM layers (4 -> 6)     CER      8.5     5.9
+ joint decoding            CER      5.5     3.8
+ label smoothing           CER      5.3     3.6
seq2seq + CNN (no LM)       WER      N/A     10.5
seq2seq + FST word LM       CER      N/A     3.9
CTC + FST word LM           WER      N/A     7.3
Method              Wall Clock Time   # GPUs
ESPnet (Chainer)    20 hours          1
ESPnet (PyTorch)    5 hours           1
seq2seq + CNN       120 hours         10
Table 2 compares the performance of ESPnet with different techniques on the WSJ task. The use of a deeper encoder network, the integration of a character-based LSTMLM, and joint CTC/attention decoding steadily improved the performance. Table 2 also compares the results of ESPnet with other reports. Since these reports are based on different conditions (e.g., the seq2seq + CNN system does not use any language model, while the others use a word-based language model through an FST), we cannot compare them directly, but we can state that ESPnet provides reasonable performance relative to these prior studies. Table 2 also provides the computation time of the main end-to-end ASR network training together with the number of GPUs. ESPnet achieved very fast training, especially with the PyTorch backend, even on a single GPU (GTX 1080 Ti), compared with prior work on the same WSJ task.
However, one issue with these end-to-end ASR systems is that their performance does not reach that of the state-of-the-art hybrid HMM/DNN systems. For example, the WER of hybrid HMM/DNN systems on the WSJ task is below 5%, and this degradation probably comes from the limited amount of training data. Indeed, some reports show performance comparable or superior to state-of-the-art hybrid HMM/DNN systems on very large English tasks, although such results are usually out of reach for much of the research community due to the lack of computational resources. Therefore, scaling up the English tasks while keeping computational cost low, or improving performance by mitigating the data sparseness issue, is one of our important future directions.
Table 3 (CSJ, CER %):
Method                    eval1   eval2   eval3
ESPnet (5 GPUs)           8.5     6.1     6.8
HMM/DNN (Kaldi nnet1)     9.0     7.2     9.6

Table 4 (HKUST, CER %):
Method                      CER
HMM/LSTM (Kaldi nnet3)      33.5
CTC with language model     34.8
HMM/TDNN, LF-MMI            28.2
Compared with the English tasks, end-to-end ASR systems can more easily achieve performance comparable to state-of-the-art hybrid HMM/DNN systems on the Japanese and Mandarin Chinese tasks. Note that ESPnet does not use lexical information (a pronunciation dictionary or morphological analyzer), which are essential components of the HMM/DNN and CTC-syllable systems. Tables 3 and 4 compare the best ESPnet system (i.e., VGG-BLSTM, character RNNLM, and joint decoding) with hybrid HMM/DNN systems. Notably, ESPnet almost reached the latest best performance of the HMM/DNN system with lattice-free MMI training on the HKUST task.
This paper introduced a new end-to-end ASR toolkit named ESPnet. ESPnet fully utilizes dynamic neural network toolkits, Chainer and PyTorch, as its main deep learning engines, and greatly simplifies training and recognition in the whole ASR pipeline. A number of experiments and comparisons with other reports show that ESPnet achieves reasonable ASR performance, and also reaches performance comparable to state-of-the-art HMM/DNN systems built with a legacy setup. ESPnet is under active development, and a multi-GPU function, data augmentation, a multi-head decoder, multichannel end-to-end ASR, and Babel multilingual ASR experiments are in preparation. In particular, with the multi-GPU function (5 GPUs), ESPnet finished training on the 581 hours of the CSJ task in only 26 hours.