Transformer is a sequence-to-sequence (S2S) architecture originally proposed for neural machine translation (NMT) [VaswaniNIPS2017_7181] that has rapidly replaced recurrent neural networks (RNN) in natural language processing tasks. This paper provides intensive comparisons of its performance with that of RNN for speech applications: automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).
One of the major difficulties when applying Transformer to speech applications is that it requires more complex configurations (e.g., optimizer, network structure, data augmentation) than the conventional RNN-based models. Our goal is to share our knowledge on the use of Transformer in speech tasks so that the community can fully build on our exciting outcomes.
Currently, existing Transformer-based speech applications [speech-transformer, CrossVila2018, li2019close] still lack an open source toolkit and reproducible experiments, while previous studies in NMT [ott-etal-2018-scaling, tensor2tensor-W18-1819] provide them. Therefore, we work on an open community-driven project for end-to-end speech applications using both Transformer and RNN, following the success of Kaldi for hidden Markov model (HMM)-based ASR [kaldi]. Specifically, our experiments provide practical guides for tuning Transformer in speech tasks to achieve state-of-the-art results.
In our speech application experiments, we investigate several aspects of Transformer and RNN-based systems. For example, we measure the word/character/regression error from the ground truth, training curve, and scalability for multiple GPUs.
The contributions of this work are:
We conduct a large-scale comparative study on Transformer and RNN with significant performance gains, especially for the ASR-related tasks.
We explain our training tips for Transformer in speech applications: ASR, TTS and ST.
We provide reproducible end-to-end recipes and models pretrained on a large number of publicly available datasets. (Footnote 1: After the double-blind review, our recipes will be available at https://github.com/espnet/espnet.)
As Transformer was originally proposed as an NMT system [VaswaniNIPS2017_7181], it has been widely studied on NMT tasks including hyperparameter search [DBLP:journals/pbml/PopelB18], parallelism implementation [ott-etal-2018-scaling] and in comparison with RNN [lakew-etal-2018-comparison]. On the other hand, speech processing tasks have just provided their preliminary results in ASR [speech-transformer, Zhou2018], ST [CrossVila2018] and TTS [li2019close]. Therefore, this paper aims to gather the previous basic research and to explore wider topics (e.g., accuracy, speed, training tips) in our experiments.
2 Sequence-to-sequence RNN
2.1 Unified formulation for S2S
S2S is a variant of neural networks that learns to transform a source sequence to a target sequence [s2s_NIPS2014_5346]. In Fig. 1, we illustrate a common S2S structure for ASR, TTS and ST tasks. S2S consists of two neural networks: an encoder
$$X_0 = \mathrm{EncPre}(X), \tag{1}$$
$$X_e = \mathrm{EncBody}(X_0), \tag{2}$$
and a decoder
$$Y_0[1:t-1] = \mathrm{DecPre}(Y[1:t-1]), \tag{3}$$
$$Y_d[t] = \mathrm{DecBody}(X_e, Y_0[1:t-1]), \tag{4}$$
$$Y_{\mathrm{post}}[1:t] = \mathrm{DecPost}(Y_d[1:t]), \tag{5}$$
where $X$ is the source sequence (e.g., a sequence of speech features (for ASR and ST) or characters (for TTS)), $e$ is the number of layers in EncBody, $d$ is the number of layers in DecBody, $t$ is a target frame index, and all the functions in the above equations are implemented by neural networks. For the decoder input $Y[1:t-1]$, we use a ground-truth prefix in the training stage, while we use a generated prefix in the decoding stage. During training, the S2S model learns to minimize the scalar loss value
$$L = \mathrm{Loss}(Y_{\mathrm{post}}, Y) \tag{6}$$
between the generated sequence $Y_{\mathrm{post}}$ and the target sequence $Y$.
The remainder of this section describes RNN-based universal modules: “EncBody” and “DecBody”. We regard “EncPre”, “DecPre”, “DecPost” and “Loss” as task-specific modules and we describe them in the later sections.
2.2 RNN encoder
EncBody in Eq. (2) transforms a source sequence into an intermediate sequence. Existing RNN-based implementations [Bahdanau15, Chan2016, DBLP:conf/icassp/ShenPWSJYCZWRSA18] typically adopt a bi-directional long short-term memory (BLSTM) that can perform such an operation thanks to its recurrent connection. For ASR, an encoded sequence can also be used for source-level frame-wise prediction using connectionist temporal classification (CTC) [ctc-DBLP:conf/icml/GravesFGS06] for joint training and decoding [hori2018end].
2.3 RNN decoder
DecBody in Eq. (4) generates the next target frame from the encoded sequence and the prefix of the target sequence. For sequence generation, the decoder is mostly unidirectional. For example, a uni-directional LSTM with an attention mechanism [Bahdanau15] is often used in RNN-based implementations. That attention mechanism emits source frame-wise weights to sum the encoded source frames as a target frame-wise vector to be transformed with the prefix. We refer to this type of attention as “source attention”.
3 Transformer
Transformer learns sequential information via a self-attention mechanism instead of the recurrent connection employed in RNN. This section describes the self-attention based modules in Transformer in detail.
3.1 Multi-head attention
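Transformer consists of multiple dot-attention layers [luong-dot-att-D15-1166]:
$$\mathrm{att}(X^{q}, X^{k}, X^{v}) = \mathrm{softmax}\left(\frac{X^{q} X^{k\top}}{\sqrt{d^{\mathrm{att}}}}\right) X^{v}, \tag{7}$$
where $X^{k}, X^{v} \in \mathbb{R}^{n^{k} \times d^{\mathrm{att}}}$ and $X^{q} \in \mathbb{R}^{n^{q} \times d^{\mathrm{att}}}$ are inputs for this attention layer, $d^{\mathrm{att}}$ is the number of feature dimensions, $n^{q}$ is the length of $X^{q}$, and $n^{k}$ is the length of $X^{k}$ and $X^{v}$. We refer to $X^{q} X^{k\top}$ as the “attention matrix”. Vaswani et al. [VaswaniNIPS2017_7181] considered these inputs $X^{q}$ and $(X^{k}, X^{v})$ to be a query and a set of key-value pairs, respectively.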
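As an illustration of the dot-attention layer described above, the following is a minimal pure-Python sketch, assuming the standard scaled dot-product form (softmax over query-key scores divided by the square root of the feature dimension). The function names `softmax` and `dot_attention` are hypothetical and are not taken from any toolkit.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(xq, xk, xv):
    """Scaled dot-product attention on lists of lists.

    xq: n_q x d query frames, xk: n_k x d key frames, xv: n_k x d value frames.
    Returns n_q output frames, each a weighted sum of the value frames.
    """
    scale = math.sqrt(len(xq[0]))
    n_out = len(xv[0])
    outputs = []
    for q in xq:
        # One row of the attention matrix: query-key similarities.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in xk]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, xv)) for c in range(n_out)]
        )
    return outputs
```

A zero query gives uniform weights, so the output is the plain average of the value frames; a query that strongly matches one key returns approximately that key's value.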
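In addition, to allow the model to deal with multiple attentions in parallel, Vaswani et al. [VaswaniNIPS2017_7181] extended this attention layer in Eq. (7) to multi-head attention (MHA):
$$\mathrm{MHA}(Q, K, V) = [H_1, H_2, \dots, H_{d^{\mathrm{head}}}] W^{\mathrm{head}}, \tag{8}$$
$$H_h = \mathrm{att}(Q W_h^{q}, K W_h^{k}, V W_h^{v}), \tag{9}$$
where $\mathrm{att}(\cdot)$ denotes the attention layer in Eq. (7), $K, V$ and $Q$ are inputs for this MHA layer, $H_h$ is the $h$-th attention layer output ($h = 1, \dots, d^{\mathrm{head}}$), $W_h^{q}, W_h^{k}, W_h^{v}$ and $W^{\mathrm{head}}$ are learnable weight matrices and $d^{\mathrm{head}}$ is the number of attentions in this layer.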
3.2 Self-attention encoder
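We define the Transformer-based EncBody used for Eq. (2), unlike the RNN encoder in Section 2.2, as follows:
$$X_i' = X_i + \mathrm{MHA}_i(X_i, X_i, X_i), \qquad X_{i+1} = X_i' + \mathrm{FF}_i(X_i'), \tag{10}$$
where $i$ is the index of encoder layers, and $\mathrm{FF}_i$ is the $i$-th two-layer feedforward network:
$$\mathrm{FF}(X[t]) = \mathrm{ReLU}(X[t] W_1^{\mathrm{ff}} + b_1^{\mathrm{ff}}) W_2^{\mathrm{ff}} + b_2^{\mathrm{ff}}, \tag{11}$$
where $X[t]$ is the $t$-th frame of the input sequence $X$, $W_1^{\mathrm{ff}}, W_2^{\mathrm{ff}}$ are learnable weight matrices, and $b_1^{\mathrm{ff}}, b_2^{\mathrm{ff}}$ are learnable bias vectors. We refer to $\mathrm{MHA}_i(X_i, X_i, X_i)$ in Eq. (10) as “self attention”.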
3.3 Self-attention decoder
Transformer-based DecBody used for Eq. (4) consists of two attention modules:
$$Y_j[t]' = Y_j[t] + \mathrm{MHA}_j^{\mathrm{self}}(Y_j[t], Y_j[1:t], Y_j[1:t]),$$
$$Y_j'' = Y_j' + \mathrm{MHA}_j^{\mathrm{src}}(Y_j', X_e, X_e),$$
$$Y_{j+1} = Y_j'' + \mathrm{FF}_j(Y_j''), \tag{12}$$
where $j$ is the index of the decoder layers and $X_e$ is the encoder output. We refer to the attention matrix between the decoder input and the encoder output in $\mathrm{MHA}_j^{\mathrm{src}}(Y_j', X_e, X_e)$ as “source attention”, the same as the one in RNN in Section 2.3. Because the unidirectional decoder is useful for sequence generation, its attention matrices at the $t$-th target frame are masked so that they do not connect with future frames later than $t$. This masking of the sequence can be done in parallel using an elementwise product with a triangular binary matrix. Because it requires no sequential operation, it provides a faster implementation than RNN.
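The masking with a triangular binary matrix can be sketched as follows. This is a minimal pure-Python illustration, assuming the mask is applied to the exponentiated scores before normalization so that future frames receive exactly zero weight; the names `causal_mask` and `masked_softmax` are hypothetical.

```python
import math


def causal_mask(n):
    """Lower-triangular binary matrix: frame t may attend to frames 0..t."""
    return [[1 if k <= t else 0 for k in range(n)] for t in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax where masked-out positions get zero weight.

    scores and mask are n x n lists of lists; mask entries are 0 or 1.
    """
    weights = []
    for row, mask_row in zip(scores, mask):
        # Stabilize using only the positions that are kept.
        m = max(s for s, keep in zip(row, mask_row) if keep)
        exps = [math.exp(s - m) if keep else 0.0 for s, keep in zip(row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

All rows are processed independently of each other, which is why the masked decoder can be trained on every target frame in parallel.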
3.4 Positional encoding
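To represent the time location in the non-recurrent model, Transformer adopts sinusoidal positional encoding [VaswaniNIPS2017_7181]:
$$\mathrm{PE}[t, 2k] = \sin\left(\frac{t}{10000^{2k/d^{\mathrm{att}}}}\right), \qquad \mathrm{PE}[t, 2k+1] = \cos\left(\frac{t}{10000^{2k/d^{\mathrm{att}}}}\right), \tag{13}$$
where $t$ is the frame index and $k$ indexes pairs of feature dimensions. The input sequences are concatenated with the positional encodings $(\mathrm{PE}[1], \mathrm{PE}[2], \dots)$ before the EncBody and DecBody modules.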
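For reference, the sinusoidal positional encoding can be computed as in the sketch below. It assumes the formulation of Vaswani et al. (sine on even feature dimensions, cosine on odd ones, with frequencies shared by each pair), uses 0-based frame indices, and only produces the encoding itself; how it is combined with the input sequence is left to the surrounding model. The function name `positional_encoding` is hypothetical.

```python
import math


def positional_encoding(length, d_att):
    """Return a length x d_att table of sinusoidal positional encodings."""
    table = []
    for t in range(length):
        row = []
        for c in range(d_att):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = t / (10000 ** ((c - c % 2) / d_att))
            row.append(math.sin(angle) if c % 2 == 0 else math.cos(angle))
        table.append(row)
    return table
```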
4 ASR extensions
In our ASR framework, the S2S predicts a target sequence of characters or SentencePiece [kudo-richardson-2018-sentencepiece] from an input sequence of log-mel filterbank speech features.
4.1 ASR encoder architecture
The source in ASR is represented as a sequence of 83-dim log-mel filterbank frames with pitch features [kaldi-pitch]. First, EncPre transforms the source sequence into a subsampled sequence by using a two-layer CNN with 256 channels, stride size 2 and kernel size 3 as in [speech-transformer], or VGG-like max pooling as in [HoriWZC17]; the length of the subsampled sequence is the length of the output sequence of the CNN. This CNN corresponds to EncPre in Eq. (1). Then, EncBody transforms the subsampled sequence into a sequence of encoded features for the CTC and decoder networks.
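To give a sense of how much the two-layer CNN shortens the input, the sketch below computes the output length along the time axis. It assumes no padding and the stated kernel size 3 and stride 2; actual implementations may pad differently, so the exact lengths are illustrative. The function names are hypothetical.

```python
def conv_out_len(n_frames, kernel=3, stride=2):
    """Output length of a 1-D convolution without padding."""
    if n_frames < kernel:
        raise ValueError("input is shorter than the kernel")
    return (n_frames - kernel) // stride + 1


def subsampled_length(n_frames, n_layers=2):
    """Length after stacking n_layers convolutions (kernel 3, stride 2)."""
    for _ in range(n_layers):
        n_frames = conv_out_len(n_frames)
    return n_frames
```

Under these assumptions a 100-frame utterance is reduced to 24 frames, roughly a factor of four, which in turn shrinks the self-attention matrices in the encoder.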
4.2 ASR decoder architecture
The decoder network receives the encoded sequence and the prefix of a target sequence of token IDs: characters or SentencePiece [kudo-richardson-2018-sentencepiece]. First, DecPre in Eq. (3) embeds the tokens into learnable vectors. Next, DecBody and a single linear layer DecPost predict the posterior distribution of the next token given the encoded sequence and the prefix.
4.3 ASR training and decoding
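During ASR training, both the decoder and the CTC module predict the frame-wise posterior distribution of the target $Y$ given the corresponding source $X$: $p_{\mathrm{s2s}}(Y|X)$ and $p_{\mathrm{ctc}}(Y|X)$, respectively. We simply use the weighted sum of those negative log likelihood values:
$$L^{\mathrm{ASR}} = -\alpha \log p_{\mathrm{s2s}}(Y|X) - (1-\alpha) \log p_{\mathrm{ctc}}(Y|X), \tag{14}$$
where $\alpha$ is a hyperparameter.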
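In the decoding stage, the decoder predicts the next token given the encoded speech feature $X_e$ and the previous predicted tokens using beam search, which combines the scores of S2S, CTC and the RNN language model (LM) [Mikolov2010] as follows:
$$\hat{Y} = \operatorname*{argmax}_{Y \in \mathcal{Y}^{*}} \left\{ \lambda \log p_{\mathrm{s2s}}(Y|X_e) + (1-\lambda) \log p_{\mathrm{ctc}}(Y|X_e) + \gamma \log p_{\mathrm{lm}}(Y) \right\}, \tag{15}$$
where $\mathcal{Y}^{*}$ is a set of hypotheses of the target sequence $Y$, and $\gamma, \lambda$ are hyperparameters.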
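The score combination used to rank hypotheses can be sketched as below. This is a minimal illustration, assuming the S2S and CTC log-probabilities are interpolated with a single weight and the LM log-probability is added with its own weight; the dictionary keys, the function names and the default weights are hypothetical, and the sketch scores complete hypotheses rather than implementing beam search itself.

```python
def joint_score(logp_s2s, logp_ctc, logp_lm, s2s_weight, lm_weight):
    """Interpolate S2S and CTC log-probabilities and add a weighted LM score."""
    return (
        s2s_weight * logp_s2s
        + (1.0 - s2s_weight) * logp_ctc
        + lm_weight * logp_lm
    )


def best_hypothesis(hyps, s2s_weight=0.7, lm_weight=0.3):
    """Pick the hypothesis with the highest joint score.

    hyps: list of dicts with log-probabilities under keys "s2s", "ctc", "lm".
    """
    return max(
        hyps,
        key=lambda h: joint_score(h["s2s"], h["ctc"], h["lm"], s2s_weight, lm_weight),
    )
```

With these weights, a hypothesis that the S2S decoder slightly prefers can still lose to one that CTC and the LM both score much higher, which is the intended effect of joint decoding.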
5 ST extensions
In ST, S2S receives the same source speech feature and target token sequences as in ASR, but the source and target languages are different. Its modules are also defined in the same way as in ASR. However, ST cannot cooperate with the CTC module introduced in Section 4.3 because, unlike ASR, the translation task does not guarantee the monotonic alignment of the source and target sequences [Weiss2017].
6 TTS extensions
In the TTS framework, the S2S generates a sequence of log-mel filterbank features and predicts the probabilities of the end of sequence (EOS) given an input character sequence [DBLP:conf/icassp/ShenPWSJYCZWRSA18].
6.1 TTS encoder architecture
The input of the encoder in TTS is a sequence of IDs corresponding to the input characters and the EOS symbol. First, the character ID sequence is converted into a sequence of character vectors with an embedding layer, and then the positional encoding scaled by a learnable scalar parameter is added to the vectors [li2019close]. This process is a TTS implementation of EncPre in Eq. (1). Finally, the encoder EncBody in Eq. (2) transforms this input sequence into a sequence of encoded features for the decoder network.
6.2 TTS decoder architecture
The inputs of the decoder in TTS are a sequence of encoder features and a sequence of log-mel filterbank features. In training, ground-truth log-mel filterbank features are used in a teacher-forcing manner, while in inference, predicted ones are used in an autoregressive manner.
First, the target sequence of 80-dim log-mel filterbank features is converted into a sequence of hidden features by Prenet [DBLP:conf/icassp/ShenPWSJYCZWRSA18] as a TTS implementation of DecPre in Eq. (3) [...] units. Since it is expected that the hidden representations converted by Prenet are located in a feature space similar to that of the encoder features, Prenet helps to learn a diagonal source attention [li2019close]. Then the decoder DecBody in Eq. (4), whose architecture is the same as the encoder, transforms the sequence of encoder features and that of hidden features into a sequence of decoder features. Two linear layers are applied for each frame of the decoder features to calculate the target feature and the probability of the EOS, respectively. Finally, Postnet [DBLP:conf/icassp/ShenPWSJYCZWRSA18] is applied to the sequence of predicted target features to predict its components in detail. Postnet is a five-layer CNN, each layer of which is a 1d convolution with 256 channels and a kernel size of 5 followed by batch normalization, a tanh activation function, and dropout. These modules are a TTS implementation of DecPost in Eq. (5).
6.3 TTS training and decoding
In TTS training, the whole network is optimized to minimize two loss functions: 1) L1 loss for the target features and 2) binary cross entropy (BCE) loss for the probability of the EOS. To address the issue of class imbalance in the calculation of the BCE, a constant weight (e.g., 5) is used for a positive sample [li2019close].
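The class-imbalance weighting for the EOS loss can be illustrated with the sketch below. It assumes the weight multiplies only the positive (EOS) term of the BCE and that the loss is averaged over frames; the function name `weighted_bce` and the clipping constant are hypothetical.

```python
import math


def weighted_bce(probs, labels, pos_weight=5.0):
    """Binary cross entropy with a constant weight on positive samples.

    probs: predicted EOS probabilities per frame, labels: 0 or 1 per frame.
    """
    eps = 1e-12  # avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

Because only the last frame of each utterance is a positive sample, the weight keeps the rare EOS frames from being dominated by the many non-EOS frames.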
Additionally, we apply a guided attention loss [tachibana2018efficiently] to accelerate the learning of diagonal attention to only the two heads of two layers from the target side. This is because it is known that the source attention matrices are diagonal in only certain heads of a few layers from the target side [li2019close]. We do not introduce any hyperparameters to balance the three loss values. We simply add them all together.
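The guided attention loss can be sketched as follows. This is a minimal illustration assuming the penalty matrix of [tachibana2018efficiently], which is zero on the diagonal and approaches one far from it, with the width parameter `g` set to 0.2 as an assumed default; the function names are hypothetical and the attention matrix is indexed as frames by characters.

```python
import math


def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix that grows with the distance from the diagonal."""
    return [
        [
            1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
            for n in range(n_text)
        ]
        for t in range(n_frames)
    ]


def guided_attention_loss(att, g=0.2):
    """Mean of the attention matrix weighted by the off-diagonal penalty.

    att: n_frames x n_text source attention matrix (list of lists).
    """
    n_frames, n_text = len(att), len(att[0])
    penalty = guided_attention_weights(n_text, n_frames, g)
    total = sum(
        a * w
        for att_row, pen_row in zip(att, penalty)
        for a, w in zip(att_row, pen_row)
    )
    return total / (n_frames * n_text)
```

A perfectly diagonal attention matrix incurs no penalty, while an anti-diagonal one is penalized, which pushes the selected heads toward monotonic alignments.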
In inference, the network predicts the target feature of the next frame in an autoregressive manner. If the probability of the EOS becomes higher than a certain threshold (e.g., 0.5), the network stops the prediction.
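The autoregressive inference loop with the EOS threshold can be sketched as below. The decoder step is abstracted as a callable `step_fn` that returns the next frame and the EOS probability; this interface, the `max_frames` safety limit and the function name `generate` are assumptions made for the illustration.

```python
def generate(step_fn, first_frame, threshold=0.5, max_frames=1000):
    """Generate frames one by one until the EOS probability exceeds threshold.

    step_fn(prev_frame, history) -> (next_frame, eos_probability)
    """
    frames = []
    prev = first_frame
    for _ in range(max_frames):  # safety limit in case EOS is never predicted
        frame, p_eos = step_fn(prev, frames)
        frames.append(frame)
        if p_eos > threshold:
            break
        prev = frame
    return frames
```

For example, with a toy step function that increments the previous value and signals EOS once two frames have been produced, the loop stops after the third frame.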
7 ASR Experiments
In Table 1, we summarize the 15 datasets we used in our ASR experiment. Our experiment covered various topics in ASR including recording (clean, noisy, far-field, etc.), language (English, Japanese, Mandarin Chinese, Spanish, Italian) and size (10 to 960 hours). Except for JSUT [jsut] and Fisher-CALLHOME Spanish, our data preparation scripts are based on Kaldi’s “s5x” recipe [kaldi]. Technically, we tuned all the configurations (e.g., feature extraction, SentencePiece [kudo-richardson-2018-sentencepiece], language modeling, decoding, data augmentation [park2019specaugment, ko2015audio]) except for the training stage to their optimum in the existing RNN-based system. We used data augmentation for several corpora. For example, we applied speed perturbation [ko2015audio] at ratios of 0.9, 1.0 and 1.1 to CSJ, CHiME4, Fisher-CALLHOME Spanish, HKUST, and TED-LIUM2/3, and we also applied SpecAugment [park2019specaugment] to Aurora4, LibriSpeech, TED-LIUM2/3 and WSJ. (Footnote 2: We chose datasets to apply these data augmentation methods by preliminary experiments with our RNN-based system.)
Table 1: ASR datasets used in our experiments.
|dataset||language||hours||speech||test sets|
|AISHELL [aishell]||zh||170||read||dev / test|
|AURORA4 [pearce2002aurora] (*)||en||15||noisy read||(dev_0330) A / B / C / D|
|CSJ [CSJ-L00-1200]||ja||581||spontaneous||eval1 / eval2 / eval3|
|CHiME4 [chime3] (*)||en||108||noisy far-field multi-ch read||dt05_simu / dt05_real / et05_simu / et05_real|
|CHiME5 [chime5]||en||40||noisy far-field multi-ch conversational||dev_worn / kinect|
|Fisher-CALLHOME Spanish||es||170||telephone conversational||dev / dev2 / test / devtest / evltest|
|HKUST [hkust]||zh||200||telephone conversational||dev|
|JSUT [jsut]||ja||10||read||(our split)|
|LibriSpeech [LibriSpeech]||en||960||clean/noisy read||dev_clean / dev_other / test_clean / test_other|
|REVERB [reverb] (*)||en||124||far-field multi-ch read||et_near / et_far|
|SWITCHBOARD [swbd]||en||260||telephone conversational||eval2000 / RT’03|
|TED-LIUM2 [TED-LIUM/ROUSSEAU12.698]||en||118||spontaneous||dev / test|
|TED-LIUM3 [tedlium3]||en||452||spontaneous||dev / test|
|VoxForge [voxforge]||it||16||read||(our split)|
|WSJ [wsjPaul:1992:DWS:1075527.1075614]||en||81||read||dev93 / eval92|
We adopted the same architecture for Transformer ( introduced in Section 3) for every corpus except for the largest, LibriSpeech (). For RNN, we followed our existing best architecture configured on each corpus as in previous studies [hori2018end, Zeyer2018].
Transformer requires a different optimizer configuration from RNN because Transformer’s training iteration is eight times faster and its update is more fine-grained than that of RNN. For RNN, we followed existing best systems for each corpus using Adadelta [adadelta] with early stopping. To train Transformer, we basically followed the previous literature [speech-transformer] (e.g., dropout, learning rate, warmup steps). We did not use development sets for early stopping in Transformer. We simply ran 20 to 200 epochs (mostly 100 epochs) and averaged the model parameters stored at the last 10 epochs as the final model.
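The parameter averaging over the last epochs can be sketched as follows. The checkpoint format (a dictionary mapping parameter names to flat lists of floats) and the function name `average_checkpoints` are assumptions made for the illustration; a real toolkit would average tensors loaded from saved model files.

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameters across checkpoints.

    checkpoints: list of dicts mapping a parameter name to a list of floats.
    """
    n = len(checkpoints)
    averaged = {}
    for name, values in checkpoints[0].items():
        averaged[name] = [
            sum(ckpt[name][i] for ckpt in checkpoints) / n
            for i in range(len(values))
        ]
    return averaged


def average_last(epoch_checkpoints, n_last=10):
    """Average the parameters stored at the last n_last epochs."""
    return average_checkpoints(epoch_checkpoints[-n_last:])
```

Since no development set is used for early stopping, averaging the final checkpoints is the only model selection step in this setup.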
We conducted our training on a single GPU for larger corpora such as LibriSpeech, CSJ and TED-LIUM3. We also confirmed that the emulation of multiple GPUs using accumulating gradients over multiple forward/backward steps [ott-etal-2018-scaling] could result in similar performance with those corpora. In the decoding stage, Transformer and RNN share the same configuration for each corpus, for example, beam size (e.g., 20 – 40), CTC weight (e.g., 0.3), and LM weight (e.g., 0.3 – 1.0) introduced in Section 4.3.
Table 2 summarizes the ASR results in terms of character/word error rate (CER/WER) on each corpus. It shows that Transformer outperforms RNN on 13/15 corpora in our experiment. Although our system has no lexicon (e.g., pronunciation dictionary, part-of-speech tag), our Transformer provides comparable CER/WERs to the HMM-based system, Kaldi, on 7/12 corpora. We conclude that Transformer has the ability to outperform the RNN-based end-to-end system and the DNN/HMM-based system even in low resource (JSUT), large resource (LibriSpeech, CSJ), noisy (AURORA4) and far-field (REVERB) tasks. Table 3 also summarizes the LibriSpeech ASR benchmark with ours and other reports, and our Transformer results are comparable to the best performance in [irie2019language, luscher2019rwth, park2019specaugment].
Fig. 2 shows an ASR training curve obtained with multiple GPUs on LibriSpeech. We observed that Transformer trained with a larger minibatch became more accurate while RNN did not. On the other hand, when we use a smaller minibatch for Transformer, it typically became under-fitted after the warmup steps. In this task, Transformer achieved the best accuracy provided by RNN about eight times faster than RNN with a single GPU.
|dataset||token||error||Kaldi||Our RNN||Our Transformer|
|AISHELL||char||CER||N/A / 7.4||6.8 / 8.0||6.0 / 6.7|
|AURORA4||char||WER||(*) 3.6 / 7.7 / 10.0 / 22.3||3.5 / 6.4 / 5.1 / 12.3||3.3 / 6.0 / 4.5 / 10.6|
|CSJ||char||CER||(*) 7.5 / 6.3 / 6.9||6.6 / 4.8 / 5.0||5.7 / 4.1 / 4.5|
|CHiME4||char||WER||6.8 / 5.6 / 12.1 / 11.4||9.5 / 8.9 / 18.3 / 16.6||9.6 / 8.2 / 15.7 / 14.5|
|CHiME5||char||WER||47.9 / 81.3||59.3 / 88.1||60.2 / 87.1|
|Fisher-CALLHOME Spanish||char||WER||N/A||27.9 / 27.8 / 25.4 / 47.2 / 47.9||27.0 / 26.3 / 24.4 / 45.3 / 46.2|
|LibriSpeech||BPE||WER||3.9 / 10.4 / 4.3 / 10.8||3.1 / 9.9 / 3.3 / 10.8||2.2 / 5.6 / 2.6 / 5.7|
|REVERB||char||WER||18.2 / 19.9||24.1 / 27.2||15.5 / 19.0|
|SWITCHBOARD||BPE||WER||18.1 / 8.8||28.5 / 15.6||18.1 / 9.0|
|TED-LIUM2||BPE||WER||9.0 / 9.0||11.2 / 11.0||9.3 / 8.1|
|TED-LIUM3||BPE||WER||6.2 / 6.8||14.3 / 15.0||9.7 / 8.0|
|VoxForge||char||CER||N/A||12.9 / 12.6||9.4 / 9.1|
|WSJ||char||WER||4.3 / 2.3||7.0 / 4.7||6.8 / 4.4|
Table 3: LibriSpeech ASR benchmark (WER).
|system||dev_clean||dev_other||test_clean||test_other|
|RWTH (E2E) [irie2019language]||2.9||8.8||3.1||9.8|
|RWTH (HMM) [luscher2019rwth]||2.3||5.2||2.7||5.7|
|Google SpecAug. [park2019specaugment]||N/A||N/A||2.5||5.8|
We summarize the training tips we observed in our experiment:
When Transformer suffers from under-fitting, we recommend increasing the minibatch size because, unlike any other hyperparameter, it results in both a faster training time and better accuracy.
The accumulating gradient strategy [ott-etal-2018-scaling] can be adopted to emulate the large minibatch if multiple GPUs are unavailable.
While dropout did not improve the RNN results, it is essential for Transformer to avoid over-fitting.
We tried several data augmentation methods [ko2015audio, park2019specaugment]. They greatly improved both Transformer and RNN.
The best decoding hyperparameters for RNN are generally the best for Transformer.
Transformer’s weakness is decoding. It is much slower than Kaldi’s system because the self-attention requires $O(n^2)$ computation in a naive implementation, where $n$ is the speech length. To directly compare the performance with DNN-HMM based ASR systems, we need to develop a faster decoding algorithm for Transformer.
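The accumulating gradient tip above can be illustrated with a toy example. The sketch below uses a one-parameter linear model with a mean squared error loss (both chosen only for the illustration) and shows that averaging the gradients of equally sized chunks reproduces the gradient of the full minibatch; the function names are hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, n_chunks):
    """Emulate a large minibatch by accumulating gradients over chunks.

    Assumes len(batch) is divisible by n_chunks so all chunks have equal size.
    """
    size = len(batch) // n_chunks
    acc = 0.0
    for i in range(n_chunks):
        chunk = batch[i * size:(i + 1) * size]
        # Each chunk contributes its mean gradient scaled by 1 / n_chunks.
        acc += grad_mse(w, chunk) / n_chunks
    return acc
```

The parameter update is applied only once after all chunks have been processed, so a single GPU sees the same gradient as a multi-GPU run with the same total minibatch size, at the cost of more forward/backward steps per update.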
8 Multilingual ASR Experiments
This section compares the ASR performance of RNN and Transformer in a multilingual setup given the success of Transformer for the monolingual ASR tasks in the previous section. In accordance with [watanabe2017language], we prepared 10 different languages, namely WSJ (English), CSJ (Japanese) [CSJ-L00-1200], HKUST [hkust] (Mandarin Chinese), and VoxForge (German, Spanish, French, Italian, Dutch, Portuguese, Russian). We use a single multilingual model, where the parameters are shared across all the languages and whose output units include the graphemes of all 10 languages (5,297 graphemes and special symbols in total). We used a default setup for both RNN and Transformer introduced in Section 7.2 without RNNLM shallow fusion [HoriWZC17].
Figure 3 clearly shows that our Transformer significantly outperformed our RNN in 9 languages. It realized a more than 10% relative improvement in 8 languages, with the largest relative improvement of 28.0% in VoxForge Italian. When compared with the RNN result reported in [watanabe2017language], which used a deeper BLSTM (7 layers) and RNNLM, our Transformer still provided superior performance in 9 languages. From this result, we can conclude that Transformer also outperforms RNN in multilingual end-to-end ASR.
9 Speech Translation Experiments
Our baseline end-to-end ST RNN is based on [Weiss2017], which is similar to the RNN structure used in our ASR system, but we did not use the convolutional LSTM layer used in the original paper. The configuration of our ST Transformer was the same as that of our ASR system.
We conducted our ST experiment on the Fisher-CALLHOME English–Spanish corpus [post2013improved]. Our Transformer improved the BLEU score to 17.2 from our RNN baseline BLEU of 16.5 on the CALLHOME “evltest” set. While training Transformer, we observed more serious under-fitting than with RNN. The solution for this is to use the pretrained encoder from our ASR experiment, since the ST dataset contains the Fisher-CALLHOME Spanish corpus used in our ASR experiment.
10 TTS Experiments
Our baseline RNN-based TTS model is Tacotron 2 [DBLP:conf/icassp/ShenPWSJYCZWRSA18]. We followed its model and optimizer setting. We reuse existing TTS recipes including those for data preparation and waveform generation that we configured to be the best for RNN. We configured our Transformer-based configurations introduced in Section 3 as follows: . The input for both systems was the sequence of characters.
We compared Transformer and RNN based TTS using two corpora: M-AILABS [mailabs] (Italian, 16 kHz, 31 hours) and LJSpeech [ljspeech17] (English, 22 kHz, 24 hours). A single Italian male speaker (Riccardo) was used in the case of M-AILABS. Figures 4 and 5 show training curves in the two corpora. In these figures, Transformer and RNN provide similar L1 loss convergence. As seen in ASR, we observed that a larger minibatch results in better validation L1 loss for Transformer and faster training, while it has a detrimental effect on the L1 loss for RNN. We also provide generated speech mel-spectrograms in Figs. 6 and 7. (Footnote 3: Our audio samples generated by Tacotron 2, Transformer, and FastSpeech are available at https://bit.ly/329gif5.) We conclude that Transformer-based TTS can achieve almost the same performance as RNN-based TTS.
Our lessons learned when training Transformer in TTS are as follows:
It is possible to accelerate TTS training by using a large minibatch, as in ASR, if many GPUs are available.
The validation loss value, especially BCE loss, could be over-fitted more easily with Transformer. We recommend monitoring attention maps rather than the loss when checking its convergence.
Some heads of attention maps in Transformer are not always diagonal as found with Tacotron 2. We needed to select where to apply the guided attention loss [tachibana2018efficiently].
Decoding filterbank features with Transformer is also slower than with RNN (78.5 ms vs. 6.5 ms per frame, on CPU with a single thread). We also tried FastSpeech [fastspeech], which realizes non-autoregressive Transformer-based TTS. It greatly improves the decoding speed (0.6 ms per frame, on CPU with a single thread) and generates speech of comparable quality to the autoregressive Transformer.
A reduction factor introduced in [Wang2017] was also effective for Transformer. It can greatly reduce training and inference time but slightly degrades the quality.
As future work, we need further investigation of the trade off between training speed and quality, and the introduction of ASR techniques (e.g., data augmentation, speech enhancement) for TTS.
We presented a comparative study of Transformer and RNN in speech applications with various corpora, namely ASR (15 monolingual + one multilingual), ST (one corpus), and TTS (two corpora). In our experiments on these tasks, we obtained promising results, including huge improvements in many ASR tasks, and explained how we improved our models. We believe that the reproducible recipes, pretrained models and training tips described in this paper will accelerate Transformer research directions on speech applications.