
A Comparative Study on Transformer vs RNN in Speech Applications

by Shigeki Karita, et al.

Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence model called Transformer, which achieves state-of-the-art performance in neural machine translation and other natural language processing applications. We undertook intensive studies in which we experimentally compared and analyzed Transformer and conventional recurrent neural networks (RNN) in a total of 15 ASR, one multilingual ASR, one ST, and two TTS benchmarks. Our experiments revealed various training tips and significant performance benefits obtained with Transformer for each task, including the surprising superiority of Transformer in 13/15 ASR benchmarks in comparison with RNN. We are preparing to release Kaldi-style reproducible recipes using open source and publicly available datasets for all the ASR, ST, and TTS tasks so that the community can build on our results.



1 Introduction

Transformer is a sequence-to-sequence (S2S) architecture originally proposed for neural machine translation (NMT) [VaswaniNIPS2017_7181] that is rapidly replacing recurrent neural networks (RNN) in natural language processing tasks. This paper provides intensive comparisons of its performance with that of RNN for speech applications: automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS).

One of the major difficulties when applying Transformer to speech applications is that it requires more complex configurations (e.g., optimizer, network structure, data augmentation) than the conventional RNN based models. Our goal is to share our knowledge on the use of Transformer in speech tasks so that the community can fully benefit from our findings.

Currently, existing Transformer-based speech applications [speech-transformer, CrossVila2018, li2019close] still lack an open source toolkit and reproducible experiments, while previous studies in NMT [ott-etal-2018-scaling, tensor2tensor-W18-1819] provide them. Therefore, we work on an open community-driven project for end-to-end speech applications using both Transformer and RNN, following the success of Kaldi for hidden Markov model (HMM)-based ASR [kaldi]. Specifically, our experiments provide practical guides for tuning Transformer in speech tasks to achieve state-of-the-art results.

In our speech application experiments, we investigate several aspects of Transformer and RNN-based systems. For example, we measure word/character/regression errors from the ground truth, training curves, and scalability over multiple GPUs.

The contributions of this work are:

  • We conduct a large-scale comparative study on Transformer and RNN with significant performance gains, especially for the ASR related tasks.

  • We explain our training tips for Transformer in speech applications: ASR, TTS and ST.

  • We provide reproducible end-to-end recipes and models pretrained on a large number of publicly available datasets.¹

¹ After the double-blind review, our recipes will be available at

Related studies

As Transformer was originally proposed as an NMT system [VaswaniNIPS2017_7181], it has been widely studied on NMT tasks including hyperparameter search [DBLP:journals/pbml/PopelB18], parallelism implementation [ott-etal-2018-scaling], and comparison with RNN [lakew-etal-2018-comparison]. On the other hand, speech processing tasks have so far provided only preliminary results in ASR [speech-transformer, Zhou2018], ST [CrossVila2018] and TTS [li2019close]. Therefore, this paper aims to gather the previous basic research and to explore wider topics (e.g., accuracy, speed, training tips) in our experiments.

2 Sequence-to-sequence RNN

2.1 Unified formulation for S2S

S2S is a variant of neural networks that learns to transform a source sequence X to a target sequence Y [s2s_NIPS2014_5346]. In Fig. 1, we illustrate a common S2S structure for ASR, TTS and ST tasks. S2S consists of two neural networks: an encoder

X_0 = EncPre(X),  (1)
X_e = EncBody(X_0),  (2)

and a decoder

Y_0[1:t-1] = DecPre(Y[1:t-1]),  (3)
Y_d[t] = DecBody(X_e, Y_0[1:t-1]),  (4)
Y_post[1:t] = DecPost(Y_d[1:t]),  (5)

where X is the source sequence (e.g., a sequence of speech features (for ASR and ST) or characters (for TTS)), e is the number of layers in EncBody, d is the number of layers in DecBody, t is a target frame index, and all the functions in the above equations are implemented by neural networks. For the decoder input Y[1:t-1], we use a ground-truth prefix in the training stage, while we use a generated prefix in the decoding stage. During training, the S2S model learns to minimize the scalar loss value

L = Loss(Y_post, Y)

between the generated sequence Y_post and the target sequence Y.

The remainder of this section describes RNN-based universal modules: “EncBody” and “DecBody”. We regard “EncPre”, “DecPre”, “DecPost” and “Loss” as task-specific modules and we describe them in the later sections.
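The module composition above can be sketched in a few lines. This is a minimal NumPy illustration of how Eqs. (1)-(5) chain together, not the actual implementation: every module is replaced by a toy linear map (real systems use CNNs, BLSTMs, or self-attention), and the mean-pooling "attention" inside DecBody is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the task-specific and universal modules of Eqs. (1)-(5).
d_src, d_att, d_out = 5, 8, 7
W_encpre = rng.standard_normal((d_src, d_att))
W_encbody = rng.standard_normal((d_att, d_att))
W_decpre = rng.standard_normal((d_out, d_att))
W_decbody = rng.standard_normal((2 * d_att, d_att))
W_decpost = rng.standard_normal((d_att, d_out))

def s2s_step(X, Y_prefix):
    """One decoding step: consume source X and the target prefix, emit the next frame."""
    X0 = X @ W_encpre                       # Eq. (1): EncPre
    Xe = X0 @ W_encbody                     # Eq. (2): EncBody
    Y0 = Y_prefix @ W_decpre                # Eq. (3): DecPre
    # Eq. (4): DecBody combines Xe and the prefix; here we just mean-pool both.
    ctx = np.concatenate([Xe.mean(axis=0), Y0.mean(axis=0)])
    Yd = ctx @ W_decbody
    return Yd @ W_decpost                   # Eq. (5): DecPost

X = rng.standard_normal((10, d_src))        # source sequence (10 frames)
Y_prefix = rng.standard_normal((3, d_out))  # ground-truth prefix (teacher forcing)
y_next = s2s_step(X, Y_prefix)
```

In training the prefix comes from the ground truth; in decoding, `s2s_step` would be called repeatedly with its own previous outputs appended to `Y_prefix`.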

2.2 RNN encoder

EncBody(·) in Eq. (2) transforms the source sequence X_0 into an intermediate sequence X_e. Existing RNN-based implementations [Bahdanau15, Chan2016, DBLP:conf/icassp/ShenPWSJYCZWRSA18] typically adopt a bi-directional long short-term memory (BLSTM) that can perform such an operation thanks to its recurrent connection. For ASR, the encoded sequence X_e can also be used for source-level frame-wise prediction using connectionist temporal classification (CTC) [ctc-DBLP:conf/icml/GravesFGS06] for joint training and decoding [hori2018end].

2.3 RNN decoder

DecBody(·) in Eq. (4) generates the next target frame using the encoded sequence X_e and the prefix of the target Y_0[1:t-1]. For sequence generation, the decoder is mostly unidirectional. For example, a uni-directional LSTM with an attention mechanism [Bahdanau15] is often used in RNN-based implementations. The attention mechanism emits source frame-wise weights that sum the encoded source frames X_e into a target frame-wise vector, which is then transformed together with the prefix. We refer to this type of attention as “source attention”.
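The source-attention mechanism just described can be sketched as follows. This is a hedged NumPy illustration of additive (Bahdanau-style) attention under assumed shapes; the parameter names `W_q`, `W_k`, `v` are ours, not from the paper.

```python
import numpy as np

def source_attention(query, enc_seq, W_q, W_k, v):
    """Additive source attention: score each encoded source frame against the
    current decoder state, softmax the scores into frame-wise weights, and
    return the weighted sum as a target-frame-wise context vector."""
    scores = np.tanh(query @ W_q + enc_seq @ W_k) @ v   # one score per frame
    scores = scores - scores.max()                      # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()     # frame-wise weights
    return weights @ enc_seq, weights                   # context, weights

rng = np.random.default_rng(0)
n_src, d = 6, 4
enc_seq = rng.standard_normal((n_src, d))   # encoded source frames X_e
query = rng.standard_normal(d)              # current decoder state
W_q = rng.standard_normal((d, d))
W_k = rng.standard_normal((d, d))
v = rng.standard_normal(d)
context, weights = source_attention(query, enc_seq, W_q, W_k, v)
```

The weights form a distribution over source frames, so the context vector is a convex combination of the encoded frames.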

Figure 1: Sequence-to-sequence architecture in speech applications.

3 Transformer

Transformer learns sequential information via a self-attention mechanism instead of the recurrent connection employed in RNN. This section describes the self-attention based modules in Transformer in detail.

3.1 Multi-head attention

Transformer consists of multiple dot-attention layers [luong-dot-att-D15-1166]:

att(X^q, X^k, X^v) = softmax( X^q W^q (X^k W^k)^T / sqrt(d_att) ) X^v W^v,  (7)

where X^k, X^v (of length n_k) and X^q (of length n_q) are inputs for this attention layer, d_att is the number of feature dimensions, n_q is the length of X^q, and n_k is the length of X^k and X^v. We refer to X^q W^q (X^k W^k)^T as the “attention matrix”. Vaswani et al. [VaswaniNIPS2017_7181] considered these inputs X^q and (X^k, X^v) to be a query and a set of key-value pairs, respectively.

In addition, to allow the model to deal with multiple attentions in parallel, Vaswani et al. [VaswaniNIPS2017_7181] extended the attention layer in Eq. (7) to multi-head attention (MHA):

MHA(Q, K, V) = [H_1, H_2, ..., H_{d_head}] W^head,
H_h = att(Q W_h^q, K W_h^k, V W_h^v),

where Q, K and V are inputs for this MHA layer, H_h is the h-th attention layer output (h = 1, ..., d_head), W_h^q, W_h^k, W_h^v and W^head are learnable weight matrices, and d_head is the number of attentions in this layer.
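A minimal NumPy sketch of the two definitions above, assuming square per-head projections for simplicity (real implementations usually project each head to d_att/d_head dimensions):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def att(Xq, Xk, Xv, Wq, Wk, Wv, d_att):
    """Dot-attention of Eq. (7): the (n_q x n_k) attention matrix re-weights
    the projected values."""
    A = softmax((Xq @ Wq) @ (Xk @ Wk).T / np.sqrt(d_att))
    return A @ (Xv @ Wv)

def mha(Xq, Xk, Xv, heads, W_head, d_att):
    """Multi-head attention: run d_head attentions in parallel and mix their
    concatenated outputs with W_head."""
    Hs = [att(Xq, Xk, Xv, Wq, Wk, Wv, d_att) for (Wq, Wk, Wv) in heads]
    return np.concatenate(Hs, axis=-1) @ W_head

rng = np.random.default_rng(0)
d_att, d_head, n_q, n_k = 8, 2, 5, 6
Xq = rng.standard_normal((n_q, d_att))
Xk = Xv = rng.standard_normal((n_k, d_att))
heads = [tuple(rng.standard_normal((d_att, d_att)) for _ in range(3))
         for _ in range(d_head)]
W_head = rng.standard_normal((d_att * d_head, d_att))
out = mha(Xq, Xk, Xv, heads, W_head, d_att)
```

For self attention, `Xq`, `Xk` and `Xv` are the same sequence; for source attention, `Xq` comes from the decoder and `Xk = Xv` from the encoder.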

3.2 Self-attention encoder

We define the Transformer-based EncBody(·) used for Eq. (2), unlike the RNN encoder in Section 2.2, as follows:

X_i' = X_i + MHA_i(X_i, X_i, X_i),
X_{i+1} = X_i' + FF_i(X_i'),

where i = 0, ..., e-1 is the index of encoder layers, and FF_i is the i-th two-layer feedforward network:

FF(X[t]) = ReLU(X[t] W_1^ff + b_1^ff) W_2^ff + b_2^ff,

where X[t] is the t-th frame of the input sequence X, W_1^ff and W_2^ff are learnable weight matrices, and b_1^ff and b_2^ff are learnable bias vectors. We refer to MHA_i(X_i, X_i, X_i) above as “self attention”.
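One encoder layer, sketched in NumPy under the same residual structure. The self-attention module is stubbed out with uniform frame averaging so the block stays self-contained; a real layer would use the MHA of Section 3.1 (and, in practice, layer normalization, which we omit here).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ff(X, W1, b1, W2, b2):
    """Two-layer position-wise feedforward network applied to every frame."""
    return relu(X @ W1 + b1) @ W2 + b2

def encoder_layer(X, self_att, W1, b1, W2, b2):
    """One EncBody layer: residual self-attention, then residual feedforward."""
    X = X + self_att(X)              # X' = X + MHA(X, X, X)
    return X + ff(X, W1, b1, W2, b2) # X_{i+1} = X' + FF(X')

rng = np.random.default_rng(0)
d_att, d_ff, n = 8, 16, 5
W1 = rng.standard_normal((d_att, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_att)); b2 = np.zeros(d_att)
# Stand-in self-attention: uniform averaging over frames (content-independent).
self_att = lambda X: np.broadcast_to(X.mean(axis=0), X.shape)
X = rng.standard_normal((n, d_att))
out = encoder_layer(X, self_att, W1, b1, W2, b2)
```

Stacking e such layers and feeding the result onward yields X_e of Eq. (2).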

3.3 Self-attention decoder

The Transformer-based DecBody(·) used for Eq. (4) consists of two attention modules:

Y_j'[t] = Y_j[t] + MHA_j^self(Y_j[t], Y_j[1:t], Y_j[1:t]),
Y_j''[t] = Y_j'[t] + MHA_j^src(Y_j'[t], X_e, X_e),
Y_{j+1}[t] = Y_j''[t] + FF_j(Y_j''[t]),

where j = 0, ..., d-1 is the index of the decoder layers. We refer to the attention matrix between the decoder input and the encoder output in MHA_j^src as “source attention”, the same as the one in RNN in Sec. 2.3. Because the unidirectional decoder is useful for sequence generation, its attention matrices at the t-th target frame are masked so that they do not connect with future frames later than t. This masking of the sequence can be done in parallel using an elementwise product with a triangular binary matrix. Because it requires no sequential operation, it provides a faster implementation than RNN.
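The triangular masking can be sketched directly. The snippet below builds the lower-triangular binary matrix with `np.tril` and applies it before the softmax, so every row's weights cover only the current and past frames while all rows are computed in one parallel pass:

```python
import numpy as np

def masked_softmax(scores):
    """Causal self-attention weights: a lower-triangular binary matrix blocks
    (via -inf before the softmax) every connection to future frames."""
    T = scores.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))     # triangular binary matrix
    scores = np.where(mask, scores, -np.inf)        # hide future frames
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T = 4
W = masked_softmax(rng.standard_normal((T, T)))
```

Row t of `W` is a distribution over frames 1..t only; the strictly upper triangle is exactly zero.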

3.4 Positional encoding

To represent the time location in the non-recurrent model, Transformer adopts sinusoidal positional encoding:

PE[t, 2i] = sin(t / 10000^(2i/d_att)),
PE[t, 2i+1] = cos(t / 10000^(2i/d_att)).

The input sequences X_0 and Y_0 are concatenated with (PE[1], PE[2], ...) before the EncBody(·) and DecBody(·) modules.
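The sinusoidal table above can be built in a few vectorized lines; a minimal NumPy sketch:

```python
import numpy as np

def positional_encoding(n_frames, d_att):
    """Sinusoidal positional encoding: PE[t, 2i] = sin(t / 10000^(2i/d_att)),
    PE[t, 2i+1] = cos(t / 10000^(2i/d_att)); each time index gets a unique,
    smoothly varying code."""
    t = np.arange(n_frames)[:, None]        # (n_frames, 1) time indices
    two_i = np.arange(0, d_att, 2)[None, :] # even feature dimensions 2i
    angle = t / np.power(10000.0, two_i / d_att)
    pe = np.zeros((n_frames, d_att))
    pe[:, 0::2] = np.sin(angle)             # even dims: sine
    pe[:, 1::2] = np.cos(angle)             # odd dims: cosine
    return pe

pe = positional_encoding(50, 8)
```

Because each dimension oscillates at a different wavelength, relative offsets between frames are expressible as linear functions of the codes, which is what makes this encoding usable without recurrence.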

4 ASR extensions

In our ASR framework, the S2S predicts a target sequence Y of characters or SentencePiece subwords [kudo-richardson-2018-sentencepiece] from an input sequence of log-mel filterbank speech features.

4.1 ASR encoder architecture

The source X in ASR is represented as a sequence of 83-dim log-mel filterbank frames with pitch features [kaldi-pitch]. First, EncPre(·) transforms the source sequence X into a subsampled sequence X_0 by using a two-layer CNN with 256 channels, stride size 2 and kernel size 3 in Transformer, or VGG-like max pooling in RNN [HoriWZC17], where the length of X_0 is the length of the output sequence of the CNN. This CNN corresponds to EncPre(·) in Eq. (1). Then, EncBody(·) transforms X_0 into a sequence of encoded features X_e for the CTC and decoder networks.

4.2 ASR decoder architecture

The decoder network receives the encoded sequence X_e and the prefix of a target sequence Y[1:t-1] of token IDs: characters or SentencePiece [kudo-richardson-2018-sentencepiece]. First, DecPre(·) in Eq. (3) embeds the tokens into learnable vectors. Next, DecBody(·) and a single linear layer DecPost(·) predict the posterior distribution of the next token Y_post[t] given X_e and Y[1:t-1].

4.3 ASR training and decoding

During ASR training, both the decoder and the CTC module predict the frame-wise posterior distribution of Y given the corresponding source X: p_s2s(Y | X) and p_ctc(Y | X), respectively. We simply use the weighted sum of those negative log likelihood values:

L_ASR = -α log p_ctc(Y | X) - (1 - α) log p_s2s(Y | X),

where α is a hyperparameter.
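The weighted multi-task objective is a one-liner once the two negative log-likelihoods are available. A minimal sketch, with `nll` a toy stand-in (a real system would use the full attention-decoder and CTC forward computations):

```python
import numpy as np

def nll(log_probs, targets):
    """Toy negative log-likelihood of a target ID sequence under per-frame
    posteriors (log_probs: shape (T, vocab))."""
    return -sum(log_probs[t, y] for t, y in enumerate(targets))

def hybrid_asr_loss(nll_s2s, nll_ctc, alpha=0.3):
    """L_ASR = alpha * L_ctc + (1 - alpha) * L_s2s, alpha being the CTC weight."""
    return alpha * nll_ctc + (1.0 - alpha) * nll_s2s

logp = np.log(np.full((3, 5), 0.2))   # uniform toy posteriors, vocab of 5
targets = [1, 2, 3]
loss = hybrid_asr_loss(nll(logp, targets), nll(logp, targets))
```

With `alpha=0` the objective reduces to pure attention-decoder training; with `alpha=1` to pure CTC.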

In the decoding stage, the decoder predicts the next token given the speech feature X and the previously predicted tokens using beam search, which combines the scores of S2S, CTC and the RNN language model (LM) [Mikolov2010] as follows:

Ŷ = argmax over hypotheses Y of { λ log p_ctc(Y | X_e) + (1 - λ) log p_s2s(Y | X_e) + γ log p_lm(Y) },

where the argmax ranges over the set of hypotheses of the target sequence, and λ and γ are hyperparameters.
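The score combination can be sketched as a simple hypothesis re-ranking step. This is an illustrative fragment, not the full beam search: the per-hypothesis log-probabilities are assumed to be given, and the function names are ours.

```python
def combined_score(logp_s2s, logp_ctc, logp_lm, lam=0.3, gamma=0.3):
    """Decoding score of one hypothesis: interpolate the CTC and S2S scores
    with weight lam and add the LM score scaled by gamma."""
    return lam * logp_ctc + (1.0 - lam) * logp_s2s + gamma * logp_lm

def rank_hypotheses(hyps, lam=0.3, gamma=0.3):
    """hyps: list of (tokens, logp_s2s, logp_ctc, logp_lm); return best-first."""
    return sorted(hyps,
                  key=lambda h: combined_score(h[1], h[2], h[3], lam, gamma),
                  reverse=True)

hyps = [("a cat", -4.0, -5.0, -2.0),   # hypothetical scored hypotheses
        ("a cap", -4.5, -4.2, -6.0)]
best = rank_hypotheses(hyps)[0][0]
```

In a real beam search this scoring is applied incrementally at every expansion step, not only at the end.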

5 ST extensions

In ST, the S2S receives the same kinds of source speech features and target token sequences as in ASR, but the source and target languages are different. Its modules are also defined in the same ways as in ASR. However, ST cannot cooperate with the CTC module introduced in Section 4.3 because, unlike ASR, the translation task does not guarantee a monotonic alignment of the source and target sequences [Weiss2017].

6 TTS extensions

In the TTS framework, the S2S generates a sequence of log-mel filterbank features and predicts the probabilities of the end of sequence (EOS) given an input character sequence.
6.1 TTS encoder architecture

The input of the encoder in TTS is a sequence of IDs corresponding to the input characters and the EOS symbol. First, the character ID sequence is converted into a sequence of character vectors with an embedding layer, and then the positional encoding scaled by a learnable scalar parameter is added to the vectors [li2019close]. This process is a TTS implementation of in Eq. (1). Finally, the encoder in Eq. (2) transforms this input sequence into a sequence of encoded features for the decoder network.

6.2 TTS decoder architecture

The inputs of the decoder in TTS are a sequence of encoder features and a sequence of log-mel filterbank features. In training, ground-truth log-mel filterbank features are used in a teacher-forcing manner, while in inference, predicted ones are used in an autoregressive manner.

First, the target sequence of 80-dim log-mel filterbank features is converted into a sequence of hidden features by Prenet [DBLP:conf/icassp/ShenPWSJYCZWRSA18] as a TTS implementation of DecPre(·) in Eq. (3). This network consists of two linear layers with 256 units, a ReLU activation function, and dropout, followed by a projection linear layer. Since the hidden representations produced by Prenet are expected to lie in a feature space similar to that of the encoder features, Prenet helps to learn a diagonal source attention [li2019close]. Then the decoder DecBody(·) in Eq. (4), whose architecture is the same as that of the encoder, transforms the sequence of encoder features and that of hidden features into a sequence of decoder features. Two linear layers are applied to each frame of the decoder features to calculate the target feature and the probability of the EOS, respectively. Finally, Postnet [DBLP:conf/icassp/ShenPWSJYCZWRSA18] is applied to the sequence of predicted target features to refine their detailed components. Postnet is a five-layer CNN, each layer of which is a 1d convolution with 256 channels and a kernel size of 5, followed by batch normalization, a tanh activation function, and dropout. These modules are a TTS implementation of DecPost(·) in Eq. (5).

6.3 TTS training and decoding

In TTS training, the whole network is optimized to minimize two loss functions: 1) an L1 loss for the target features and 2) a binary cross entropy (BCE) loss for the probability of the EOS. To address the issue of class imbalance in the calculation of the BCE, a constant weight (e.g., 5) is used for positive samples.


Additionally, we apply a guided attention loss [tachibana2018efficiently] to accelerate the learning of diagonal attention, applied to only two heads of two layers from the target side. This is because it is known that the source attention matrices are diagonal in only certain heads of a few layers from the target side [li2019close]. We do not introduce any hyperparameters to balance the three loss values; we simply add them all together.
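The guided attention loss of Tachibana et al. can be sketched directly from its definition: a fixed penalty matrix that is near zero along the diagonal and grows away from it, multiplied elementwise with the attention matrix. A minimal NumPy version, with `g` the sharpness parameter:

```python
import numpy as np

def guided_attention_weight(T_out, T_in, g=0.2):
    """Penalty matrix W[t, n] = 1 - exp(-(n/T_in - t/T_out)^2 / (2 g^2)):
    exactly zero on the diagonal, approaching one far from it."""
    t = np.arange(T_out)[:, None] / T_out
    n = np.arange(T_in)[None, :] / T_in
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))

def guided_attention_loss(A, g=0.2):
    """Mean elementwise product of an attention matrix A (T_out x T_in) with
    the penalty; minimizing it pushes A toward a diagonal alignment."""
    return float(np.mean(A * guided_attention_weight(*A.shape, g=g)))

diag = np.eye(8)                       # perfectly diagonal attention
uniform = np.full((8, 8), 1.0 / 8.0)   # diffuse attention
```

A diagonal attention map incurs (near) zero loss, while diffuse attention is penalized, which is why the loss accelerates the emergence of diagonal alignments in the selected heads.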

In inference, the network predicts the target feature of the next frame in an autoregressive manner, and if the probability of the EOS exceeds a certain threshold (e.g., 0.5), the network stops the prediction.
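The stopping rule above amounts to a short feedback loop. This sketch uses a toy `toy_step` function standing in for the decoder (its ramping EOS probability is purely illustrative):

```python
import numpy as np

def tts_decode(step_fn, max_frames=200, eos_threshold=0.5):
    """Autoregressive TTS inference: feed each predicted frame back in and
    stop once the predicted EOS probability exceeds the threshold."""
    frames = []
    prev = np.zeros(4)                       # toy 4-dim "filterbank" frame
    for _ in range(max_frames):
        frame, p_eos = step_fn(prev, len(frames))
        frames.append(frame)
        if p_eos > eos_threshold:            # EOS fired: stop generating
            break
        prev = frame
    return np.stack(frames)

# Toy decoder stand-in: EOS probability ramps up with the frame index, so
# decoding stops after a few frames.
def toy_step(prev_frame, t):
    return prev_frame + 1.0, t / 10.0

mel = tts_decode(toy_step)
```

The `max_frames` cap is a practical safeguard against the EOS probability never crossing the threshold.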

7 ASR Experiments

7.1 Dataset

In Table 1, we summarize the 15 datasets we used in our ASR experiments. Our experiments covered various ASR conditions, including recording condition (clean, noisy, far-field, etc.), language (English, Japanese, Mandarin Chinese, Spanish, Italian) and size (10 – 960 hours). Except for JSUT [jsut] and Fisher-CALLHOME Spanish, our data preparation scripts are based on Kaldi's "s5x" recipe [kaldi]. We tuned all the configurations (e.g., feature extraction, SentencePiece [kudo-richardson-2018-sentencepiece], language modeling, decoding, data augmentation [park2019specaugment, ko2015audio]) except for the training stage to their optimum in the existing RNN-based system. We used data augmentation for several corpora. For example, we applied speed perturbation [ko2015audio] at ratios 0.9, 1.0 and 1.1 to CSJ, CHiME4, Fisher-CALLHOME Spanish, HKUST, and TED-LIUM2/3, and we also applied SpecAugment [park2019specaugment] to Aurora4, LibriSpeech, TED-LIUM2/3 and WSJ.²

² We chose the datasets to which we applied these data augmentation methods by preliminary experiments with our RNN-based system.

dataset language hours speech test sets
AISHELL [aishell] zh 170 read dev / test
AURORA4 [pearce2002aurora] (*) en 15 noisy read (dev_0330) A / B / C / D
CSJ [CSJ-L00-1200] ja 581 spontaneous eval1 / eval2 / eval3
CHiME4 [chime3] (*) en 108 noisy far-field multi-ch read dt05_simu / dt05_real / et05_simu / et05_real
CHiME5 [chime5] en 40 noisy far-field multi-ch conversational dev_worn / kinect
Fisher-CALLHOME Spanish es 170 telephone conversational dev / dev2 / test / devtest / evltest
HKUST [hkust] zh 200 telephone conversational dev
JSUT [jsut] ja 10 read (our split)
LibriSpeech [LibriSpeech] en 960 clean/noisy read dev_clean / dev_other / test_clean / test_other
REVERB [reverb] (*) en 124 far-field multi-ch read et_near / et_far
SWITCHBOARD [swbd] en 260 telephone conversational eval2000 / RT’03
TED-LIUM2 [TED-LIUM/ROUSSEAU12.698] en 118 spontaneous dev / test
TED-LIUM3 [tedlium3] en 452 spontaneous dev / test
VoxForge [voxforge] it 16 read (our split)
WSJ [wsjPaul:1992:DWS:1075527.1075614] en 81 read dev93 / eval92
Table 1: ASR dataset description. Names listed in “test sets” correspond to ASR results in Table 2. We enlarged corpora marked with (*) by the external WSJ train_si284 dataset (81 hours).

7.2 Settings

We adopted the same Transformer architecture (introduced in Section 3) for every corpus except for the largest, LibriSpeech. For RNN, we followed our existing best architecture configured on each corpus as in previous studies [hori2018end, Zeyer2018].

Transformer requires a different optimizer configuration from RNN because Transformer's training iteration is eight times faster and its update is more fine-grained than RNN's. For RNN, we followed existing best systems for each corpus using Adadelta [adadelta] with early stopping. To train Transformer, we basically followed the previous literature [speech-transformer] (e.g., dropout, learning rate, warmup steps). We did not use development sets for early stopping in Transformer. We simply ran 20 – 200 epochs (mostly 100 epochs) and averaged the model parameters stored at the last 10 epochs as the final model.
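The final-model averaging step is straightforward to implement. A minimal sketch, assuming each epoch snapshot is a dict mapping parameter names to arrays (the snapshot layout is our assumption, not a prescribed format):

```python
import numpy as np

def average_checkpoints(snapshots):
    """Average model parameters elementwise across epoch snapshots; used here
    in place of early stopping: the last 10 stored models are averaged into
    the final model."""
    avg = {}
    for name in snapshots[0]:
        avg[name] = np.mean([s[name] for s in snapshots], axis=0)
    return avg

# Ten toy "epoch snapshots" with two parameters each.
snaps = [{"w": np.full((2, 2), float(i)), "b": np.array([i, -i], float)}
         for i in range(10)]
final = average_checkpoints(snaps)
```

Averaging the last few checkpoints smooths out the noise of individual updates and typically yields a slightly better model than any single snapshot.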

We conducted our training on multiple GPUs for larger corpora such as LibriSpeech, CSJ and TED-LIUM3. We also confirmed that emulating multiple GPUs by accumulating gradients over multiple forward/backward steps [ott-etal-2018-scaling] could result in similar performance with those corpora. In the decoding stage, Transformer and RNN share the same configuration for each corpus, for example, beam size (e.g., 20 – 40), CTC weight (e.g., 0.3), and LM weight (e.g., 0.3 – 1.0) introduced in Section 4.3.

7.3 Results

Table 2 summarizes the ASR results in terms of character/word error rate (CER/WER) on each corpus. It shows that Transformer outperforms RNN on 13/15 corpora in our experiment. Although our system has no lexicon (e.g., pronunciation dictionary, part-of-speech tags), our Transformer provides CER/WERs comparable to the HMM-based system Kaldi on 7/12 corpora. We conclude that Transformer has the ability to outperform the RNN-based end-to-end system and the DNN/HMM-based system even in low-resource (JSUT), large-resource (LibriSpeech, CSJ), noisy (AURORA4) and far-field (REVERB) tasks. Table 3 also compares the LibriSpeech ASR benchmark between our systems and other reports; our Transformer results are comparable to the best performance in [irie2019language, luscher2019rwth, park2019specaugment].

Fig. 2 shows ASR training curves obtained with multiple GPUs on LibriSpeech. We observed that Transformer trained with a larger minibatch became more accurate, while RNN did not. On the other hand, when we used a smaller minibatch for Transformer, it typically became under-fitted after the warmup steps. In this task, Transformer reached the best accuracy provided by RNN about eight times faster than RNN with a single GPU.

Figure 2: ASR training curve with LibriSpeech dataset.
dataset token error Kaldi Our RNN Our Transformer
AISHELL char CER N/A / 7.4 6.8 / 8.0 6.0 / 6.7
AURORA4 char WER (*) 3.6 / 7.7 / 10.0 / 22.3 3.5 / 6.4 / 5.1 / 12.3 3.3 / 6.0 / 4.5 / 10.6
CSJ char CER (*) 7.5 / 6.3 / 6.9 6.6 / 4.8 / 5.0 5.7 / 4.1 / 4.5
CHiME4 char WER 6.8 / 5.6 / 12.1 / 11.4 9.5 / 8.9 / 18.3 / 16.6 9.6 / 8.2 / 15.7 / 14.5
CHiME5 char WER 47.9 / 81.3 59.3 / 88.1 60.2 / 87.1
Fisher-CALLHOME Spanish char WER N/A 27.9 / 27.8 / 25.4 / 47.2 / 47.9 27.0 / 26.3 / 24.4 / 45.3 / 46.2
HKUST char CER 23.7 27.4 23.5
JSUT char CER N/A 20.6 18.7
LibriSpeech BPE WER 3.9 / 10.4 / 4.3 / 10.8 3.1 / 9.9 / 3.3 / 10.8 2.2 / 5.6 / 2.6 / 5.7
REVERB char WER 18.2 / 19.9 24.1 / 27.2 15.5 / 19.0
SWITCHBOARD BPE WER 18.1 / 8.8 28.5 / 15.6 18.1 / 9.0
TED-LIUM2 BPE WER 9.0 / 9.0 11.2 / 11.0 9.3 / 8.1
TED-LIUM3 BPE WER 6.2 / 6.8 14.3 / 15.0 9.7 / 8.0
VoxForge char CER N/A 12.9 / 12.6 9.4 / 9.1
WSJ char WER 4.3 / 2.3 7.0 / 4.7 6.8 / 4.4
Table 2: ASR results of char/word error rates. Results marked with (*) were evaluated in our environment because the official results were not provided. Kaldi official results were retrieved from the version “c7876a33”.
dev_clean dev_other test_clean test_other
RWTH (E2E) [irie2019language] 2.9 8.8 3.1 9.8
RWTH (HMM) [luscher2019rwth] 2.3 5.2 2.7 5.7
Google SpecAug. [park2019specaugment] N/A N/A 2.5 5.8
Our Transformer 2.2 5.6 2.6 5.7
Table 3: Comparison of the Librispeech ASR benchmark

7.4 Discussion

We summarize the training tips we observed in our experiment:

  • When Transformer suffers from under-fitting, we recommend increasing the minibatch size because, unlike other hyperparameters, it improves both training speed and accuracy simultaneously.

  • The accumulating gradient strategy [ott-etal-2018-scaling] can be adopted to emulate the large minibatch if multiple GPUs are unavailable.

  • While dropout did not improve the RNN results, it is essential for Transformer to avoid over-fitting.

  • We tried several data augmentation methods [ko2015audio, park2019specaugment]. They greatly improved both Transformer and RNN.

  • The best decoding hyperparameters for RNN are generally the best for Transformer.
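The accumulating gradient strategy from the second tip can be sketched concisely: sum gradients over several small forward/backward passes, then apply one update with the averaged gradient, as if a single large minibatch had been used. This is an illustrative NumPy version with a toy quadratic loss; the function names are ours.

```python
import numpy as np

def sgd_accumulated(grad_fn, params, batches, accum_steps, lr=0.1):
    """Emulate a large minibatch on one GPU: accumulate gradients over
    accum_steps small batches, then apply a single averaged update."""
    acc = np.zeros_like(params)
    for i, batch in enumerate(batches, start=1):
        acc += grad_fn(params, batch)          # "backward" on a small batch
        if i % accum_steps == 0:
            params = params - lr * acc / accum_steps  # one "large-batch" step
            acc = np.zeros_like(acc)           # reset the accumulator
    return params

# Toy loss 0.5*||params - batch_mean||^2, whose gradient is params - mean(batch).
grad_fn = lambda p, b: p - np.mean(b, axis=0)
batches = [np.full((4, 2), v) for v in (1.0, 2.0, 3.0, 4.0)]
params = sgd_accumulated(grad_fn, np.zeros(2), batches, accum_steps=4)
```

The update after four accumulated batches is identical to a single step on their concatenation, which is the whole point of the emulation.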

Transformer’s weakness is decoding. It is much slower than Kaldi’s system because self-attention requires quadratic computation in the speech length in a naive implementation. To directly compare performance with DNN/HMM-based ASR systems, we need to develop a faster decoding algorithm for Transformer.

8 Multilingual ASR Experiments

This section compares the ASR performance of RNN and Transformer in a multilingual setup, given the success of Transformer for the monolingual ASR tasks in the previous section. In accordance with [watanabe2017language], we prepared 10 different languages, namely WSJ (English), CSJ (Japanese) [CSJ-L00-1200], HKUST (Mandarin Chinese) [hkust], and VoxForge (German, Spanish, French, Italian, Dutch, Portuguese, Russian). The model is a single multilingual model, whose parameters are shared across all the languages and whose output units include the graphemes of all 10 languages (5,297 graphemes and special symbols in total). We used the default setup for both RNN and Transformer introduced in Section 7.2, without RNNLM shallow fusion [HoriWZC17].

Figure 3: Comparison of multilingual end-to-end ASR with the RNN in Watanabe et al. [watanabe2017language], our RNN, and our Transformer.

Figure 3 clearly shows that our Transformer significantly outperformed our RNN in 9 languages. It realized a more than 10% relative improvement in 8 languages, with the largest relative improvement, 28.0%, on VoxForge Italian. Even when compared with the RNN result reported in [watanabe2017language], which used a deeper BLSTM (7 layers) and RNNLM, our Transformer still provided superior performance in 9 languages. From this result, we can conclude that Transformer also outperforms RNN in multilingual end-to-end ASR.

9 Speech Translation Experiments

Our baseline end-to-end ST RNN is based on [Weiss2017], which is similar to the RNN structure used in our ASR system, but we did not use the convolutional LSTM layers of the original paper. The configuration of our ST Transformer was the same as that of our ASR system.

We conducted our ST experiment on the Fisher-CALLHOME English–Spanish corpus [post2013improved]. Our Transformer improved the BLEU score to 17.2 from our RNN baseline's 16.5 on the CALLHOME “evltest” set. While training Transformer, we observed more serious under-fitting than with RNN. Our solution is to initialize the ST encoder with the one pretrained in our ASR experiment, since the ST dataset contains the Fisher-CALLHOME Spanish corpus used in our ASR experiment.

10 TTS Experiments

10.1 Settings

Our baseline RNN-based TTS model is Tacotron 2 [DBLP:conf/icassp/ShenPWSJYCZWRSA18]. We followed its model and optimizer settings. We reused existing TTS recipes, including those for data preparation and waveform generation, which we had configured to be the best for RNN. Our Transformer-based model followed the configuration introduced in Section 3. The input for both systems was a sequence of characters.

10.2 Results

We compared Transformer- and RNN-based TTS using two corpora: M-AILABS [mailabs] (Italian, 16 kHz, 31 hours) and LJSpeech [ljspeech17] (English, 22 kHz, 24 hours). A single Italian male speaker (Riccardo) was used in the case of M-AILABS. Figures 4 and 5 show training curves on the two corpora. In these figures, Transformer and RNN provide similar L1 loss convergence. As in ASR, we observed that a larger minibatch results in better validation L1 loss and faster training for Transformer, while it has a detrimental effect on the L1 loss for RNN. We also provide generated speech mel-spectrograms in Figs. 6 and 7.³ We conclude that Transformer-based TTS can achieve almost the same performance as RNN-based TTS.

³ Our audio samples generated by Tacotron 2, Transformer, and FastSpeech are available at

Figure 4: TTS training curve on M-AILABS.
Figure 5: TTS training curve on LJSpeech.

10.3 Discussion

Our lessons learned when training Transformer in TTS are as follows:

  • It is possible to accelerate TTS training with a large minibatch, as in ASR, if many GPUs are available.

  • The validation loss value, especially BCE loss, could be over-fitted more easily with Transformer. We recommend monitoring attention maps rather than the loss when checking its convergence.

  • Some heads of attention maps in Transformer are not always diagonal as found with Tacotron 2. We needed to select where to apply the guided attention loss [tachibana2018efficiently].

  • Decoding filterbank features with Transformer is also slower than with RNN (RNN: 6.5 ms vs Transformer: 78.5 ms per frame, on CPU with a single thread). We also tried FastSpeech [fastspeech], which realizes non-autoregressive Transformer-based TTS. It greatly improves the decoding speed (0.6 ms per frame, on CPU with a single thread) and generates speech of comparable quality to the autoregressive Transformer.

  • A reduction factor introduced in [Wang2017] was also effective for Transformer. It can greatly reduce training and inference time but slightly degrades the quality.

As future work, we need further investigation of the trade off between training speed and quality, and the introduction of ASR techniques (e.g., data augmentation, speech enhancement) for TTS.

Figure 6: Samples of mel-spectrograms on M-AILABs. (top) ground-truth, (middle) Tacotron 2 sample, (bottom) Transformer sample. The input text is “E PERCHÈ SUBITO VIENE IN MENTE CHE IDDIO NON PUÒ AVER FATTO UNA COSA INGIUSTA”.
Figure 7: Samples of mel-spectrograms on LJSpeech. (top) ground-truth, (middle) Tacotron 2 sample, (bottom) Transformer sample. The input text is “IS NOT CONSISTENT WITH THE STANDARDS WHICH THE RESPONSIBILITIES OF THE SECRET SERVICE REQUIRE IT TO MEET.”.

11 Summary

We presented a comparative study of Transformer and RNN in speech applications with various corpora, namely ASR (15 monolingual + one multilingual), ST (one corpus), and TTS (two corpora). In our experiments on these tasks, we obtained promising results, including large improvements in many ASR tasks, and explained how we improved our models. We believe that the reproducible recipes, pretrained models and training tips described in this paper will accelerate Transformer research directions in speech applications.

12 References