Voice synthesis systems, also known as Text-To-Speech (TTS) systems, are becoming popular both in standalone form and as an auxiliary feature of other systems. For example, speech synthesis models are common as the audio interface of applications such as Siri (siri), Cortana (cortana) and Alexa (alexa). However, according to dctts, speech synthesis systems that require every processing step to be designed by hand, which we refer to as "traditional" speech synthesis systems, pose challenges that call into question their viability in a general setting, as we detail below.
Traditional speech synthesis systems take text as input and output speech, and generally have two main processing blocks. The first, Linguistic-Prosodic Processing, receives the text and delivers a prosodic-phonetic representation (teixeira2005evaluation) to the second block, Acoustic Processing, which converts the sequence of phonemes and prosodic information into a speech signal.
The Linguistic-Prosodic block starts with a text pre-processing step that converts symbols, numbers, abbreviations and acronyms into full text, followed by some level of linguistic analysis that retrieves grammatical function and accent, as well as prosodic marks. Then, a phonetic transcription (text-to-phoneme) step converts the text into a sequence of phonetic representations. Prosodic structure must then be added to the speech: a duration model defines the length of each speech segment and inserts pauses between speech chunks (teixeira2003segmental), and a Fundamental Frequency model determines the fundamental frequency (F0) contour (teixeira2003prediction).
The Acoustic Processing block receives the phoneme sequence, the prosodic information about segment durations and the F0 contour, and computes the speech signal. Several models have been used to implement this block, such as the classical formant model (klatt1980software), the Linear Prediction Coefficients (LPC) model, and the Pitch Synchronous Overlap and Add (PSOLA) models (charpentier1986diphone), which are widely used in TTS engines such as the Microsoft Speech API. In addition, Hidden Markov Model (HMM) based synthesis is still under investigation (tokuda2000speech), as is a variety of Unit Selection models (wang2011automatic).
Deep Learning (deeplrgoodfellow) makes it possible to integrate all processing steps into a single model that connects the input text directly to the synthesized audio output, which is referred to as end-to-end learning. While neural models are sometimes criticized as difficult to interpret, several end-to-end trained speech synthesis systems (tacotron; tacotron2; dctts; deepvoice3; kyle2017char2wav) have been shown to estimate spectrograms from text inputs with promising performance. Due to the sequential nature of text and audio data, recurrent units were the standard building blocks for speech synthesis, as in Tacotron 1 and 2 (tacotron; tacotron2). More recently, convolutional layers have shown good performance at reduced computational cost, as implemented in the DeepVoice3 and DCTTS (Deep Convolutional Text To Speech) methods (dctts; deepvoice3).
However, these models were designed for the English language (tacotron2; deepvoice3). Although English and Brazilian Portuguese are both fusional languages, Brazilian Portuguese is a morphologically-rich language (MRL). MRLs express information concerning the grammatical function of a word and its grammatical relation to other words at the word level, via inflectional affixes and pronominal clitics (tsarfaty-etal-2013-parsing); as a consequence, MRLs exhibit high type-to-token ratios, i.e. more lexical diversity. Many applications would thus benefit from a better understanding of how Deep Learning models behave on Portuguese and of which architecture and training strategy are most adequate for it.
Therefore, in this paper we compare several TTS models from the literature, considering both performance and synthesis quality. The experiments were carried out on the Portuguese language and were based on single-speaker TTS. For that, we created a new public dataset comprising 10.5 hours of audio. Our contributions are two-fold: (i) a thorough and detailed experimental analysis considering different models, vocoders, phoneme tools and preprocessing, and investigating transfer learning from English to Portuguese; (ii) a novel public dataset with over 10 hours of recorded speech in Portuguese.
Our results and discussions shed light on the training of end-to-end methods for a non-English language, in particular Portuguese, and we make available the first public dataset and trained model for this language.
2 Voice Synthesis Approaches
With the advent of deep learning, speech synthesis systems have evolved greatly and are still being intensively studied. Models based on Recurrent Neural Networks such as Tacotron (tacotron), Tacotron 2 (tacotron2), Deep Voice 1 (deepvoice) and Deep Voice 2 (deepvoice2) have gained prominence, but because they use recurrent layers they have a high computational cost. This has led to the development of fully convolutional models such as DCTTS (dctts) and Deep Voice 3 (deepvoice3), which sought to reduce computational cost while maintaining good synthesis quality.
deepvoice3 proposed a fully convolutional model for speech synthesis and compared three different vocoders: Griffin-Lim (griffin1984signal), the WORLD vocoder (world) and WaveNet (WaveNet). The results indicated that the WaveNet neural vocoder provided the most natural waveform synthesis. However, the WORLD vocoder was preferred, even though WaveNet produces more natural audio, because of its shorter runtime. The authors further compared the proposed model (Deep Voice 3) with the Tacotron (tacotron) and Deep Voice 2 (deepvoice2) models.
dctts proposed the DCTTS model, a fully convolutional model consisting of two neural networks. The first, called Text2Mel (text to Mel spectrogram), generates a Mel spectrogram from an input text. The second, the Spectrogram Super-resolution Network (SSRN), converts a Mel spectrogram into the STFT (Short-time Fourier Transform) spectrogram (benesty2011speech). DCTTS consists only of convolutional layers and uses dilated convolutions (yu2015multi; kalchbrenner2016neural) to take long-range contextual information into account. DCTTS uses the RTISI-LA (Real-Time Iterative Spectrogram Inversion with Look-Ahead) vocoder (zhu2007real), an adaptation of the Griffin-Lim vocoder (griffin1984signal) that increases synthesis speed at a slight cost in the quality of the generated audio.
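To illustrate why dilated convolutions capture long context with few layers, the following minimal sketch computes the receptive field of a stack of 1-D convolutions and applies a single dilated layer to an impulse. The kernel size of 3 and the dilation pattern 1, 3, 9, 27 in the usage note are illustrative assumptions, not a description of the exact DCTTS configuration, and the function names are hypothetical.

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Number of input time steps seen by one output of a stack of 1-D convolutions."""
    # Each layer widens the field by (kernel_size - 1) * dilation time steps.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

def dilated_conv1d(x, weights, dilation):
    """'Same'-padded 1-D cross-correlation whose taps are `dilation` steps apart."""
    k = len(weights)
    assert k % 2 == 1, "this sketch assumes an odd kernel size"
    pad = (k - 1) * dilation // 2
    xp = np.pad(x, pad)  # zero padding on both sides
    return np.array([
        sum(weights[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])
```

For example, `receptive_field(3, [1, 3, 9, 27])` covers far more time steps than `receptive_field(3, [1, 1, 1, 1])` with the same number of layers and parameters, which is the property the dilated layers exploit.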
Tacotron 1 (tacotron) proposes the use of a single deep neural network trained end-to-end. The model includes an encoder and a decoder, uses an attention mechanism (bahdanau2014neural), and also includes a post-processing module. It uses convolutional filters, skip connections (srivastava2015highway), and Gated Recurrent Units (GRUs) (chung2014empirical). Tacotron also uses the Griffin-Lim algorithm (griffin1984signal) to convert the STFT spectrogram into a waveform.
Tacotron 2 (tacotron2) combines Tacotron 1 with a modified WaveNet vocoder (wavenetmodificado). It is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to Mel spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. The authors also demonstrated that using Mel spectrograms as the conditioning input for WaveNet, instead of linguistic features, allows a significant reduction in the size of the WaveNet architecture. Mozilla TTS (ttsmozilla) is an implementation of both Tacotron 1 and 2 that differs in some aspects of the training process, such as the use of weight decay and phonetic transcriptions.
To explore voice synthesis in the target language, it was first necessary to create an audio dataset for Brazilian Portuguese. There are some public speech databases for European Portuguese (teixeira2001phonetic), but they contain a small amount of speech, approximately 100 minutes, which prevents their use for training deep learning models. In addition, the differences in spoken Brazilian Portuguese require a dataset in Brazilian Portuguese.
For Brazilian Portuguese there are few publicly available resources and no open source datasets for voice synthesis. Our dataset is described in Section 3.1 and we call it the TTS-Portuguese Corpus (official repository: https://github.com/Edresson/TTS-Portuguese-Corpus).
The experiments are described in Section 3.2, in which we explore voice synthesis using prominent models from the literature. We chose DCTTS (dctts), Tacotron 1 (tacotron) and Mozilla TTS (ttsmozilla). We also explored two different vocoders, Griffin-Lim/RTISI-LA (griffin1984signal; zhu2007real) and WaveRNN (wavernn).
3.1 TTS-Portuguese Corpus
To create the audio dataset, public domain texts were used. Texts were extracted from Wikipedia articles displayed in the Highlights section, and from the Chatterbot-corpus (GitHub: https://github.com/gunthercox/chatterbot-corpus/tree/master/chatterbot_corpus/data/portuguese), a corpus originally created for the construction of chatbots. In addition, we used 20 sets of phonetically balanced phrases, each set containing 10 phrases, proposed by seara1994estudo. The total number of words is , with distinct words.
The audio dataset developed in this work has 10 hours and 28 minutes of speech from a single speaker, recorded at a 48 kHz sampling frequency with 32-bit resolution, and contains a total of 3,632 audio files in Wave format. Audio files range in length from 0.67 to 50.08 seconds and contain a single sentence each.
Since the audio was not recorded in an acoustic studio, noise is present in some files, so we chose to use a noise suppression library, RNNoise (valin2017hybrid). It is based on Recurrent Neural Networks, more specifically the Gated Recurrent Unit (cho2014learning), and has demonstrated good performance for noise suppression. The audio dataset is open source and publicly available under the Creative Commons Attribution 4.0 (CC BY 4.0) license (official repository: https://github.com/Edresson/TTS-Portuguese-Corpus).
Experiments over this corpus can be performed as follows. The original written texts or their phonetically transcribed versions are used as input, while the audio spectrograms are the expected output.
Here, we compare the models DCTTS, Tacotron 1 and Mozilla TTS. To keep the results reproducible, we used open source model implementations and tried to replicate related works as faithfully as possible. However, some works do not fully specify the hyper-parameters used, so we had to estimate good hyper-parameters for some models in order to get reasonable results on our dataset. As an example, the authors of the DCTTS paper do not specify how normalization was applied in their model (dctts; opendctts). Furthermore, we used a voice dataset specially designed for the Portuguese language. Such a dataset is small when compared with the databases traditionally used for English.
We used the following implementations: DCTTS provided by opendctts; WaveRNN provided by opendwavernn; Universal WaveRNN (lorenzo2018towards) provided by universalwavernn; Tacotron 1 provided by opentacotron; and Mozilla TTS, based on Tacotron 2, provided by ttsmozilla. The proposed experiments are grouped according to similarity. The experiments and groups are listed below.
Group 1: this experiment group replicates the implementation of the DCTTS model, training the model for the Portuguese language with the TTS-Portuguese Corpus in its original text form (without phonetic transcriptions). Experiment 1.1 did not use noise suppression. Experiment 1.2 used the RNNoise library, described in Section 3.1. Experiment 1.3 is similar to 1.1, but used transfer learning from a model (https://github.com/Kyubyong/dc_tts) pre-trained on the LJ Speech Corpus (https://keithito.com/LJ-Speech-Dataset/) for 800 thousand steps. Experiment 1.4 is similar to 1.3, but used the RNNoise library. The default vocoder, RTISI-LA, was used in all experiments in this group. Although Group 1 experiments replicate the DCTTS model (dctts), the model only converged after we tested different normalization options and settled on 5% dropout and layer normalization (ba2016layer) in all layers. We did not use a fixed learning rate as described in the original article. Instead, we used a starting learning rate of 0.001, decayed using Noam’s learning rate decay scheme (vaswani2017attention). Some adjustments to the texts were made during transfer learning in order to map characters of the Portuguese language, e.g. “á”, to the character range allowed by the pre-trained DCTTS model.
Group 2: this experiment group is similar to group 1, but phonetic transcriptions are used as input instead of the original texts, either without noise suppression (Experiment 2.1) or with it (Experiment 2.2). Petrus 2.0 (PhonEtic TRanscriber for User Support) (serrani2015ambiente), a grapheme-to-phoneme (G2P) conversion system, was used to convert texts to the International Phonetic Alphabet (IPA) format.
Group 3: this experiment group explores a modification of the original DCTTS model in which the WaveRNN (wavernn) and Universal WaveRNN (universalwavernn) vocoders are used instead of the default (RTISI-LA). To this end, the SSRN (Spectrogram Super-resolution Network) module is changed to predict a Mel spectrogram rather than the full STFT spectrogram. In Experiment 3.1, this Mel spectrogram is fed to the WaveRNN vocoder and no noise suppression is applied. Experiment 3.2 is similar, but uses RNNoise as a denoiser. Experiments 3.3 and 3.4 are similar to 3.1 and 3.2, respectively, but change the vocoder to Universal WaveRNN.
Group 4: this experiment group applied phonetic transcriptions in a similar way to group 2, but using the tool Phonemizer (https://github.com/bootphon/phonemizer/) for phonetic transcriptions instead of Petrus 2.0. Both vocoders, WaveRNN (4.1 and 4.2) and Universal WaveRNN (4.3 and 4.4), were analyzed, either without (4.1 and 4.3) or with (4.2 and 4.4) denoising. In this group, the DCTTS model was also modified to synthesize a Mel spectrogram instead of the STFT spectrogram.
Group 5: this experiment group is based on Tacotron 1 (tacotron; opentacotron). Experiment 5.1 did not converge well in our tests, even when layer normalization was used. Therefore, Experiment 5.2 explores the use of transfer learning to facilitate the training process. Experiment 5.3 is similar, but uses noise suppression. Experiments 5.2 and 5.3 used an English model (https://github.com/Kyubyong/tacotron) pre-trained for 200 thousand steps on the LJ Speech Corpus.
Group 6: this experiment group explores the use of Mozilla TTS either without (6.1 and 6.3) or with (6.2 and 6.4) denoising. Experiments 6.3 and 6.4 also used transfer learning from a model (https://github.com/mozilla/TTS/tree/ljspeech-tacotron-iter-185K) pre-trained for the English language for 185 thousand steps on the LJ Speech Corpus. In order to illustrate the speed-up provided by transfer learning, these experiments use fewer training steps than the others.
Group 7: this experiment group explores the use of the Mozilla TTS model with two vocoders: the RTISI-LA vocoder (7.1 and 7.2) and the Universal WaveRNN vocoder (universalwavernn) (7.3 and 7.4). Both vocoders were tested with (7.2 and 7.4) and without (7.1 and 7.3) denoising. It is important to note that 7.1 and 7.2 are similar to 6.3 and 6.4; however, their purpose is to show the full capacity of transfer learning, so the model is trained for more steps.
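The learning rate schedule mentioned for Group 1 (Noam's decay scheme with a learning rate of 0.001) can be sketched as below. This is a minimal illustration, not the exact code of the implementations used: the scaling is chosen so that the rate peaks at `base_lr` when `step == warmup_steps`, and the value `warmup_steps=4000` is an assumption, since the text does not state the warm-up length.

```python
def noam_learning_rate(step, base_lr=0.001, warmup_steps=4000):
    """Noam-style schedule: linear warm-up followed by inverse square root decay.

    `warmup_steps` is an assumed value; the schedule reaches `base_lr`
    exactly at `step == warmup_steps` and decays afterwards.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return base_lr * warmup_steps ** 0.5 * min(
        step * warmup_steps ** -1.5,  # warm-up branch (grows linearly)
        step ** -0.5,                 # decay branch (shrinks as 1/sqrt(step))
    )
```

Under these assumptions, the rate grows linearly during warm-up and is halved every time the step count quadruples afterwards.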
Table 1 summarizes the proposed experiments, detailing the model used, the vocoder, the tool used for phonetic transcription, whether the experiment uses transfer learning from the English language, and whether it uses the RNNoise noise suppression library.
|Experiment||Model||Vocoder||Phoneme Tools||Transfer Learning||RNNoise|
|7.3||Mozilla TTS||Universal WaveRNN||Phonemizer||Yes||No|
|7.4||Mozilla TTS||Universal WaveRNN||Phonemizer||Yes||Yes|
The experiments in groups 1, 2 and 4 use the same loss functions for training. Two parts of the model are trained separately. The first part is the Text2Mel model, whose loss is composed of three functions: binary cross-entropy, L1 (deeplrgoodfellow) and the guided attention loss (dctts). The second part is responsible for transforming a Mel spectrogram into the complete STFT spectrogram, applying super-resolution in the process. Its loss is composed of the L1 and binary cross-entropy functions. In the experiments in group 3, the same loss functions used in groups 1, 2 and 4 are used for Text2Mel and SSRN. However, the WaveRNN vocoder uses the Gaussian loss function (ping2018clarinet), which allows negative values as outputs.
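The three-term Text2Mel loss described above can be sketched in NumPy as follows. This is a hedged illustration rather than the code of the implementation used: it assumes spectrogram values normalized to [0, 1], an equal weighting of the three terms, and the guided attention penalty 1 - exp(-(n/N - t/T)^2 / (2 g^2)) with g = 0.2; all function names are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(pred - target)))

def binary_cross_entropy(pred, target, eps=1e-7):
    """Element-wise binary cross-entropy; assumes values in [0, 1]."""
    p = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def guided_attention_loss(attention, g=0.2):
    """Penalize attention mass far from the diagonal.

    attention: array of shape (N, T), N text positions by T Mel frames.
    g: assumed width of the tolerated band around the diagonal.
    """
    n_text, n_frames = attention.shape
    n = np.arange(n_text)[:, None] / n_text
    t = np.arange(n_frames)[None, :] / n_frames
    penalty = 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))
    return float(np.mean(attention * penalty))

def text2mel_loss(mel_pred, mel_target, attention):
    """Sum of the three terms with equal weights (an assumption of this sketch)."""
    return (l1_loss(mel_pred, mel_target)
            + binary_cross_entropy(mel_pred, mel_target)
            + guided_attention_loss(attention))
```

The guided attention term is zero for a perfectly diagonal alignment and grows as attention mass moves away from the diagonal, which encourages the roughly monotonic text-to-audio alignment expected in TTS.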
In groups 5, 6 and 7, no guided attention is used, so the loss function does not include an attention term. Since the network is trained end-to-end, the loss depends on the output of two network modules. The first module converts text into a Mel spectrogram. The second module is an SSRN-like module called CBHG (1-D Convolution Bank, Highway network, bidirectional Gated Recurrent Unit).
For the WaveRNN model, we used the raw audio as output and, as loss, the Gaussian loss proposed for the ClariNet vocoder (ping2018clarinet), where this approach provided an improvement over other techniques. For Universal WaveRNN, the authors opted to use as loss the discretized mixture of logistic distributions introduced by salimans2017pixelcnn, as applied in the speech synthesis work of oord2017parallel.
Table 2 shows the hardware specifications of the equipment used for model training. The experiments belonging to groups 1, 2 and 4 were trained on computer 2 while the other experiments were performed using computer 1.
|Specifications||Computer 1||Computer 2|
|RAM memory||16 GB||32 GB|
|Video Card||Nvidia GeForce Gtx Titan V||Nvidia GeForce Gtx 1080 TI|
|Operating system||Ubuntu 18.04||Windows 10|
Table 3 presents the training data of the experiments. Some experiments are grouped since they use the same model across a group. The metrics presented in the table are the number of training steps, the time required for training, and the loss at the end of training. The number of steps varies even among experiments that use the same model, because each experiment or experiment group converged at a different rate. For example, Group 2 requires fewer steps because the use of phonetic transcriptions led to faster convergence. It is important to note that groups 1 through 4 are trained in two phases, both reported in the table: Mel spectrogram and full STFT spectrogram.
|Experiment Group||Training steps||Time||Loss|
|Experiment 1.1 and 1.2 (Text2Mel/SSRN)||2115k/2019k||4d19h/5d22h||0.5134/0.4537|
|Experiment 1.3 and 1.4 (Text2Mel/SSRN)||1387k/2019k||3d4h /5d22h||0.52245/0.4537|
|Group 2 (Text2Mel/SSRN)||1734k/2019k||4d20h/5d22h||0.4975/0.4537|
|Experiment 3.1 and 3.2 (Text2Mel/SSRN/||2653k/2688k/||5d14h/6d3h/||0.51/0.529/|
|Experiment 5.2 and 5.3||57k||23h||0.0905|
|Experiment 6.1 and 6.2||135k||6d2h||0.1017|
|Experiment 6.3 and 6.4||70k||2d22h||0.1012|
Different losses are used, making it difficult to directly compare models. However, most of the experiments from the same group can be compared, as the same loss is used in each experiment group, with the exception of Group 3, which used different losses for WaveRNN and Universal WaveRNN.
4 Results and Discussion
In order to analyze the results, two model evaluation schemes were used: a preliminary analysis, described in Section 4.1, using three loss-related measures; and a Mean Opinion Score (MOS) analysis (mos), described in Section 4.2, performed on the models that presented the best (or the most interesting) results in the preliminary analysis. For both analyses, we synthesized sentences extracted from two phonetically balanced sets (with 10 sentences each) taken from the work of seara1994estudo. Models, synthesized audio, corpus and an interactive demo are publicly available (link omitted due to blind review).
4.1 Preliminary Analysis
The generated audio is evaluated by comparing synthesized spectrograms with reference spectrograms according to a given measure. For the preliminary evaluation, we used the L1 loss (deeplrgoodfellow), the Root Mean Square Error (RMSE) (tamamori2017speaker) and the Dynamic Time Warping (DTW) distance (muller2007dynamic). It may be difficult to compare two audio files containing the same sentence, as they may not align perfectly: the sentence may start at different times in each file and/or be spoken at different rates (keogh2001derivative). It is important to note that RMSE and L1 do not account for temporal alignment. DTW, on the other hand, is a technique that efficiently aligns the sequences of predicted and expected audio spectrograms.
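The difference between the frame-wise measures and DTW can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: spectrograms are arrays of shape (frames, frequency bins), L1 and RMSE crop both inputs to the shorter length (the handling of unequal lengths is not specified in the text), and DTW uses the Euclidean distance between frames with the classic step pattern. The function names are hypothetical.

```python
import numpy as np

def _crop(a, b):
    """Crop two spectrograms to their common number of frames (an assumption)."""
    n = min(len(a), len(b))
    return a[:n], b[:n]

def l1_distance(a, b):
    a, b = _crop(a, b)
    return float(np.mean(np.abs(a - b)))

def rmse(a, b):
    a, b = _crop(a, b)
    return float(np.sqrt(np.mean((a - b) ** 2)))

def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment between two frame sequences."""
    ta, tb = len(a), len(b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[ta, tb])
```

If a synthesized spectrogram equals the reference except for one repeated initial frame, L1 and RMSE report a large error because every later frame is compared with the wrong neighbor, whereas DTW absorbs the shift and reports zero cost.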
Table 4 presents the three measures for each proposed experiment including a ranking for each model. The evaluation was performed comparing the STFT spectrograms extracted from the synthesized audio and from the reference audio, using the same sentences for synthesis and evaluation.
|Experiment||L1 (rank)||RMSE (rank)||DTW (rank)|
|1.1||0.2299 (05)||0.2889 (05)||49.8286 (06)|
|1.2||0.2307 (06)||0.2904 (08)||49.5054 (04)|
|1.3||0.2692 (14)||0.3400 (16)||56.4664 (08)|
|1.4||0.2673 (13)||0.3368 (15)||56.6633 (09)|
|2.1||0.2317 (08)||0.2909 (09)||58.3463 (11)|
|2.2||0.2311 (07)||0.2902 (07)||58.0234 (10)|
|3.1||0.4218 (23)||0.4746 (23)||48.9505 (03)|
|3.2||0.2955 (17)||0.3486 (17)||124.648 (23)|
|3.3||0.3729 (20)||0.4189 (20)||62.8290 (13)|
|3.4||0.1709 (03)||0.2170 (03)||74.5801 (17)|
|4.1||0.3735 (21)||0.4208 (21)||62.1867 (12)|
|4.2||0.1840 (04)||0.2099 (02)||89.6465 (22)|
|5.1||0.4166 (22)||0.4632 (22)||48.6379 (02)|
|5.2||0.2372 (09)||0.2890 (06)||71.2294 (15)|
|5.3||0.2387 (10)||0.2935 (10)||41.3604 (01)|
|6.1||0.2745 (16)||0.3334 (13)||71.7627 (16)|
|6.2||0.2741 (15)||0.3335 (14)||70.8759 (14)|
|6.3||0.3104 (18)||0.3724 (18)||81.6769 (21)|
|6.4||0.3112 (19)||0.3736 (19)||81.5592 (20)|
|7.1||0.2438 (11)||0.2966 (11)||76.7543 (18)|
|7.2||0.2447 (12)||0.2976 (12)||77.2355 (19)|
|7.3||0.1699 (02)||0.2189 (04)||49.5566 (05)|
|7.4||0.1620 (01)||0.2058 (01)||50.1933 (07)|
According to the L1 measure, the best experiments were 7.4, 7.3, 3.4 and 4.2, showing a slight preference for Mozilla TTS. The first four positions were taken by Mozilla TTS and DCTTS models, with a Mozilla TTS variant (7.4) performing slightly better than the others. These two models performed reasonably better than Tacotron 1, whose best experiment is 5.2. RMSE results were similar to those of L1, favoring, in order, 7.4, 4.2, 3.4 and 7.3. Again, the two models performed better than Tacotron 1 (5.2 is sixth in the ranking). The DTW measure, on the other hand, favors Tacotron models, as the two best results are 5.3 and 5.1. DCTTS is also well evaluated by this measure, with the third and fourth best results (3.1 and 1.2). There are fewer experiments using Tacotron 1 than the other two models because of difficulties in making Tacotron 1 converge on our dataset. This preliminary result suggests that Tacotron 1 is not as good as the other two in the limited-size dataset scenario. Indeed, although we tested other variants of the original Tacotron 1 experiments, only 5.1 to 5.3 are reported, since they were the only ones that converged reasonably.
The natural question that arises from these results is whether this measure (L1) correctly reflects the human-perceived quality of the synthesized audio. The audio files were therefore manually validated by a human. During the validation, we observed that none of these measures fully reflects the human perception of synthesis quality. However, L1 and RMSE were considered close enough in this analysis, while DTW was not ideal for this task, as it tends to give better scores to perceptually worse audio. Thus, L1 was used to select the best (or the most interesting) experiments for a detailed MOS-based evaluation, as discussed in Section 4.2.
The preliminary analysis was also useful to investigate the role of the vocoder, the effectiveness of transfer learning, and the impact of denoising and phonetic transcriptions. Regarding the vocoders, Universal WaveRNN performed best according to L1, as its variants occupy the first three positions in the ranking. RTISI-LA also performed well, with models ranked from fourth to ninth. WaveRNN did not perform well in the ranking, but this is explained by the fact that our dataset is relatively small and we did not have access to a pre-trained version of this model, as we did for Universal WaveRNN. This limitation does not affect RTISI-LA, as it is not a neural model.
Regarding transfer learning, both human validation and L1 results showed that models using the technique obtained a boost in performance, which can be seen by comparing Experiment 5.1 (no transfer learning) with 5.2 and 5.3. In fact, Experiment 5.1 did not converge well, as the generated audio is very close to random noise. In group 6, transfer learning reduced the training time, although with a negative effect on L1 results, which suggests that a slight increase in training epochs may be needed in these experiments. Overall, transfer learning allows models to be trained in substantially less time, as can be seen in groups 5 and 6: Experiments 5.2 and 5.3 trained in less than a day, whereas reaching similar performance without it would take several days. Nevertheless, it is possible to train for longer (as in group 7) to maximize the metrics.
Regarding the conversion to phonetic transcriptions, three of the best four experiments used Phonemizer, indicating phonetic transcriptions have a role in reducing the burden of the learning task. Similarly, among the ten best experiments, only three do not use transcriptions.
Finally, regarding denoising, the ranking suggests a medium effect on the best models. Denoising is used in three of the best four, or alternatively, six of the best ten. However, the impact seems bigger for experiments lower positioned in the ranking, since the last four do not apply denoising.
4.2 Mean Opinion Score Analysis
Experiments 7.4, 7.2, 3.4 and 5.2 were chosen according to the L1 results obtained in the preliminary analysis. The first choice, 7.4 (Mozilla TTS), is the best experiment according to L1. The second choice is 7.2 rather than 7.3 in order to investigate a different vocoder (RTISI-LA rather than Universal WaveRNN). Experiment 7.2 was preferred over 7.1 since, like 7.4, it uses denoising, allowing for a fair comparison. The third choice was 3.4, since it presented the best results among the DCTTS models. Finally, the fourth experiment was 5.2, since it has the best L1 among the Tacotron 1 models.
Table 5 presents the MOS values for the selected models and their respective 95% confidence intervals, which can also be seen in Figure 1. For the analysis, 20 phonetically balanced sentences (seara1994estudo) were synthesized for each selected model, resulting in 80 audio files. Additionally, 20 audio files from the original speaker were added as ground truth. Thus, each evaluator analyzed 100 sentences. The MOS was then calculated with 20 annotators, using the methodology described by mos.
|Experiment||MOS ± 95% confidence interval (rank)|
|Ground truth||4.7125 ±0.167696 (-)|
|3.4||3.0325 ±0.343858 (3)|
|5.2||2.7850 ±0.334755 (4)|
|7.2||4.0250 ±0.278179 (1)|
|7.4||3.4675 ±0.335458 (2)|
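For readers who want to reproduce this kind of summary, a MOS and its confidence interval can be computed from a list of individual ratings as sketched below. The text does not state how the intervals were obtained, so the normal approximation (z = 1.96 for a 95% interval) is an assumption of this sketch, and the function name is hypothetical.

```python
import math
import statistics

def mos_with_confidence(scores, z=1.96):
    """Return (mean opinion score, half-width of the confidence interval).

    scores: individual ratings on the 1-5 scale.
    z: normal quantile; 1.96 gives an approximate 95% interval (assumed method).
    """
    mean = statistics.fmean(scores)
    sample_sd = statistics.stdev(scores)  # sample standard deviation (n - 1)
    half_width = z * sample_sd / math.sqrt(len(scores))
    return mean, half_width
```

For instance, `mos_with_confidence([4, 5, 4, 3, 4])` returns the mean rating together with the value to report after the ± sign.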
The results of the main analysis indicate that Experiment 7.2 (Mozilla TTS) presented the best MOS value (4.02). This experiment uses the RTISI-LA vocoder and phonetic transcriptions (Phonemizer), with transfer learning and denoising. According to mos, the obtained value indicates good quality audio, with a just perceptible, but not annoying, distortion.
Next, Experiment 7.4 ranks second (MOS 3.46). This model is very similar to 7.2, but uses Universal WaveRNN as the vocoder. The results suggest RTISI-LA as the preferred vocoder in the studied scenario. The observed MOS indicates a perceptible and slightly annoying distortion in the audio.
Experiment 3.4 (MOS 3.03) is third in the ranking, falling in the same category as 7.4 according to mos. Finally, Experiment 5.2 obtained a MOS of 2.785, which indicates annoying, but not unpleasant, audio.
It is possible to compare our results with related works. In their best experiment, ttspropor reached 3.82 ±0.69 MOS when training Tacotron 2 on female voices for European Portuguese, using a closed dataset with approximately 14 hours of speech. As Mozilla TTS is closely related to Tacotron 2, this result is comparable to our 4.02 ±0.29 result in Experiment 7.2. Regarding English, the Tacotron 2 work (tacotron2) can also be compared to Experiment 7.2. The authors trained their model on a US English dataset (24.6 hours), reaching a MOS of 4.52, while we obtained a close result (4.02) with a dataset less than half the size (10.5 hours).
Similarly, our Tacotron 1 model reached a MOS of 2.78, while the work by tacotron reached 3.82, also using a US English dataset. The difference is larger here, but can possibly be explained by the difference in the datasets.
Finally, regarding DCTTS, we reached a MOS of 3.03, while the original authors (dctts) obtained 2.71 on a larger dataset, the English LJ Speech Dataset (approximately 24 hours) (ito2017lj). These results suggest that our dataset has good quality for TTS training.
This work presented an open dataset as well as the training of four speech synthesizer models based on deep learning, applied to the Brazilian Portuguese language. The dataset is publicly available and contains approximately 10.5 hours of speech. Besides the models, we also tested other important parameters, such as three vocoders, phonetic transcription dictionaries/systems, transfer learning (English to Portuguese) and denoising.
We found that it is possible to train a good quality voice synthesizer for Portuguese using our dataset, reaching a MOS of 4.02. In the studied scenario, the best experiment is based on Mozilla TTS and the RTISI-LA vocoder. It uses phonetic transcriptions (using Phonemizer), transfer learning and denoising.
Overall, transfer learning from English allowed fast convergence in training. Denoising and phonetic transcriptions were found useful for improving the results. The results we obtained are comparable to other works in the literature, even though our dataset is less than half the size of the datasets usually used for the English language.
To the best of our knowledge, this is the first publicly available dataset for the language. Similarly, the trained models are a contribution to the Portuguese language, which has few open access models based on deep learning.
We would like to thank the Coordination for the Improvement of Higher Education Personnel (Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, CAPES, https://www.capes.gov.br/) and Itaipu Technological Park (Parque Tecnológico Itaipu, PTI, https://www.pti.org.br/) for financial support, especially from the Latin American Center of Open Technologies (Centro Latino-Americano de Tecnologias Abertas, CELTAB). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used in part of the experiments presented in this research.